[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files

Thu Jun 1 22:03:43 UTC 2006

Hi All,

I'm new to the BioJava(X) API and was hoping to get some
clarification on several issues that I'm having.
I am developing a parser that takes as input "NCBI Incremental
ASN.1 Sequence Updates to Genbank" files (
ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ), gunzips them, and uses the
ASN2GB converter (
ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb , man page:
http://www.penguin-soft.com/penguin/man/1/asn2gb.html ) to convert the
resulting sequences to a format parsable by BioJava(X). This is where
my problems start.
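
For reference, the gunzip step of the pipeline described above can be
done with the Java standard library alone. This is only a minimal
sketch: the class name GunzipSketch and the command-line file arguments
are hypothetical, and it covers decompression only, not the ASN2GB
conversion or the BioJava(X) parsing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

final class GunzipSketch {

    /**
     * Decompresses the gzipped file at source into target, replacing
     * target if it already exists. Returns the number of bytes written.
     */
    static long gunzip(Path source, Path target) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(source))) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: GunzipSketch <input.gz> <output>");
            return;
        }
        // Both paths are supplied by the caller; no file names are assumed.
        long written = gunzip(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Wrote " + written + " bytes to " + args[1]);
    }
}
```

The decompressed file is still binary ASN.1, so it has to go through
ASN2GB before any of the readers below can be tried on it.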

ISSUE #1:
I've tried to parse all of the formats that ASN2GB outputs ( GenBank
(default), EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
tiny seq (XML) ) using either the BioJava or the BioJavaX API.  Only
the GenBank format is recognized, by the
"RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function, and
even that has some exceptions that I'll describe in issue #2.  This is
the code that I'm using to parse, for example, the EMBL output:

BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
        SimpleNamespace.class, new Object[]{"gbSpace"});
RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        // Further processing of the RichSequence object from here
    } catch (BioException be) {
        be.printStackTrace();
    }
}
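
When a loop like the one above fails, the top-level exception message
can be generic while a nested cause says more. As a hedged
illustration using only java.lang and java.util (the class and method
names are hypothetical, and the messages in main are only sample
strings), the cause chain can be collected for logging like this:

```java
import java.util.ArrayList;
import java.util.List;

final class CauseChainSketch {

    /**
     * Returns one "ClassName: message" entry per throwable in the cause
     * chain, outermost first. The depth cap guards against cyclic chains.
     */
    static List<String> describeCauses(Throwable t) {
        List<String> entries = new ArrayList<>();
        for (Throwable cur = t; cur != null && entries.size() < 32; cur = cur.getCause()) {
            entries.add(cur.getClass().getName() + ": " + cur.getMessage());
        }
        return entries;
    }

    public static void main(String[] args) {
        // Sample exceptions standing in for whatever a reader might throw.
        Exception root = new IllegalArgumentException("Authors string cannot be null");
        Exception wrapped = new Exception("Could not read sequence", root);
        for (String entry : describeCauses(wrapped)) {
            System.out.println(entry);
        }
    }
}
```

Calling describeCauses(be) inside the catch block would give the same
information as printStackTrace(), just without the stack frames.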

The multi-sequence EMBL file looks like this:
---------------------------------------------------------------------------------
ID   DQ472184  standard; DNA; INV; 546 BP.
XX
AC   DQ472184;
XX
SV   DQ472184.1
DT   15-MAY-2006
XX
DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
DE   complete cds.
XX
KW   .
XX
OS   Trypanosoma cruzi strain CL Brener
OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC   Schizotrypanum.
XX
RN   [1]
RP   1-546
RA   De Melo L.D.B.;
RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL   Unpublished.
XX
RN   [2]
RP   1-546
RA   De Melo L.D.B.;
RT   ;
RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL   21949-900, Brazil
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..546
FT                   /organism="Trypanosoma cruzi strain CL Brener"
FT                   /mol_type="genomic DNA"
FT                   /strain="CL Brener"
FT                   /db_xref="taxon:353153"
FT   gene            <1..>546
FT                   /gene="ARC21"
FT                   /note="TcARC21"
FT   mRNA            <1..>546
FT                   /gene="ARC21"
FT                   /product="actin-related protein 3"
FT   CDS             1..546
FT                   /gene="ARC21"
FT                   /note="actin-binding protein; ARPC3 21 kDa; putative
FT                   member of Arp2/3 complex"
FT                   /codon_start=1
FT                   /product="actin-related protein 3"
FT                   /protein_id="ABF13401.1"
FT                   /db_xref="GI:93360014"
FT                   /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
FT                   EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
FT                   SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
FT                   FPEKDGTGNKFWMAFAKRPFLASS"
     atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg        60
     cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt       120
     gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc       180
     cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg       240
     acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat       300
     tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg       360
     tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca       420
     aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag       480
     aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct       540
     agttag                                                                  546
//
ID   DQ472185  standard; DNA; INV; 543 BP.
XX
AC   DQ472185;
XX
SV   DQ472185.1
DT   15-MAY-2006
XX
DE   Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
DE   complete cds.
XX
KW   .
XX
OS   Trypanosoma cruzi strain CL Brener
OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC   Schizotrypanum.
XX
RN   [1]
RP   1-543
RA   De Melo L.D.B.;
RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL   Unpublished.
XX
RN   [2]
RP   1-543
RA   De Melo L.D.B.;
RT   ;
RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL   21949-900, Brazil
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..543
FT                   /organism="Trypanosoma cruzi strain CL Brener"
FT                   /mol_type="genomic DNA"
FT                   /strain="CL Brener"
FT                   /db_xref="taxon:353153"
FT   gene            <1..>543
FT                   /gene="ARC20"
FT                   /note="TcARC20"
FT   mRNA            <1..>543
FT                   /gene="ARC20"
FT                   /product="actin-related protein 4"
FT   CDS             1..543
FT                   /gene="ARC20"
FT                   /note="actin-binding protein; ARPC4 20 kDa; putative
FT                   member of Arp2/3 complex"
FT                   /codon_start=1
FT                   /product="actin-related protein 4"
FT                   /protein_id="ABF13402.1"
FT                   /db_xref="GI:93360016"
FT                   /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
FT                   LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
FT                   GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
FT                   MKLNVNQRARRAAMEFFLALNFT"
     atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg        60
     tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt       120
     gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata       180
     cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc       240
     atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt       300
     ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga       360
     tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt       420
     attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg       480
     aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca       540
     tga                                                                     543
//
-----------------------------------------------------------------------
I get an exception message "Could Not Read Sequence".  The same thing
happens if I use the readINSDSetDNA reader instead of the readEMBLDNA
one with the following INSDSet file (beginning of the file):

<?xml version="1.0"?>
<!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "INSD_INSDSeq.dtd">
<INSDSeq>
  <INSDSeq_locus>DQ022078</INSDSeq_locus>
  <INSDSeq_length>16729</INSDSeq_length>
  <INSDSeq_moltype>DNA</INSDSeq_moltype>
  <INSDSeq_topology>linear</INSDSeq_topology>
  <INSDSeq_division>ENV</INSDSeq_division>
  <INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
  <INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
  <INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
(a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
class C (estA3), putative permease (a3.005), putative transmembrane
signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
protein (a3.012), putative membrane protease subunit (a3.013),
putative haloalkane dehalogenase (a3.014), putative transcriptional
regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
hypothetical protein (a3.017) genes, complete cds</INSDSeq_definition>
  <INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
  <INSDSeq_other-seqids>
    <INSDSeqid>gb|DQ022078.1|</INSDSeqid>
    <INSDSeqid>gi|71842722</INSDSeqid>
  </INSDSeq_other-seqids>
  <INSDSeq_keywords>
    <INSDKeyword>ENV</INSDKeyword>
  </INSDSeq_keywords>
  <INSDSeq_references>
    <INSDReference>
      <INSDReference_reference>?</INSDReference_reference>
      <INSDReference_position>1..16729</INSDReference_position>
      <INSDReference_authors>
        <INSDAuthor>Schmeisser,C.</INSDAuthor>
        <INSDAuthor>Elend,C.</INSDAuthor>
        <INSDAuthor>Streit,W.R.</INSDAuthor>
      </INSDReference_authors>
      <INSDReference_title>Isolation and biochemical characterization
of two novel metagenome derived esterases</INSDReference_title>
      <INSDReference_journal>Appl. Environ. Microbiol. 0:0-0
(2006)</INSDReference_journal>
    </INSDReference>
    <INSDReference>
      <INSDReference_reference>?</INSDReference_reference>
      <INSDReference_position>1..16729</INSDReference_position>
      <INSDReference_authors>
        <INSDAuthor>Schmeisser,C.</INSDAuthor>
        <INSDAuthor>Elend,C.</INSDAuthor>
        <INSDAuthor>Streit,W.R.</INSDAuthor>
      </INSDReference_authors>
      <INSDReference_journal>Submitted (29-APR-2005) to the
EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
Germany</INSDReference_journal>
    </INSDReference>
  </INSDSeq_references>

So my question is whether ASN2GB produces output that's incompatible
with the BioJava parsers, whether there is a problem with the
sequences themselves, or whether the problem lies with the majority
of the parsers.  Could it be that I'm using the API incorrectly for
the above formats?  The GenBank parser works as advertised, with the
exceptions described below:

ISSUE #2:
I try to parse GenBank files using the following code:

BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
        SimpleNamespace.class, new Object[]{"gbSpace"});
RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        // Further processing of the RichSequence object from here
    } catch (BioException be) {
        be.printStackTrace();
    }
}

The GenBank file in question:

LOCUS       BC074905                 838 bp    mRNA    linear   PRI 15-APR-2006
DEFINITION  Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
            IMAGE:30915482), complete cds.
ACCESSION   BC074905
VERSION     BC074905.2  GI:50959825
KEYWORDS    MGC.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 838)
  AUTHORS   Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
            Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
            Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
            Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
            Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
            Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
            Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
            Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
            Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
            McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
            Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
            Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
            Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
            Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
            Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
            Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
            Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
            Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
  CONSRTM   Mammalian Gene Collection Program Team
  TITLE     Generation and initial analysis of more than 15,000 full-length
            human and mouse cDNA sequences
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
   PUBMED   12477932
REFERENCE   2  (bases 1 to 838)
  CONSRTM   NIH MGC Project
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2004) National Institutes of Health, Mammalian
            Gene Collection (MGC), Bethesda, MD 20892-2590, USA
  REMARK    NIH-MGC Project URL: http://mgc.nci.nih.gov
COMMENT     On Aug 4, 2004 this sequence version replaced gi:49901832.
            Contact: MGC help desk
            Email: cgapbs-r at mail.nih.gov
            Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
            Center
            cDNA Library Preparation: British Columbia Cancer Research Center
            cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
            DNA Sequencing by: Genome Sequence Centre,
            BC Cancer Agency, Vancouver, BC, Canada
            info at bcgsc.bc.ca
            Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
            Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
            Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
            Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
            Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
            Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
            Kim MacDonald,  Mike R. Mayo, Josh Moran, Diana Palmquist, JR
            Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
            Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.

            Clone distribution: MGC clone distribution information can be found
            through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
            Series: IRBU Plate: 4 Row: C Column: 3.

            Differences found between this sequence and the human reference
            genome (build 36) are described in misc_difference features below.
FEATURES             Location/Qualifiers
     source          1..838
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /clone="MGC:104038 IMAGE:30915482"
                     /tissue_type="Lung, PCR rescued clones"
                     /clone_lib="NIH_MGC_273"
                     /lab_host="DH10B"
                     /note="Vector: pCR4 Topo TA with reversed insert"
     gene            1..838
                     /gene="KLK14"
                     /note="synonym: KLK-L6"
                     /db_xref="GeneID:43847"
                     /db_xref="HGNC:6362"
                     /db_xref="IMGT/GENE-DB:6362"
                     /db_xref="MIM:606135"
     CDS             49..804
                     /gene="KLK14"
                     /codon_start=1
                     /product="KLK14 protein"
                     /protein_id="AAH74905.1"
                     /db_xref="GI:50959826"
                     /db_xref="GeneID:43847"
                     /db_xref="HGNC:6362"
                     /db_xref="IMGT/GENE-DB:6362"
                     /db_xref="MIM:606135"
                     /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
                     GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
                     YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
                     SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
                     SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
     misc_difference 98
                     /gene="KLK14"
                     /note="'G' in cDNA is 'A' in the human genome; amino acid
                     difference: 'R' in cDNA, 'Q' in the human genome."
     misc_difference 133
                     /gene="KLK14"
                     /note="'T' in cDNA is 'C' in the human genome; amino acid
                     difference: 'Y' in cDNA, 'H' in the human genome."
ORIGIN
        1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
       61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
      121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
      181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
      241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
      301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
      361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
      421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
      481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
      541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
      601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
      661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
      721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
      781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
//

I get the following exception:

java.lang.IllegalArgumentException: Authors string cannot be null
org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
        at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
        at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
        at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
        at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)

-----------------------------------------------------------------------

I'm trying to see what could be the problem with this particular
sequence.  It looks to me like the AUTHORS portion is not being parsed
correctly.  Any ideas would be greatly appreciated!
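
One thing visible in the record above is that REFERENCE 2 has a
CONSRTM line but no AUTHORS line. Whether that is what leads to the
"Authors string cannot be null" error is an assumption; the stack
trace does not confirm it. A minimal sketch for finding such
references in a GenBank flat file, using only the Java standard
library (class and method names are hypothetical, and the sample in
main is abbreviated from the record above):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class ReferenceAuthorsCheck {

    /**
     * Scans GenBank flat-file text and returns "LOCUS: REFERENCE line"
     * for every REFERENCE block that has no AUTHORS sub-keyword line.
     */
    static List<String> referencesWithoutAuthors(Reader flatFile) throws IOException {
        List<String> missing = new ArrayList<>();
        String locus = "?";
        String currentRef = null;   // header of the open REFERENCE block, if any
        boolean sawAuthors = false;
        BufferedReader in = new BufferedReader(flatFile);
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            boolean topLevel = !line.isEmpty() && !Character.isWhitespace(line.charAt(0));
            if (topLevel) {
                // Any top-level keyword (or "//") closes the open REFERENCE block.
                if (currentRef != null && !sawAuthors) {
                    missing.add(locus + ": " + currentRef);
                }
                currentRef = line.startsWith("REFERENCE") ? line.trim() : null;
                sawAuthors = false;
                if (line.startsWith("LOCUS")) {
                    String[] parts = line.trim().split("\\s+");
                    locus = parts.length > 1 ? parts[1] : "?";
                }
            } else if (currentRef != null && line.startsWith("  AUTHORS")) {
                sawAuthors = true;
            }
        }
        if (currentRef != null && !sawAuthors) {
            missing.add(locus + ": " + currentRef);
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        String sample =
              "LOCUS       BC074905                 838 bp    mRNA    linear   PRI 15-APR-2006\n"
            + "REFERENCE   1  (bases 1 to 838)\n"
            + "  AUTHORS   Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,\n"
            + "  CONSRTM   Mammalian Gene Collection Program Team\n"
            + "REFERENCE   2  (bases 1 to 838)\n"
            + "  CONSRTM   NIH MGC Project\n"
            + "  TITLE     Direct Submission\n"
            + "COMMENT     On Aug 4, 2004 this sequence version replaced gi:49901832.\n"
            + "//\n";
        for (String ref : referencesWithoutAuthors(new StringReader(sample))) {
            System.out.println(ref);
        }
    }
}
```

Running this over the ASN2GB output would show how many records have
author-less references, which may help narrow down whether the failure
is specific to this sequence.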

-- 
Best Regards,

Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358