[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
Seth Johnson
johnson.biotech at gmail.com
Thu Jun 1 22:03:43 UTC 2006
Hi All,
I'm new to the BioJava(X) API and was hoping to get some
clarification on several issues that I'm having.
I am developing a parser that takes "NCBI Incremental ASN.1 Sequence
Updates to GenBank" files (ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc)
as input, gunzips them, and uses the ASN2GB converter
(ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb; man page:
http://www.penguin-soft.com/penguin/man/1/asn2gb.html) to convert the
resulting sequences to a format parsable by BioJava(X). This is where
my problems start.
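For reference, the gunzip step of this pipeline can be done from Java
itself. The following is only a minimal sketch: the class name
DailyUpdateGunzip and the file paths are hypothetical placeholders, and
the ASN2GB conversion step is deliberately left out, since its options
should be taken from the man page rather than guessed here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper (not part of BioJava): decompresses one gzipped
// daily update file so that it can be handed to the converter.
final class DailyUpdateGunzip {

    /** Decompresses src into dest and returns the number of bytes written. */
    static long gunzip(Path src, Path dest) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(src));
             OutputStream out = Files.newOutputStream(dest)) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            // Copy until the gzip stream is exhausted.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        // Both paths are supplied by the caller; nothing is hard-coded.
        long bytes = gunzip(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Wrote " + bytes + " bytes to " + args[1]);
    }
}
```

The decompressed file is still binary ASN.1, so it has to go through
ASN2GB before any of the BioJava(X) readers can be tried on it.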
ISSUE #1:
I've tried to parse all of the formats that ASN2GB outputs (GenBank
(default), EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
tiny seq (XML)) using either the BioJava or the BioJavaX API. Only the
GenBank format is recognized, by the
"RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function, and
even then with some exceptions that I'll describe in issue #2. This is
the code that I'm using to parse, for example, the EMBL output:
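import java.io.BufferedReader;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojavax.Namespace;
import org.biojavax.RichObjectFactory;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
        SimpleNamespace.class, new Object[]{"gbSpace"});
RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        // Further processing of the RichSequence object from here
    } catch (BioException be) {
        be.printStackTrace();
    }
}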
The multi-sequence EMBL file looks like this:
---------------------------------------------------------------------------------
ID DQ472184 standard; DNA; INV; 546 BP.
XX
AC DQ472184;
XX
SV DQ472184.1
DT 15-MAY-2006
XX
DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
DE complete cds.
XX
KW .
XX
OS Trypanosoma cruzi strain CL Brener
OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC Schizotrypanum.
XX
RN [1]
RP 1-546
RA De Melo L.D.B.;
RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL Unpublished.
XX
RN [2]
RP 1-546
RA De Melo L.D.B.;
RT ;
RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL 21949-900, Brazil
XX
FH Key Location/Qualifiers
FH
FT source 1..546
FT /organism="Trypanosoma cruzi strain CL Brener"
FT /mol_type="genomic DNA"
FT /strain="CL Brener"
FT /db_xref="taxon:353153"
FT gene <1..>546
FT /gene="ARC21"
FT /note="TcARC21"
FT mRNA <1..>546
FT /gene="ARC21"
FT /product="actin-related protein 3"
FT CDS 1..546
FT /gene="ARC21"
FT /note="actin-binding protein; ARPC3 21 kDa; putative
FT member of Arp2/3 complex"
FT /codon_start=1
FT /product="actin-related protein 3"
FT /protein_id="ABF13401.1"
FT /db_xref="GI:93360014"
FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
FT FPEKDGTGNKFWMAFAKRPFLASS"
atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60
cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120
gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180
cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240
acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300
tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360
tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420
aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480
aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540
agttag 546
//
ID DQ472185 standard; DNA; INV; 543 BP.
XX
AC DQ472185;
XX
SV DQ472185.1
DT 15-MAY-2006
XX
DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
DE complete cds.
XX
KW .
XX
OS Trypanosoma cruzi strain CL Brener
OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC Schizotrypanum.
XX
RN [1]
RP 1-543
RA De Melo L.D.B.;
RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL Unpublished.
XX
RN [2]
RP 1-543
RA De Melo L.D.B.;
RT ;
RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL 21949-900, Brazil
XX
FH Key Location/Qualifiers
FH
FT source 1..543
FT /organism="Trypanosoma cruzi strain CL Brener"
FT /mol_type="genomic DNA"
FT /strain="CL Brener"
FT /db_xref="taxon:353153"
FT gene <1..>543
FT /gene="ARC20"
FT /note="TcARC20"
FT mRNA <1..>543
FT /gene="ARC20"
FT /product="actin-related protein 4"
FT CDS 1..543
FT /gene="ARC20"
FT /note="actin-binding protein; ARPC4 20 kDa; putative
FT member of Arp2/3 complex"
FT /codon_start=1
FT /product="actin-related protein 4"
FT /protein_id="ABF13402.1"
FT /db_xref="GI:93360016"
FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
FT MKLNVNQRARRAAMEFFLALNFT"
atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60
tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120
gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180
cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240
atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300
ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360
tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420
attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480
aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
tga 543
//
-----------------------------------------------------------------------
I get the exception message "Could not read sequence". The same thing
happens if I use the readINSDSetDNA reader instead of the readEMBLDNA
one with the following INSDSet file (beginning of the file):
<?xml version="1.0"?>
<!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "INSD_INSDSeq.dtd">
<INSDSeq>
<INSDSeq_locus>DQ022078</INSDSeq_locus>
<INSDSeq_length>16729</INSDSeq_length>
<INSDSeq_moltype>DNA</INSDSeq_moltype>
<INSDSeq_topology>linear</INSDSeq_topology>
<INSDSeq_division>ENV</INSDSeq_division>
<INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
<INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
<INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
(a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
class C (estA3), putative permease (a3.005), putative transmembrane
signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
protein (a3.012), putative membrane protease subunit (a3.013),
putative haloalkane dehalogenase (a3.014), putative transcriptional
regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
hypothetical protein (a3.017) genes, complete cds</INSDSeq_definition>
<INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
<INSDSeq_other-seqids>
<INSDSeqid>gb|DQ022078.1|</INSDSeqid>
<INSDSeqid>gi|71842722</INSDSeqid>
</INSDSeq_other-seqids>
<INSDSeq_keywords>
<INSDKeyword>ENV</INSDKeyword>
</INSDSeq_keywords>
<INSDSeq_references>
<INSDReference>
<INSDReference_reference>?</INSDReference_reference>
<INSDReference_position>1..16729</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Schmeisser,C.</INSDAuthor>
<INSDAuthor>Elend,C.</INSDAuthor>
<INSDAuthor>Streit,W.R.</INSDAuthor>
</INSDReference_authors>
<INSDReference_title>Isolation and biochemical characterization
of two novel metagenome derived esterases</INSDReference_title>
<INSDReference_journal>Appl. Environ. Microbiol. 0:0-0
(2006)</INSDReference_journal>
</INSDReference>
<INSDReference>
<INSDReference_reference>?</INSDReference_reference>
<INSDReference_position>1..16729</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Schmeisser,C.</INSDAuthor>
<INSDAuthor>Elend,C.</INSDAuthor>
<INSDAuthor>Streit,W.R.</INSDAuthor>
</INSDReference_authors>
<INSDReference_journal>Submitted (29-APR-2005) to the
EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
Germany</INSDReference_journal>
</INSDReference>
</INSDSeq_references>
So my question is whether ASN2GB produces output that is incompatible
with the BioJava parsers, whether there is a problem with the
sequences themselves, or whether most of the parsers themselves have
problems. Could it be that I'm using the API incorrectly for the above
formats? The GenBank parser, at least, works as advertised, apart from
the exception described in issue #2.
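One structural property of the EMBL output can be checked mechanically.
In the standard EMBL flat-file layout an SQ header line precedes the
sequence data, while the listing pasted above goes straight from the
last FT line to the bases. Whether that is what trips the reader is an
assumption, not a confirmed diagnosis. A minimal sketch of such a check
(the class and method names are hypothetical, not part of BioJava):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-check: reports EMBL records that reach their "//"
// terminator without an SQ line having been seen.
final class EmblRecordCheck {

    /** Returns the ID-line text of every record that has no SQ line. */
    static List<String> recordsWithoutSqLine(BufferedReader in) throws IOException {
        List<String> missing = new ArrayList<>();
        String id = null;
        boolean inRecord = false;
        boolean sawSq = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("ID ")) {
                // A new record starts at its ID line.
                id = line.substring(2).trim();
                inRecord = true;
                sawSq = false;
            } else if (line.startsWith("SQ ")) {
                sawSq = true;
            } else if (line.startsWith("//")) {
                // End of record: remember it if no SQ line was seen.
                if (inRecord && !sawSq) {
                    missing.add(id);
                }
                inRecord = false;
            }
        }
        return missing;
    }
}
```

Running this over the ASN2GB EMBL output before calling readEMBLDNA
would at least show whether every record carries the SQ line that the
standard layout expects.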
ISSUE #2:
When I try to parse GenBank files using the following code:
BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
        SimpleNamespace.class, new Object[]{"gbSpace"});
RichSequenceIterator gbSeqs =
        RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        // Further processing of the RichSequence object from here
    } catch (BioException be) {
        be.printStackTrace();
    }
}
The GenBank file in question:
LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006
DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
IMAGE:30915482), complete cds.
ACCESSION BC074905
VERSION BC074905.2 GI:50959825
KEYWORDS MGC.
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 838)
AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
CONSRTM Mammalian Gene Collection Program Team
TITLE Generation and initial analysis of more than 15,000 full-length
human and mouse cDNA sequences
JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
PUBMED 12477932
REFERENCE 2 (bases 1 to 838)
CONSRTM NIH MGC Project
TITLE Direct Submission
JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian
Gene Collection (MGC), Bethesda, MD 20892-2590, USA
REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov
COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832.
Contact: MGC help desk
Email: cgapbs-r at mail.nih.gov
Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
Center
cDNA Library Preparation: British Columbia Cancer Research Center
cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
DNA Sequencing by: Genome Sequence Centre,
BC Cancer Agency, Vancouver, BC, Canada
info at bcgsc.bc.ca
Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR
Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.
Clone distribution: MGC clone distribution information can be found
through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
Series: IRBU Plate: 4 Row: C Column: 3.
Differences found between this sequence and the human reference
genome (build 36) are described in misc_difference features below.
FEATURES Location/Qualifiers
source 1..838
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/clone="MGC:104038 IMAGE:30915482"
/tissue_type="Lung, PCR rescued clones"
/clone_lib="NIH_MGC_273"
/lab_host="DH10B"
/note="Vector: pCR4 Topo TA with reversed insert"
gene 1..838
/gene="KLK14"
/note="synonym: KLK-L6"
/db_xref="GeneID:43847"
/db_xref="HGNC:6362"
/db_xref="IMGT/GENE-DB:6362"
/db_xref="MIM:606135"
CDS 49..804
/gene="KLK14"
/codon_start=1
/product="KLK14 protein"
/protein_id="AAH74905.1"
/db_xref="GI:50959826"
/db_xref="GeneID:43847"
/db_xref="HGNC:6362"
/db_xref="IMGT/GENE-DB:6362"
/db_xref="MIM:606135"
/translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
misc_difference 98
/gene="KLK14"
/note="'G' in cDNA is 'A' in the human genome; amino acid
difference: 'R' in cDNA, 'Q' in the human genome."
misc_difference 133
/gene="KLK14"
/note="'T' in cDNA is 'C' in the human genome; amino acid
difference: 'Y' in cDNA, 'H' in the human genome."
ORIGIN
1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
//
I get the following exception:
java.lang.IllegalArgumentException: Authors string cannot be null
org.biojava.bio.BioException: Could not read sequence
at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
-----------------------------------------------------------------------
I'm trying to work out what the problem with this particular sequence
could be. It looks to me like the AUTHORS portion is not being parsed
correctly. Any ideas would be greatly appreciated!
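One detail in the record above may be relevant: REFERENCE 2 has a
CONSRTM line but no AUTHORS line, which could be what leaves the
authors string null. That is a guess from the stack trace, not a
confirmed cause. A minimal sketch for finding such references in a
GenBank file follows. The class and method names are hypothetical, and
it assumes the original column layout, in which top-level keywords
start in the first column and sub-keywords such as AUTHORS are
indented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-check: lists REFERENCE blocks without an AUTHORS line.
final class GenbankAuthorsCheck {

    /** Returns the REFERENCE header line of every block lacking AUTHORS. */
    static List<String> referencesWithoutAuthors(BufferedReader in) throws IOException {
        List<String> missing = new ArrayList<>();
        String currentRef = null;   // header of the block being scanned
        boolean sawAuthors = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("REFERENCE")) {
                // Close the previous block, then open a new one.
                if (currentRef != null && !sawAuthors) {
                    missing.add(currentRef);
                }
                currentRef = line.trim();
                sawAuthors = false;
            } else if (currentRef != null) {
                boolean topLevel = !line.isEmpty()
                        && !Character.isWhitespace(line.charAt(0));
                if (topLevel) {
                    // Any other top-level keyword (COMMENT, FEATURES, "//")
                    // ends the current reference block.
                    if (!sawAuthors) {
                        missing.add(currentRef);
                    }
                    currentRef = null;
                } else if (line.trim().startsWith("AUTHORS")) {
                    sawAuthors = true;
                }
            }
        }
        // A block may still be open if the input ends unexpectedly.
        if (currentRef != null && !sawAuthors) {
            missing.add(currentRef);
        }
        return missing;
    }
}
```

If this reports REFERENCE 2 for the record above, the absent AUTHORS
line becomes the first thing to look at in the GenBank reader.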
--
Best Regards,
Seth Johnson
Senior Bioinformatics Associate
Ph: (202) 470-0900
Fx: (775) 251-0358