[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
mark.schreiber at novartis.com
Fri Jun 2 02:24:58 UTC 2006
Hi Seth -
The BioJavaX parsers are still quite new and have not been heavily tested,
so your experiences can help us quite a lot. The parsers were initially
designed to be quite strict and to follow the GenBank etc. specifications.
However, real files often contain minor variations from those specs, which
cause things to break.
To help us find the bugs, can you make sure you are using the very latest
version of BioJava from CVS? For example, I was under the impression that
the author = null problem had already been solved. In each case, an example
file and the full stack trace are very useful as well. In some cases you
have provided these, so we have a starting point.
Also, if you have ideas on ways to fix the problems, your suggestions would
be greatly appreciated. We only have a very small team of active
developers, many of whom are unfortunately very busy just now.
Hopefully we can get to this soon.
- Mark
"Seth Johnson" <johnson.biotech at gmail.com>
Sent by: biojava-l-bounces at lists.open-bio.org
06/02/2006 06:03 AM
To: biojava-l at lists.open-bio.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
Hi All,
I'm a newbie to the whole BioJava(X) API and was hoping to get some
clarification on several issues that I'm having.
I am developing a parser that takes "NCBI Incremental ASN.1 Sequence
Updates to Genbank" files (ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc) as
input, gunzips them, and uses the ASN2GB converter
(ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb; man page:
http://www.penguin-soft.com/penguin/man/1/asn2gb.html) to convert the
resulting sequences to a format parsable by BioJava(X). This is where
my problems start.
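The decompression step of the pipeline described above can be done from Java with the standard library alone. The following is a minimal sketch under stated assumptions: the class name GunzipStep and the file paths passed on the command line are hypothetical, and the ASN2GB conversion itself is left out because its command-line options are not given in this message.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper for the gunzip step; not part of BioJava. */
final class GunzipStep {

    /** Decompresses gzFile into outFile, overwriting outFile if it exists. */
    static void gunzip(Path gzFile, Path outFile) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(gzFile));
             OutputStream out = Files.newOutputStream(outFile)) {
            byte[] buf = new byte[8192];
            int n;
            // Copy until the compressed stream is exhausted.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: a downloaded .gz update file, args[1]: where to write the binary ASN.1
        gunzip(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The decompressed file is still binary ASN.1, so it has to go through the converter before any of the BioJava(X) readers can be pointed at it.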
ISSUE #1:
I've tried to parse all of the formats that ASN2GB outputs ( GenBank
(default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank
format is recognized by the
"RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
some exceptions that I'll describe in issue #2. This is the code that
I'm using to parse, for example, the EMBL output:
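import java.io.BufferedReader;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojavax.Namespace;
import org.biojavax.RichObjectFactory;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
// Namespace under which the parsed sequences are registered
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
        SimpleNamespace.class, new Object[]{"gbSpace"});
RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        // Further processing of the RichSequence object from here
    } catch (BioException be) {
        be.printStackTrace();
    }
}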
The multi-sequence EMBL file looks like this:
---------------------------------------------------------------------------------
ID DQ472184 standard; DNA; INV; 546 BP.
XX
AC DQ472184;
XX
SV DQ472184.1
DT 15-MAY-2006
XX
DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
DE complete cds.
XX
KW .
XX
OS Trypanosoma cruzi strain CL Brener
OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC Schizotrypanum.
XX
RN [1]
RP 1-546
RA De Melo L.D.B.;
RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL Unpublished.
XX
RN [2]
RP 1-546
RA De Melo L.D.B.;
RT ;
RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL 21949-900, Brazil
XX
FH Key Location/Qualifiers
FH
FT source 1..546
FT /organism="Trypanosoma cruzi strain CL Brener"
FT /mol_type="genomic DNA"
FT /strain="CL Brener"
FT /db_xref="taxon:353153"
FT gene <1..>546
FT /gene="ARC21"
FT /note="TcARC21"
FT mRNA <1..>546
FT /gene="ARC21"
FT /product="actin-related protein 3"
FT CDS 1..546
FT /gene="ARC21"
FT /note="actin-binding protein; ARPC3 21 kDa; putative
FT member of Arp2/3 complex"
FT /codon_start=1
FT /product="actin-related protein 3"
FT /protein_id="ABF13401.1"
FT /db_xref="GI:93360014"
FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
FT FPEKDGTGNKFWMAFAKRPFLASS"
atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60
cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120
gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180
cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240
acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300
tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360
tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420
aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480
aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540
agttag 546
//
ID DQ472185 standard; DNA; INV; 543 BP.
XX
AC DQ472185;
XX
SV DQ472185.1
DT 15-MAY-2006
XX
DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
DE complete cds.
XX
KW .
XX
OS Trypanosoma cruzi strain CL Brener
OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC Schizotrypanum.
XX
RN [1]
RP 1-543
RA De Melo L.D.B.;
RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL Unpublished.
XX
RN [2]
RP 1-543
RA De Melo L.D.B.;
RT ;
RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL 21949-900, Brazil
XX
FH Key Location/Qualifiers
FH
FT source 1..543
FT /organism="Trypanosoma cruzi strain CL Brener"
FT /mol_type="genomic DNA"
FT /strain="CL Brener"
FT /db_xref="taxon:353153"
FT gene <1..>543
FT /gene="ARC20"
FT /note="TcARC20"
FT mRNA <1..>543
FT /gene="ARC20"
FT /product="actin-related protein 4"
FT CDS 1..543
FT /gene="ARC20"
FT /note="actin-binding protein; ARPC4 20 kDa; putative
FT member of Arp2/3 complex"
FT /codon_start=1
FT /product="actin-related protein 4"
FT /protein_id="ABF13402.1"
FT /db_xref="GI:93360016"
FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
FT MKLNVNQRARRAAMEFFLALNFT"
atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60
tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120
gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180
cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240
atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300
ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360
tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420
attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480
aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
tga 543
//
-----------------------------------------------------------------------
I get the exception message "Could not read sequence". The same thing
happens if I use the readINSDSetDNA reader instead of the readEMBLDNA one
with the following INSDSet file (beginning of the file):
<?xml version="1.0"?>
<!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "INSD_INSDSeq.dtd">
<INSDSeq>
<INSDSeq_locus>DQ022078</INSDSeq_locus>
<INSDSeq_length>16729</INSDSeq_length>
<INSDSeq_moltype>DNA</INSDSeq_moltype>
<INSDSeq_topology>linear</INSDSeq_topology>
<INSDSeq_division>ENV</INSDSeq_division>
<INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
<INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
<INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
(a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
class C (estA3), putative permease (a3.005), putative transmembrane
signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
protein (a3.012), putative membrane protease subunit (a3.013),
putative haloalkane dehalogenase (a3.014), putative transcriptional
regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
hypothetical protein (a3.017) genes, complete cds</INSDSeq_definition>
<INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
<INSDSeq_other-seqids>
<INSDSeqid>gb|DQ022078.1|</INSDSeqid>
<INSDSeqid>gi|71842722</INSDSeqid>
</INSDSeq_other-seqids>
<INSDSeq_keywords>
<INSDKeyword>ENV</INSDKeyword>
</INSDSeq_keywords>
<INSDSeq_references>
<INSDReference>
<INSDReference_reference>?</INSDReference_reference>
<INSDReference_position>1..16729</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Schmeisser,C.</INSDAuthor>
<INSDAuthor>Elend,C.</INSDAuthor>
<INSDAuthor>Streit,W.R.</INSDAuthor>
</INSDReference_authors>
<INSDReference_title>Isolation and biochemical characterization
of two novel metagenome derived esterases</INSDReference_title>
<INSDReference_journal>Appl. Environ. Microbiol. 0:0-0
(2006)</INSDReference_journal>
</INSDReference>
<INSDReference>
<INSDReference_reference>?</INSDReference_reference>
<INSDReference_position>1..16729</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Schmeisser,C.</INSDAuthor>
<INSDAuthor>Elend,C.</INSDAuthor>
<INSDAuthor>Streit,W.R.</INSDAuthor>
</INSDReference_authors>
<INSDReference_journal>Submitted (29-APR-2005) to the
EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
Germany</INSDReference_journal>
</INSDReference>
</INSDSeq_references>
So my question is whether ASN2GB produces output that's incompatible
with the BioJava parsers, whether there is a problem with the sequences
themselves, or whether the problem lies with the majority of the parsers.
Could it be that I'm using the API wrongly for the above formats? The
GenBank parser, at least, works as advertised, with some exceptions
described in Issue #2 below.
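One thing that may be worth checking in the EMBL listing above: as pasted in this message, each record goes from the FT lines directly to the sequence data, with no SQ line in between. The EMBL flat-file format expects an SQ line before the sequence, so a strict parser could plausibly reject such records. Whether that is what readEMBLDNA trips over here is an assumption, not something confirmed in this thread. The sketch below uses only the Java standard library (no BioJava calls), and EmblLineCodeCheck with its methods is a hypothetical helper. It reports, per record, which of the ID, AC and SQ line codes are absent.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical pre-parse check; not part of BioJava. */
final class EmblLineCodeCheck {

    // Line codes that every EMBL entry is expected to contain.
    private static final List<String> REQUIRED = Arrays.asList("ID", "AC", "SQ");

    /** Returns one message per required line code missing from a record. */
    static List<String> findProblems(String flatFile) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String id = null;
        boolean inRecord = false;
        for (String line : flatFile.split("\\r?\\n")) {
            if (line.startsWith("//")) {               // record terminator
                if (inRecord) report(id, seen, problems);
                seen.clear();
                id = null;
                inRecord = false;
                continue;
            }
            if (line.trim().isEmpty()) continue;
            inRecord = true;
            if (hasLineCode(line)) {
                String code = line.substring(0, 2);
                seen.add(code);
                if (code.equals("ID") && id == null) {
                    // First token after the line code is the entry name.
                    id = line.substring(2).trim().split("\\s+")[0];
                }
            }
        }
        if (inRecord) report(id, seen, problems);      // last record lacked "//"
        return problems;
    }

    // A line code is two upper-case letters followed by a space or end of line.
    private static boolean hasLineCode(String line) {
        return line.length() >= 2
                && Character.isUpperCase(line.charAt(0))
                && Character.isUpperCase(line.charAt(1))
                && (line.length() == 2 || line.charAt(2) == ' ');
    }

    private static void report(String id, Set<String> seen, List<String> problems) {
        for (String code : REQUIRED) {
            if (!seen.contains(code)) {
                problems.add("record " + (id == null ? "(no ID)" : id)
                        + ": missing " + code + " line");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to an EMBL flat file, for example the converter output
        String text = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.US_ASCII);
        for (String problem : findProblems(text)) {
            System.out.println(problem);
        }
    }
}
```

If the check flags every record, the converter output rather than the individual sequences is the more likely place to look.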
ISSUE #2:
When I try to parse GenBank files using the following code:
BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
        SimpleNamespace.class, new Object[]{"gbSpace"});
RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        // Further processing of the RichSequence object from here
    } catch (BioException be) {
        be.printStackTrace();
    }
}
Genbank file in question:
LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006
DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
IMAGE:30915482), complete cds.
ACCESSION BC074905
VERSION BC074905.2 GI:50959825
KEYWORDS MGC.
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 838)
AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
CONSRTM Mammalian Gene Collection Program Team
TITLE Generation and initial analysis of more than 15,000 full-length
human and mouse cDNA sequences
JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
PUBMED 12477932
REFERENCE 2 (bases 1 to 838)
CONSRTM NIH MGC Project
TITLE Direct Submission
JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian
Gene Collection (MGC), Bethesda, MD 20892-2590, USA
REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov
COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832.
Contact: MGC help desk
Email: cgapbs-r at mail.nih.gov
Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
Center
cDNA Library Preparation: British Columbia Cancer Research Center
cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
DNA Sequencing by: Genome Sequence Centre,
BC Cancer Agency, Vancouver, BC, Canada
info at bcgsc.bc.ca
Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR
Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.
Clone distribution: MGC clone distribution information can be found
through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
Series: IRBU Plate: 4 Row: C Column: 3.
Differences found between this sequence and the human reference
genome (build 36) are described in misc_difference features below.
FEATURES Location/Qualifiers
source 1..838
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/clone="MGC:104038 IMAGE:30915482"
/tissue_type="Lung, PCR rescued clones"
/clone_lib="NIH_MGC_273"
/lab_host="DH10B"
/note="Vector: pCR4 Topo TA with reversed insert"
gene 1..838
/gene="KLK14"
/note="synonym: KLK-L6"
/db_xref="GeneID:43847"
/db_xref="HGNC:6362"
/db_xref="IMGT/GENE-DB:6362"
/db_xref="MIM:606135"
CDS 49..804
/gene="KLK14"
/codon_start=1
/product="KLK14 protein"
/protein_id="AAH74905.1"
/db_xref="GI:50959826"
/db_xref="GeneID:43847"
/db_xref="HGNC:6362"
/db_xref="IMGT/GENE-DB:6362"
/db_xref="MIM:606135"
/translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
misc_difference 98
/gene="KLK14"
/note="'G' in cDNA is 'A' in the human genome; amino acid
difference: 'R' in cDNA, 'Q' in the human genome."
misc_difference 133
/gene="KLK14"
/note="'T' in cDNA is 'C' in the human genome; amino acid
difference: 'Y' in cDNA, 'H' in the human genome."
ORIGIN
1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
//
I get the following exception:
java.lang.IllegalArgumentException: Authors string cannot be null
org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
    at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
    at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
    at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
    at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
-----------------------------------------------------------------------
I'm trying to see what could be the problem with this particular
sequence. Looks to me like the AUTHORS portion is not getting parsed
correctly. Any ideas would be greatly appreciated!
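One observation that may help narrow this down: REFERENCE 2 in the record above has a CONSRTM line but no AUTHORS line, and the stack trace shows parseAuthorString being handed a null string. That this is the trigger is an assumption drawn from the listing and the trace, not a confirmed diagnosis. The sketch below (plain Java, no BioJava calls; GenbankReferenceCheck is a hypothetical helper) lists the REFERENCE blocks in a GenBank flat file that lack an AUTHORS line, so affected records can be found before parsing. It relies on the GenBank convention that top-level keywords start in column 1 and sub-keywords are indented, so it should be run on the file itself rather than on text re-wrapped by a mail client.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-parse check; not part of BioJava. */
final class GenbankReferenceCheck {

    /** Returns one message per REFERENCE block that has no AUTHORS line. */
    static List<String> referencesWithoutAuthors(String flatFile) {
        List<String> hits = new ArrayList<>();
        String locus = "(unknown)";
        String currentRef = null;     // header line of the REFERENCE block being scanned
        boolean hasAuthors = false;
        for (String line : flatFile.split("\\r?\\n")) {
            String trimmed = line.trim();
            boolean topLevel = !line.isEmpty() && !Character.isWhitespace(line.charAt(0));
            if (topLevel) {
                // Any top-level keyword (or "//") closes the previous REFERENCE block.
                if (currentRef != null && !hasAuthors) {
                    hits.add(locus + ": " + currentRef + " has no AUTHORS line");
                }
                currentRef = null;
                hasAuthors = false;
                if (trimmed.startsWith("LOCUS")) {
                    String[] parts = trimmed.split("\\s+");
                    locus = parts.length > 1 ? parts[1] : "(unknown)";
                } else if (trimmed.startsWith("REFERENCE")) {
                    currentRef = trimmed;
                }
            } else if (currentRef != null && trimmed.startsWith("AUTHORS")) {
                hasAuthors = true;
            }
        }
        // Handle a file that ends inside a REFERENCE block.
        if (currentRef != null && !hasAuthors) {
            hits.add(locus + ": " + currentRef + " has no AUTHORS line");
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a GenBank flat file
        String text = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.US_ASCII);
        for (String hit : referencesWithoutAuthors(text)) {
            System.out.println(hit);
        }
    }
}
```

If the flagged records are the same ones that throw, that would point at the handling of author-less references rather than at the long AUTHORS list in REFERENCE 1.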
--
Best Regards,
Seth Johnson
Senior Bioinformatics Associate
Ph: (202) 470-0900
Fx: (775) 251-0358