[Biojava-dev] ParseException on db_xref of RefSeq GPFF protein files
Tjeerd Boerman
twboerman at gmail.com
Thu Apr 26 14:37:06 UTC 2012
Hello,
When parsing the file at
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.104.protein.gpff.gz
with BioJava 1.8.2, an exception occurs:
---begin exception---
org.biojava.bio.seq.io.ParseException:
A Exception Has Occurred During Parsing.
Please submit the details that follow to biojava-l at biojava.org or post a
bug report to http://bugzilla.open-bio.org/
Format_object=org.biojavax.bio.seq.io.GenbankFormat
Accession=YP_004256772
Id=325284232
Comments=Bad dbxref
Parse_block=FEATURES Location/Qualifierssource 1..1174/organism
"Deinococcus proteolyticus MRP"/strain "MRP"/isolation_source
"feces"/host "Lama glama"/culture_collection "DSMZ:DSM
20540"/db_xref "taxon:693977"/plasmid "pDEIPR02"/collected_by "M.
Kobatake MRP"Protein 1..1174/product "hypothetical
protein"/calculated_mol_wt 129910Region 332..>674/region_name
"COG1002"/note "Type II restriction enzyme, methylase subunits
[Defense mechanisms]"/db_xref "CDD:31206"CDS 1..1174/locus_tag
"Deipr_2283"/coded_by "complement(NC_015162.1:211..3735)"/note
"COGs: COG1002 Type II restriction enzyme methylase
subunits;
KEGG: plm:Plim_2985 hypothetical protein;
SPTR: Type II restriction endonuclease"/transl_table 11/db_xref
"InterPro:DNA methylase, N-6 adenine-specific,
conserved site"/db_xref "InterPro:N6 adenine-specific DNA
methyltransferase, N12 class"/db_xref "GeneID:10257767"
Stack trace follows ....
at
org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:470)
at loader.refseq.RefSeqTest.processFile(RefSeqTest.java:109)
at loader.refseq.RefSeqTest.testBioJava(RefSeqTest.java:57)
at loader.refseq.RefSeqTest.<init>(RefSeqTest.java:45)
at loader.refseq.RefSeqTest.main(RefSeqTest.java:39)
---end exception---
Every db_xref is matched against the regular expression "^([^:]+):(\S+)$",
which requires that the identifier after the colon contain no
whitespace. Unfortunately, some InterPro db_xref identifiers do
contain whitespace, for example in CDS 1..1174 of protein YP_004256772:
/db_xref="InterPro:DNA methylase, N-6
adenine-specific,
conserved site"
/db_xref="InterPro:N6 adenine-specific DNA
methyltransferase, N12 class"
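The mismatch can be reproduced outside BioJava with a minimal sketch, assuming the pattern quoted above is the one the parser applies. The class name DbXrefRegexDemo and the RELAXED pattern are hypothetical illustrations, not BioJava code, and the multi-line values are joined with single spaces here for simplicity:

```java
import java.util.regex.Pattern;

public class DbXrefRegexDemo {
    // The pattern quoted in this message: no whitespace allowed after the colon.
    static final Pattern STRICT = Pattern.compile("^([^:]+):(\\S+)$");

    // Hypothetical relaxed alternative (an assumption, not the BioJava fix):
    // accept any characters, including whitespace, after the first colon.
    static final Pattern RELAXED = Pattern.compile("^([^:]+):(.+)$", Pattern.DOTALL);

    // True if the whole db_xref value matches the given pattern.
    static boolean matches(Pattern pattern, String value) {
        return pattern.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] values = {
            "taxon:693977",
            "InterPro:IPR002928",
            "InterPro:DNA methylase, N-6 adenine-specific, conserved site",
            "InterPro:N6 adenine-specific DNA methyltransferase, N12 class"
        };
        for (String value : values) {
            System.out.println(value
                + " -> strict=" + matches(STRICT, value)
                + ", relaxed=" + matches(RELAXED, value));
        }
    }
}
```

The two descriptive InterPro values fail the strict pattern only because of the whitespace after the colon. Whether relaxing the pattern is the right fix depends on whether such values should be accepted at all.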
The GenBank format specification (
http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref ) does not mention
this form; it only defines the InterPro cross-reference as:
/db_xref="InterPro:IPR002928"
My guess is that either the GenbankFormat parser is not compatible with
the GenPept format, or RefSeq is taking some liberties with the Genbank
specification. Any help would be appreciated!
Best regards,
Tjeerd