[Biojava-dev] ParseException on db_xref of RefSeq GPFF protein files
George Waldon
gwaldon at geneinfinity.org
Fri Apr 27 13:53:37 UTC 2012
Hi Tjeerd,
This is an error in the GenBank file formatting. You should contact
NCBI and ask them to fix it.
- George
Quoting Tjeerd Boerman <twboerman at gmail.com>:
> Hello,
>
> When parsing the file at
>
> ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.104.protein.gpff.gz
>
> with BioJava 1.8.2, an exception occurs:
>
> ---begin exception---
> org.biojava.bio.seq.io.ParseException:
>
> A Exception Has Occurred During Parsing.
> Please submit the details that follow to biojava-l at biojava.org or
> post a bug report to http://bugzilla.open-bio.org/
>
> Format_object=org.biojavax.bio.seq.io.GenbankFormat
> Accession=YP_004256772
> Id=325284232
> Comments=Bad dbxref
> Parse_block=FEATURES Location/Qualifierssource 1..1174/organism
> "Deinococcus proteolyticus MRP"/strain "MRP"/isolation_source
> "feces"/host "Lama glama"/culture_collection "DSMZ:DSM
> 20540"/db_xref "taxon:693977"/plasmid "pDEIPR02"/collected_by
> "M. Kobatake MRP"Protein 1..1174/product "hypothetical
> protein"/calculated_mol_wt 129910Region 332..>674/region_name
> "COG1002"/note "Type II restriction enzyme, methylase subunits
> [Defense mechanisms]"/db_xref "CDD:31206"CDS 1..1174/locus_tag
> "Deipr_2283"/coded_by "complement(NC_015162.1:211..3735)"/note
> "COGs: COG1002 Type II restriction enzyme methylase
> subunits;
> KEGG: plm:Plim_2985 hypothetical protein;
> SPTR: Type II restriction endonuclease"/transl_table 11/db_xref
> "InterPro:DNA methylase, N-6 adenine-specific,
> conserved site"/db_xref "InterPro:N6 adenine-specific DNA
> methyltransferase, N12 class"/db_xref "GeneID:10257767"
> Stack trace follows ....
>
> at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:470)
> at loader.refseq.RefSeqTest.processFile(RefSeqTest.java:109)
> at loader.refseq.RefSeqTest.testBioJava(RefSeqTest.java:57)
> at loader.refseq.RefSeqTest.<init>(RefSeqTest.java:45)
> at loader.refseq.RefSeqTest.main(RefSeqTest.java:39)
> ---end exception---
>
>
> Every db_xref is matched against the regular expression "^([^:]+):(\S+)$",
> which requires that the identifier after the colon contain no
> whitespace. Unfortunately, some InterPro db_xref identifiers do
> contain whitespace, for example in CDS 1..1174 of protein
> YP_004256772:
>
> /db_xref="InterPro:DNA methylase, N-6 adenine-specific,
> conserved site"
> /db_xref="InterPro:N6 adenine-specific DNA
> methyltransferase, N12 class"
>
> The GenBank format specification (
> http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref ) does not
> mention this format; it only defines the InterPro cross-reference as:
>
> /db_xref="InterPro:IPR002928"
>
>
> My guess is that either the GenbankFormat parser is not compatible
> with the GenPept format, or RefSeq is taking some liberties with the
> Genbank specification. Any help would be appreciated!
>
> Best regards,
> Tjeerd
>
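For anyone who wants to reproduce the check outside the parser, here is a
minimal sketch of the regular expression quoted in the message above, run
against both forms of the InterPro db_xref. The class DbXrefCheck, its
method names and the LENIENT pattern are hypothetical and written only for
this illustration; they are not part of BioJava. The lenient variant is one
possible workaround, assuming you would rather accept such records than
reject them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not BioJava code.
class DbXrefCheck {
    // Pattern quoted in the original message: database name, a colon,
    // then an identifier that may not contain whitespace.
    static final Pattern STRICT = Pattern.compile("^([^:]+):(\\S+)$");

    // Assumed lenient variant: anything non-empty after the first colon.
    // DOTALL lets '.' span a value that was wrapped across lines.
    static final Pattern LENIENT =
            Pattern.compile("^([^:]+):(.+)$", Pattern.DOTALL);

    // True if the value passes the strict check quoted in the message.
    static boolean strictMatches(String value) {
        return STRICT.matcher(value).matches();
    }

    // Splits a value into {database, identifier}, collapsing runs of
    // whitespace (including line wraps) in the identifier to one space.
    // Returns null when the value does not contain "name:identifier".
    static String[] lenientSplit(String value) {
        Matcher m = LENIENT.matcher(value);
        if (!m.matches()) {
            return null;
        }
        String id = m.group(2).trim().replaceAll("\\s+", " ");
        return new String[] { m.group(1), id };
    }

    public static void main(String[] args) {
        String documented = "InterPro:IPR002928";
        String fromRefSeq =
                "InterPro:DNA methylase, N-6 adenine-specific, conserved site";

        // The documented form passes the strict check, the RefSeq form does not.
        System.out.println(strictMatches(documented));
        System.out.println(strictMatches(fromRefSeq));

        // The lenient variant still separates database name and identifier.
        String[] parts = lenientSplit(fromRefSeq);
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

The strict pattern rejects the RefSeq value because \S+ has to reach the end
of the string and stops at the first space. Whether loosening the pattern is
appropriate depends on whether these values should count as valid db_xref
entries at all.
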
--------------------------------
George Waldon