[Biojava-dev] ParseException on db_xref of RefSeq GPFF protein files

Tjeerd Boerman twboerman at gmail.com
Fri Apr 27 15:10:15 UTC 2012


Hi,

I had already sent an email to NCBI, and I just got a response saying 
that they are looking into it. So I guess we'll wait and see.

Regards,
Tjeerd

On 04/27/2012 03:53 PM, George Waldon wrote:
> Hi Tjeerd,
>
> This is an error in the GenBank file formatting. You should contact 
> NCBI and ask them to fix it.
>
> - George
>
> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>
>> Hello,
>>
>> When parsing the file at
>>
>> ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.104.protein.gpff.gz 
>>
>>
>> with BioJava 1.8.2, an exception occurs:
>>
>> ---begin exception---
>> org.biojava.bio.seq.io.ParseException:
>>
>> A Exception Has Occurred During Parsing.
>> Please submit the details that follow to biojava-l at biojava.org or 
>> post a bug report to http://bugzilla.open-bio.org/
>>
>> Format_object=org.biojavax.bio.seq.io.GenbankFormat
>> Accession=YP_004256772
>> Id=325284232
>> Comments=Bad dbxref
>> Parse_block=FEATURES   Location/Qualifierssource   1..1174/organism  
>>  "Deinococcus proteolyticus MRP"/strain   "MRP"/isolation_source   
>> "feces"/host   "Lama glama"/culture_collection   "DSMZ:DSM 
>> 20540"/db_xref   "taxon:693977"/plasmid   "pDEIPR02"/collected_by   
>> "M. Kobatake MRP"Protein   1..1174/product   "hypothetical 
>> protein"/calculated_mol_wt   129910Region   332..>674/region_name   
>> "COG1002"/note   "Type II restriction enzyme, methylase subunits
>> [Defense mechanisms]"/db_xref   "CDD:31206"CDS   1..1174/locus_tag   
>> "Deipr_2283"/coded_by   "complement(NC_015162.1:211..3735)"/note   
>> "COGs: COG1002 Type II restriction enzyme methylase
>> subunits;
>> KEGG: plm:Plim_2985 hypothetical protein;
>> SPTR: Type II restriction endonuclease"/transl_table   11/db_xref   
>> "InterPro:DNA methylase, N-6 adenine-specific,
>> conserved site"/db_xref   "InterPro:N6 adenine-specific DNA
>> methyltransferase, N12 class"/db_xref   "GeneID:10257767"
>> Stack trace follows ....
>>
>>     at 
>> org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:470)
>>     at loader.refseq.RefSeqTest.processFile(RefSeqTest.java:109)
>>     at loader.refseq.RefSeqTest.testBioJava(RefSeqTest.java:57)
>>     at loader.refseq.RefSeqTest.<init>(RefSeqTest.java:45)
>>     at loader.refseq.RefSeqTest.main(RefSeqTest.java:39)
>> ---end exception---
>>
>>
>> Every db_xref is matched against the regular expression "^([^:]+):(\S+)$", 
>> which requires that the identifier after the colon contain no 
>> whitespace. Unfortunately, some db_xref identifiers for InterPro do 
>> contain whitespace, for example in CDS 1..1174 of protein YP_004256772:
>>
>>                      /db_xref="InterPro:DNA methylase, N-6 adenine-specific,
>>                      conserved site"
>>                      /db_xref="InterPro:N6 adenine-specific DNA
>>                      methyltransferase, N12 class"
>>
>> The GenBank format specification
>> ( http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref ) does not mention
>> this format; it only defines the InterPro cross-reference as:
>>
>> /db_xref="InterPro:IPR002928"
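
The mismatch described above can be reproduced outside the parser. The following is a minimal sketch, assuming the regular expression quoted in the message is applied to each db_xref value as a whole; the class name DbXrefCheck and the method isAccepted are hypothetical and are not part of BioJava.

```java
import java.util.regex.Pattern;

public class DbXrefCheck {
    // Pattern as quoted in the message; assumed to be applied to the
    // full db_xref value (database name, colon, identifier).
    static final Pattern DB_XREF = Pattern.compile("^([^:]+):(\\S+)$");

    // Hypothetical helper: true if the value has the form "db:id"
    // with no whitespace anywhere in the identifier part.
    public static boolean isAccepted(String value) {
        return DB_XREF.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] values = {
            "InterPro:IPR002928",   // form given in the db_xref specification
            "GeneID:10257767",      // taken from the same record
            "InterPro:N6 adenine-specific DNA methyltransferase, N12 class"
        };
        for (String value : values) {
            System.out.println(isAccepted(value) + "  " + value);
        }
    }
}
```

The first two values satisfy the pattern, while the third does not because `\S+` cannot span the spaces in the descriptive InterPro name. This is consistent with the "Bad dbxref" comment in the exception.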
>>
>>
>> My guess is that either the GenbankFormat parser is not compatible 
>> with the GenPept format, or RefSeq is taking some liberties with the 
>> GenBank specification. Any help would be appreciated!
>>
>> Best regards,
>> Tjeerd
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>
>
>
> --------------------------------
> George Waldon
>
>


