[Biojava-dev] ParseException on db_xref of RefSeq GPFF protein files
Tjeerd Boerman
twboerman at gmail.com
Fri Apr 27 15:10:15 UTC 2012
Hi,
I had already sent an email to NCBI, and I just got a response saying
that they are looking into it. So I guess we'll wait and see.
Regards,
Tjeerd
On 04/27/2012 03:53 PM, George Waldon wrote:
> Hi Tjeerd,
>
> This is an error in the GenBank file formatting. You should contact
> NCBI and ask them to fix it.
>
> - George
>
> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>
>> Hello,
>>
>> When parsing the file at
>>
>> ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.104.protein.gpff.gz
>>
>>
>> with BioJava 1.8.2, an exception occurs:
>>
>> ---begin exception---
>> org.biojava.bio.seq.io.ParseException:
>>
>> A Exception Has Occurred During Parsing.
>> Please submit the details that follow to biojava-l at biojava.org or
>> post a bug report to http://bugzilla.open-bio.org/
>>
>> Format_object=org.biojavax.bio.seq.io.GenbankFormat
>> Accession=YP_004256772
>> Id=325284232
>> Comments=Bad dbxref
>> Parse_block=FEATURES Location/Qualifierssource 1..1174/organism
>> "Deinococcus proteolyticus MRP"/strain "MRP"/isolation_source
>> "feces"/host "Lama glama"/culture_collection "DSMZ:DSM
>> 20540"/db_xref "taxon:693977"/plasmid "pDEIPR02"/collected_by
>> "M. Kobatake MRP"Protein 1..1174/product "hypothetical
>> protein"/calculated_mol_wt 129910Region 332..>674/region_name
>> "COG1002"/note "Type II restriction enzyme, methylase subunits
>> [Defense mechanisms]"/db_xref "CDD:31206"CDS 1..1174/locus_tag
>> "Deipr_2283"/coded_by "complement(NC_015162.1:211..3735)"/note
>> "COGs: COG1002 Type II restriction enzyme methylase
>> subunits;
>> KEGG: plm:Plim_2985 hypothetical protein;
>> SPTR: Type II restriction endonuclease"/transl_table 11/db_xref
>> "InterPro:DNA methylase, N-6 adenine-specific,
>> conserved site"/db_xref "InterPro:N6 adenine-specific DNA
>> methyltransferase, N12 class"/db_xref "GeneID:10257767"
>> Stack trace follows ....
>>
>> at
>> org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:470)
>> at loader.refseq.RefSeqTest.processFile(RefSeqTest.java:109)
>> at loader.refseq.RefSeqTest.testBioJava(RefSeqTest.java:57)
>> at loader.refseq.RefSeqTest.<init>(RefSeqTest.java:45)
>> at loader.refseq.RefSeqTest.main(RefSeqTest.java:39)
>> ---end exception---
>>
>>
>> Every db_xref is matched against the regular expression "^([^:]+):(\S+)$",
>> which requires that the identifier after the colon contain no
>> whitespace. Unfortunately, some InterPro db_xref identifiers do
>> contain whitespace, for example in CDS 1..1174 of protein YP_004256772:
>>
>> /db_xref="InterPro:DNA methylase, N-6
>> adenine-specific,
>> conserved site"
>> /db_xref="InterPro:N6 adenine-specific DNA
>> methyltransferase, N12 class"
>>
>> The GenBank format specification (
>> http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref ) does not mention
>> this format; it only defines the InterPro cross-reference as:
>>
>> /db_xref="InterPro:IPR002928"
>>
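The mismatch described above can be reproduced outside the parser. The following is a minimal sketch that applies the quoted regular expression to both forms of the qualifier value. The class DbXrefCheck and its methods are hypothetical and are not part of BioJava; collapsing wrapped lines into single spaces is an assumption about how continuation lines are joined before matching.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not BioJava code: it only reuses the pattern quoted in the report.
final class DbXrefCheck {
    // Database name, a colon, then an identifier that may not contain whitespace.
    static final Pattern DB_XREF = Pattern.compile("^([^:]+):(\\S+)$");

    // Assumption: wrapped continuation lines are joined with single spaces.
    static String normalize(String raw) {
        return raw.trim().replaceAll("\\s+", " ");
    }

    // True when the normalized value fits the "database:identifier" pattern.
    static boolean isValid(String value) {
        return DB_XREF.matcher(normalize(value)).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "InterPro:IPR002928",
            "GeneID:10257767",
            "InterPro:DNA methylase, N-6\n adenine-specific,\n conserved site",
            "InterPro:N6 adenine-specific DNA\n methyltransferase, N12 class"
        };
        for (String s : samples) {
            System.out.println(isValid(s) + "\t" + normalize(s));
        }
    }
}
```

The accession-style values pass, while the two descriptive InterPro values fail because of the spaces after the colon, which is consistent with the "Bad dbxref" comment in the exception.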
>>
>> My guess is that either the GenbankFormat parser is not compatible
>> with the GenPept format, or RefSeq is taking some liberties with the
>> GenBank specification. Any help would be appreciated!
>>
>> Best regards,
>> Tjeerd
>
>
>
> --------------------------------
> George Waldon
>
>