[Biojava-dev] ParseException on db_xref of RefSeq GPFF protein files
Tjeerd Boerman
twboerman at gmail.com
Tue May 1 21:37:39 UTC 2012
Hey,
I just got a response from NCBI:
> Hello,
> This is a formatting error and it will be fixed.
> Best,
> Majda
Regards,
Tjeerd
On 4/27/2012 5:10 PM, Tjeerd Boerman wrote:
> Hi,
>
> I had already sent an email to NCBI, and I just got a response saying
> that they are looking into it. So I guess we'll wait and see.
>
> Regards,
> Tjeerd
>
> On 04/27/2012 03:53 PM, George Waldon wrote:
>> Hi Tjeerd,
>>
>> This is an error in the GenBank file formatting. You should contact
>> NCBI and ask them to fix it.
>>
>> - George
>>
>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>
>>> Hello,
>>>
>>> When parsing the file at
>>>
>>> ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.104.protein.gpff.gz
>>>
>>>
>>> with BioJava 1.8.2, an exception occurs:
>>>
>>> ---begin exception---
>>> org.biojava.bio.seq.io.ParseException:
>>>
>>> A Exception Has Occurred During Parsing.
>>> Please submit the details that follow to biojava-l at biojava.org or
>>> post a bug report to http://bugzilla.open-bio.org/
>>>
>>> Format_object=org.biojavax.bio.seq.io.GenbankFormat
>>> Accession=YP_004256772
>>> Id=325284232
>>> Comments=Bad dbxref
>>> Parse_block=FEATURES Location/Qualifierssource 1..1174/organism
>>> "Deinococcus proteolyticus MRP"/strain "MRP"/isolation_source
>>> "feces"/host "Lama glama"/culture_collection "DSMZ:DSM
>>> 20540"/db_xref "taxon:693977"/plasmid "pDEIPR02"/collected_by
>>> "M. Kobatake MRP"Protein 1..1174/product "hypothetical
>>> protein"/calculated_mol_wt 129910Region 332..>674/region_name
>>> "COG1002"/note "Type II restriction enzyme, methylase subunits
>>> [Defense mechanisms]"/db_xref "CDD:31206"CDS 1..1174/locus_tag
>>> "Deipr_2283"/coded_by "complement(NC_015162.1:211..3735)"/note
>>> "COGs: COG1002 Type II restriction enzyme methylase
>>> subunits;
>>> KEGG: plm:Plim_2985 hypothetical protein;
>>> SPTR: Type II restriction endonuclease"/transl_table 11/db_xref
>>> "InterPro:DNA methylase, N-6 adenine-specific,
>>> conserved site"/db_xref "InterPro:N6 adenine-specific DNA
>>> methyltransferase, N12 class"/db_xref "GeneID:10257767"
>>> Stack trace follows ....
>>>
>>> at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:470)
>>> at loader.refseq.RefSeqTest.processFile(RefSeqTest.java:109)
>>> at loader.refseq.RefSeqTest.testBioJava(RefSeqTest.java:57)
>>> at loader.refseq.RefSeqTest.<init>(RefSeqTest.java:45)
>>> at loader.refseq.RefSeqTest.main(RefSeqTest.java:39)
>>> ---end exception---
>>>
>>>
>>> Every db_xref is matched against the regular expression "^([^:]+):(\S+)$",
>>> which requires that the identifier after the colon contain no
>>> whitespace. Unfortunately, some InterPro db_xref identifiers do
>>> contain whitespace, for example in CDS 1..1174 of protein
>>> YP_004256772:
>>>
>>> /db_xref="InterPro:DNA methylase, N-6
>>> adenine-specific,
>>> conserved site"
>>> /db_xref="InterPro:N6 adenine-specific DNA
>>> methyltransferase, N12 class"
>>>
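For anyone who wants to reproduce the mismatch without downloading the whole release file, here is a minimal sketch that applies the regular expression quoted above to two values. It assumes the pattern is used exactly as quoted and that the wrapped qualifier value is joined with single spaces before matching; the class name DbXrefCheck and the method isAccepted are made up for illustration and are not part of BioJava.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not BioJava code: checks a db_xref value against
// the pattern quoted in the report.
class DbXrefCheck {
    // Pattern copied from the report; assumed to mirror what the parser applies.
    static final Pattern DB_XREF = Pattern.compile("^([^:]+):(\\S+)$");

    // Returns true if the whole value has the form "database:identifier"
    // with no whitespace in the identifier.
    static boolean isAccepted(String value) {
        return DB_XREF.matcher(value).matches();
    }

    public static void main(String[] args) {
        // Form shown in the db_xref documentation.
        String documented = "InterPro:IPR002928";
        // Value from YP_004256772, with the line wrap joined by a space (assumption).
        String fromRefSeq = "InterPro:N6 adenine-specific DNA methyltransferase, N12 class";

        System.out.println(documented + " -> " + isAccepted(documented));
        System.out.println(fromRefSeq + " -> " + isAccepted(fromRefSeq));
    }
}
```

Under those assumptions the first value is accepted and the second is rejected, because \S+ cannot span the spaces after "N6".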
>>> The GenBank format specification (
>>> http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref ) does not
>>> mention this format; it only defines the InterPro cross-reference as:
>>>
>>> /db_xref="InterPro:IPR002928"
>>>
>>>
>>> My guess is that either the GenbankFormat parser is not compatible
>>> with the GenPept format, or RefSeq is taking some liberties with the
>>> GenBank specification. Any help would be appreciated!
>>>
>>> Best regards,
>>> Tjeerd
>>>
>>
>>
>>
>> --------------------------------
>> George Waldon
>>
>>