[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files

Mon Jun 5 15:11:07 UTC 2006

Hi again.

Could you remove the offending question mark from the GenBank file and
try again to see if that fixes it? The parser should just ignore the
stray line, but apparently it doesn't. The error looks weird to me
because the tokenization for a DNA GenBank file _does_ contain the
letter 't'! Not sure what's going on here.
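
A minimal sketch of that experiment, assuming the stray line consists of
nothing but a '?' (as in the record quoted below). StrayLineFilter and
dropStrayQuestionMarks are made-up names for illustration and are not
part of BioJava; the cleaned file would then be fed to the parser as
before.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-processing helper; not part of BioJava. */
final class StrayLineFilter {

    /** Returns a copy of the record without lines that contain only '?'. */
    static List<String> dropStrayQuestionMarks(List<String> lines) {
        List<String> cleaned = new ArrayList<>();
        for (String line : lines) {
            // Assumption: a stray line is exactly "?" plus optional whitespace.
            if (!line.trim().equals("?")) {
                cleaned.add(line);
            }
        }
        return cleaned;
    }

    /** Usage: java StrayLineFilter input.gb cleaned.gb */
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java StrayLineFilter <input> <output>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), dropStrayQuestionMarks(lines));
    }
}
```

If the cleaned file parses, the '?' line is the trigger; if the same
exception still appears, the cause lies elsewhere.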

With regard to your INSDseqXML problems, the stacktrace pointed to a bug
in SimpleRichSequenceBuilder that would actually cause these problems
for any file containing a feature qualifier with no value, regardless
of format. I think I have fixed this now. Could you test it? (It's in
CVS already.)
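
To illustrate what a qualifier without a value looks like, here is a
small sketch using GenBank flat-file syntax, where a flag such as
/pseudo carries no '=value' part. QualifierSketch and parse are
hypothetical names and this is not the SimpleRichSequenceBuilder code;
it only shows one way to represent the missing value (as an empty
string) so that later code never sees a null.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Hypothetical illustration; not the BioJava implementation. */
final class QualifierSketch {

    /** Splits "/key=value" or "/key" into a key and a never-null value. */
    static Map.Entry<String, String> parse(String line) {
        String text = line.trim();
        if (!text.startsWith("/")) {
            throw new IllegalArgumentException("Not a qualifier: " + line);
        }
        int eq = text.indexOf('=');
        if (eq < 0) {
            // Valueless qualifier: use an empty string instead of null.
            return new SimpleEntry<>(text.substring(1), "");
        }
        String value = text.substring(eq + 1);
        // Strip one pair of surrounding double quotes, if present.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new SimpleEntry<>(text.substring(1, eq), value);
    }

    public static void main(String[] args) {
        System.out.println(parse("/mol_type=\"genomic DNA\""));
        System.out.println(parse("/pseudo"));
    }
}
```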

cheers,
Richard

On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote:
> Hello again Richard,
> 
> No sooner had I reported the fix for the last parsing exception than
> another one came up, this time with the Genbank format:
> --------------------------------------
> org.biojava.bio.seq.io.ParseException: DQ431065
> org.biojava.bio.BioException: Could not read sequence
>         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
>         at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151)
>         at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246)
>         at exonhit.parsers.GenBankParser.main(GenBankParser.java:326)
> Caused by: org.biojava.bio.seq.io.ParseException: DQ431065
>         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
>         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
>         ... 3 more
> org.biojava.bio.seq.io.ParseException:
> org.biojava.bio.symbol.IllegalSymbolException: This tokenization
> doesn't contain character: 't'
> ----------------------------------------
> The Genbank file that caused it is as follows:
> =========================================
> LOCUS       DQ431065                 425 bp    DNA     linear   INV 01-JUN-2006
> DEFINITION  Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial
>             sequence; mitochondrial.
> ACCESSION   DQ431065
> VERSION     DQ431065.1  GI:90102206
> KEYWORDS    .
> SOURCE      Vaccinium corymbosum
>   ORGANISM  Vaccinium corymbosum
>             Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
>             Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
>             asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae;
>             Vaccinium.
> ?
> REFERENCE   2  (bases 1 to 425)
>   AUTHORS   Naik,L.D. and Rowland,L.J.
>   TITLE     Expressed Sequence Tags of cDNA clones from subtracted library of
>             Vaccinium corymbosum
>   JOURNAL   Unpublished (2005)
> FEATURES             Location/Qualifiers
>      source          1..425
>                      /organism="Vaccinium corymbosum"
>                      /mol_type="genomic DNA"
>                      /cultivar="Bluecrop"
>                      /db_xref="taxon:69266"
>                      /tissue_type="Flower buds"
>                      /clone_lib="Subtracted cDNA library of Vaccinium
>                      corymbosum"
>                      /dev_stage="399 hour chill unit exposure"
>                      /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I"
>      rRNA            <1..>425
>                      /product="16S ribosomal RNA"
> ORIGIN
>         1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac
>        61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt
>       121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt
>       181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta
>       241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat
>       301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta
>       361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
>       421 cgtaa
> //
> ==================================
> I think it's the presence of the '?' at the beginning of the line.
> I'm not sure whether the information that was supposed to be present
> instead of those question marks is absent from the original ASN.1
> batch file or whether it's a bug in the NCBI ASN2GO software.  It looks
> to me like the former is the case, since the file from the NCBI website
> contains much more information than the batch file. Just bringing this
> to everyone's attention.
> 
> 
> -- 
> Best Regards,
> 
> 
> Seth Johnson
> Senior Bioinformatics Associate
> 
> Ph: (202) 470-0900
> Fx: (775) 251-0358
> 
> On 6/2/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > Hi Seth.
> >
> > Your second point, about the authors string not being read correctly in
> > Genbank format, has been fixed (or should have been if I got the code
> > right!). Could you check the latest version of biojava-live out of CVS
> > and give it another go? Basically the parser did not recognise the
> > CONSRTM tag, as it is not mentioned in the sample record provided by
> > NCBI, which is what I based the parser on.
> ...
> >
> > cheers,
> > Richard
> >
> >
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416