[Biojava-l] Genbank parser error [biojavax]

Morgane THOMAS-CHOLLIER mthomasc at vub.ac.be
Tue Feb 14 08:33:02 EST 2006


Hello Mark,

My file is indeed too large to post.
So I exported a smaller sequence from Ensembl and tested it with 
the parser. The behavior is the same.
The "Genbank"-formatted file is included below.

Thanks for your help,

Morgane.

LOCUS       6 3498 bp DNA HTG 14-FEB-2006
DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
            52305503..52309000 reannotated via EnsEMBL
ACCESSION   chromosome:NCBIM34:6:52305503:52309000:1
VERSION     chromosome:NCBIM34:6:52305503:52309000:1
KEYWORDS    .
SOURCE      House mouse
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muridae; Murinae; Mus.
COMMENT     This sequence was annotated by the Ensembl system. Please visit the
            Ensembl web site, http://www.ensembl.org/ for more information.
COMMENT     All feature locations are relative to the first (5') base of the
            sequence in this file.  The sequence presented is always the
            forward strand of the assembly. Features that lie outside of the
            sequence contained in this file have clonal location coordinates in
            the format: .:..
COMMENT     The /gene indicates a unique id for a gene,
            /note="transcript_id=..." a unique id for a transcript, /protein_id
            a unique id for a peptide and note="exon_id=..." a unique id for an
            exon. These ids are maintained wherever possible between versions.
COMMENT     All the exons and transcripts in Ensembl are confirmed by
            similarity to either protein or cDNA sequences.
FEATURES             Location/Qualifiers
     source          1..3498
                     /organism="Mus musculus"
                     /db_xref="taxon:10090"
     gene            complement(506..2826)
                     /gene=ENSMUSG00000014704
     mRNA            join(complement(2261..2826),complement(506..1620))
                     /gene="ENSMUSG00000014704"
                     /note="transcript_id=ENSMUST00000014848"
     CDS             join(complement(2261..2639),complement(881..1620))
                     /gene="ENSMUSG00000014704"
                     /protein_id="ENSMUSP00000014848"
                     /note="transcript_id=ENSMUST00000014848"
                     /db_xref="MarkerSymbol:Hoxa2"
                     /db_xref="Uniprot/SWISSPROT:HXA2_MOUSE"
                     /db_xref="RefSeq_peptide:NP_034581.1"
                     /db_xref="RefSeq_dna:NM_010451.1"
                     /db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE"
                     /db_xref="EntrezGene:15399"
                     /db_xref="AgilentProbe:A_51_P501803"
                     /db_xref="EMBL:AB039184"
                     /db_xref="EMBL:AB039185"
                     /db_xref="EMBL:AB039186"
                     /db_xref="EMBL:AB039187"
                     /db_xref="EMBL:AB039188"
                     /db_xref="EMBL:AB039189"
                     /db_xref="EMBL:AB039190"
                     /db_xref="EMBL:AB039191"
                     /db_xref="EMBL:AB039192"
                     /db_xref="EMBL:AK134501"
                     /db_xref="EMBL:M87801"
                     /db_xref="EMBL:M93148"
                     /db_xref="EMBL:M93292"
                     /db_xref="EMBL:M95599"
                     /db_xref="GO:GO:0003700"
                     /db_xref="GO:GO:0005634"
                     /db_xref="GO:GO:0006355"
                     /db_xref="GO:GO:0007275"
                     /db_xref="IPI:IPI00132242.1"
                     /db_xref="UniGene:Mm.131"
                     /db_xref="protein_id:AAA37827.1"
                     /db_xref="protein_id:AAA37834.1"
                     /db_xref="protein_id:AAA37835.1"
                     /db_xref="protein_id:AAA37836.1"
                     /db_xref="protein_id:BAB68708.1"
                     /db_xref="protein_id:BAB68709.1"
                     /db_xref="protein_id:BAB68710.1"
                     /db_xref="protein_id:BAB68711.1"
                     /db_xref="protein_id:BAB68712.1"
                     /db_xref="protein_id:BAB68713.1"
                     /db_xref="protein_id:BAB68714.1"
                     /db_xref="protein_id:BAB68715.1"
                     /db_xref="protein_id:BAB68716.1"
                     /db_xref="protein_id:BAE22163.1"
                     /db_xref="AFFY_MG_U74Av2:102643_at"
                     /db_xref="AFFY_MG_U74Cv2:171063_at"
                     /db_xref="AFFY_Mouse430A_2:1419602_at"
                     /db_xref="AFFY_Mouse430_2:1419602_at"
                     /translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH
                     STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE
                     KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF
                     NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK
                     VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN
                     EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD
                     ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY"
     exon            complement(506..1620)
                     /note="exon_id=ENSMUSE00000387033"
     exon            complement(2261..2826)
                     /note="exon_id=ENSMUSE00000193269"
BASE COUNT  938 a 815 c 882 g 863 t
ORIGIN
        1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA ATTTTTGATA
       61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC ACTCCACTCG
      121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG CTTGGGCTAG
      181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG GCCTGAGTCT
      241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA AAAAAAAAAA
      301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT TTGTTGCAGG
      361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG TGACCAGACT
      421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT TGAGAAAGAG
      481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA CCAAAAATAC
      541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG ACAATTTATG
      601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA AGCTTGTTGG
      661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC TTTAAAACTG
      721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG GGTAGATCAA
      781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA CCTGGTCAAA
      841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC AGATGCTGTA
      901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG ATATCTACAG
      961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC AGGCAGGAAT
     1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG GGACTGTCAT
     1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA ACAGTGGGTG
     1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA ACTGGGAAAG
     1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA TTTTGCTGAA
     1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC TCAAAGAGTG
     1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA AATTTCCCTT
     1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC CGGTTCTGAA
     1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC ACCCTGCGGG
     1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC TGAGTGTTGG
     1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT TCCAGGGATT
     1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG GGTCCGAGCA
     1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA AATGGCCGCC
     1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG GGAAGCCCAG
     1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC ATCCGGGAGC
     1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT AGCTGAGCAA
     1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA CTAGACAAGA
     1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG AAAGTGCCCC
     2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC CCTCCACCAA
     2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC TCTCTCCCCC
     2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC CGGAGGGGGA
     2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC GAGGCAGGCA
     2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT CTTCTCCTTC
     2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC GCGACTGCCC
     2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG ACTGCCCGGG
     2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG TGAAAGCGTC
     2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA TGTCAGGCAC
     2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC GTAATTCATG
     2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG CTTTGGGGGG
     2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG AAGATCGCTG
     2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG CTACTATTAA
     2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA CATGATTGCT
     2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT GATTGATCCA
     2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC ACTTTTTTTC
     3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC GTGGGGGGCG
     3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA GTGTGTGTGT
     3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG CCTCCCCCGC
     3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA AATCATTTAA
     3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT CAAAGTTTTG
     3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG AAAGGAGCAG
     3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA GAGAGAGAGA
     3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC TCTTCCTCCT
     3481 CTTTTTCCAA AATCAGTT
//




mark.schreiber at novartis.com wrote:

>Hi Morgane -
>
>I have to say that doesn't look much like Genbank : )
>
>The biojavax parsers are possibly a bit brittle due to their use of regexps 
>to recognize key elements. It should be fixable; I think the problem is 
>that the parser expects a word after LOCUS, not a number. This may not be 
>the only problem, though. Could you post the entire file? Or, if it is large, 
>a representative file of smaller size?
>
>- Mark
>
>
>
>
>
>Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>Sent by: biojava-l-bounces at portal.open-bio.org
>02/14/2006 04:36 AM
>
> 
>        To:     biojava-l at biojava.org
>        cc:     (bcc: Mark Schreiber/GP/Novartis)
>        Subject:        [Biojava-l] Genbank  parser error [biojavax]
>
>
>Hello,
>
>I tried biojavax today with a view to using the Genbank file parser.
>
>My test file is a Genbank-formatted file produced by the 
>Ensembl export system.
>
>The head of the file is as follows:
>
>LOCUS       6 489671 bp DNA HTG 13-FEB-2006
>DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
>            52296503..52786173 reannotated via EnsEMBL
>ACCESSION   chromosome:NCBIM34:6:52296503:52786173:1
>VERSION     chromosome:NCBIM34:6:52296503:52786173:1
>
>I used the code provided in the biojavax docbook to parse this file.
>I get the following error:
>
>Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
>    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
>    at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
>Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
>    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
>    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
>    ... 1 more
>
>I had a look at GenbankFormat.java, and I guess the problem comes from 
>the regular expression, which does not recognize this LOCUS line as a 
>standard Genbank LOCUS line.
>
>Am I wrong? Has the biojavax Genbank parser been tested on files 
>exported from Ensembl?
>
>Morgane.
>
>  
>
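
The guess in the thread above (that the LOCUS regular expression wants a
word, not a number, as the locus name) can be tried out in isolation with
plain java.util.regex. This is only a minimal sketch: both patterns below
are assumptions written for illustration and are NOT copied from
GenbankFormat.java, the class name LocusLineSketch is hypothetical, and
the patterns are simplified (they ignore the topology column that NCBI
LOCUS lines usually carry).

```java
import java.util.regex.Pattern;

// Hypothetical sketch: these patterns are illustrative assumptions,
// not the expressions actually used by biojavax's GenbankFormat.
class LocusLineSketch {

    // Assumed "strict" pattern: the locus name must start with a letter.
    static final Pattern STRICT = Pattern.compile(
        "^([A-Za-z]\\S*)\\s+(\\d+)\\s+bp\\s+(\\S+)\\s+(\\S+)\\s+(\\d{2}-[A-Z]{3}-\\d{4})$");

    // Assumed "lenient" pattern: any run of non-whitespace is accepted as the name.
    static final Pattern LENIENT = Pattern.compile(
        "^(\\S+)\\s+(\\d+)\\s+bp\\s+(\\S+)\\s+(\\S+)\\s+(\\d{2}-[A-Z]{3}-\\d{4})$");

    // Returns true if the text following the LOCUS tag matches the whole pattern.
    static boolean matches(Pattern p, String locusBody) {
        return p.matcher(locusBody.trim()).matches();
    }

    public static void main(String[] args) {
        // The text reported in the "Bad locus line found" message.
        String ensembl = "6 489671 bp DNA HTG 13-FEB-2006";
        System.out.println("strict:  " + matches(STRICT, ensembl));
        System.out.println("lenient: " + matches(LENIENT, ensembl));
    }
}
```

With these assumed patterns, the Ensembl line is rejected by STRICT only
because its name is the bare chromosome number "6"; the same line with a
name such as "chr6" passes. Whether the real parser fails for this reason
alone is not established here.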

-- 
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student (mthomasc at vub.ac.be)

Vrije Universiteit Brussels (VUB)    
Laboratory of Cell Genetics          
Pleinlaan 2                          
1050 Brussels                        
Belgium                              

Tel : +32 2 629 15 22                		     
**********************************************************


