[Biojava-l] Genbank parser error [biojavax]
Morgane THOMAS-CHOLLIER
mthomasc at vub.ac.be
Wed Feb 15 05:04:22 EST 2006
Hi Mark,
I have downloaded the fixed version and tested it with my large file.
Works great.
Thank you very much,
Morgane.
mark.schreiber at novartis.com wrote:
>Hi Morgane -
>
>Turned out to be a problem with a greedy regexp parsing the LOCUS tag.
>This is fixed in CVS. Let me know if something else is a problem.
>
>- Mark
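
For readers unfamiliar with the failure mode Mark describes, here is a minimal sketch of how a greedy quantifier can mis-split a LOCUS line. Both patterns are hypothetical and written for illustration only; they are not the expressions used in GenbankFormat.java, and the actual fix committed to CVS may look quite different. The input is the text that follows the LOCUS keyword in the small test record from this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: hypothetical patterns, NOT the ones in GenbankFormat.java.
class GreedyLocusDemo {
    // Greedy: (.*) first swallows the whole line, then gives back only as
    // many characters as it must for the rest of the pattern to match.
    static final Pattern GREEDY = Pattern.compile("^(.*)\\s*(\\d+)\\s+bp\\b");

    // Stricter: the name is one run of non-whitespace characters, so it
    // cannot eat into the digits of the sequence length.
    static final Pattern STRICT = Pattern.compile("^(\\S+)\\s+(\\d+)\\s+bp\\b");

    // Returns "name|length" for the first match, or null if nothing matches.
    static String parse(Pattern pattern, String locusBody) {
        Matcher m = pattern.matcher(locusBody);
        return m.find() ? m.group(1) + "|" + m.group(2) : null;
    }

    public static void main(String[] args) {
        String body = "6 3498 bp DNA HTG 14-FEB-2006";
        System.out.println(parse(GREEDY, body)); // prints 6 349|8
        System.out.println(parse(STRICT, body)); // prints 6|3498
    }
}
```

With the greedy pattern the name group absorbs most of the length, leaving only the final digit for the length group. Constraining the name to non-whitespace characters removes the ambiguity without needing a reluctant quantifier.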
>
>
>
>
>
>Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>Sent by: biojava-l-bounces at portal.open-bio.org
>02/14/2006 09:33 PM
>
>
> To: biojava-l at biojava.org
> cc: (bcc: Mark Schreiber/GP/Novartis)
> Subject: Re: [Biojava-l] Genbank parser error [biojavax]
>
>
>Hello Mark,
>
>My file is indeed too large to be posted.
>So I have exported a smaller sequence from Ensembl and tested it with
>the parser. The behavior is the same.
>You will find this "Genbank"-formatted file below.
>
>Thanks for your help,
>
>Morgane.
>
>LOCUS 6 3498 bp DNA HTG 14-FEB-2006
>DEFINITION Mus musculus chromosome 6 NCBIM34 partial sequence
> 52305503..52309000 reannotated via EnsEMBL
>ACCESSION chromosome:NCBIM34:6:52305503:52309000:1
>VERSION chromosome:NCBIM34:6:52305503:52309000:1
>KEYWORDS .
>SOURCE House mouse
> ORGANISM Mus musculus
> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
> Sciurognathi; Muridae; Murinae; Mus.
>COMMENT This sequence was annotated by the Ensembl system. Please visit the
> Ensembl web site, http://www.ensembl.org/ for more information.
>COMMENT All feature locations are relative to the first (5') base of the
> sequence in this file. The sequence presented is always the
> forward strand of the assembly. Features that lie outside of the
> sequence contained in this file have clonal location coordinates in
> the format: .:..
>COMMENT The /gene indicates a unique id for a gene,
> /note="transcript_id=..." a unique id for a transcript, /protein_id
> a unique id for a peptide and note="exon_id=..." a unique id for an
> exon. These ids are maintained wherever possible between versions.
>COMMENT All the exons and transcripts in Ensembl are confirmed by
> similarity to either protein or cDNA sequences.
>FEATURES Location/Qualifiers
> source 1..3498
> /organism="Mus musculus"
> /db_xref="taxon:10090"
> gene complement(506..2826)
> /gene=ENSMUSG00000014704
> mRNA join(complement(2261..2826),complement(506..1620))
> /gene="ENSMUSG00000014704"
> /note="transcript_id=ENSMUST00000014848"
> CDS join(complement(2261..2639),complement(881..1620))
> /gene="ENSMUSG00000014704"
> /protein_id="ENSMUSP00000014848"
> /note="transcript_id=ENSMUST00000014848"
> /db_xref="MarkerSymbol:Hoxa2"
> /db_xref="Uniprot/SWISSPROT:HXA2_MOUSE"
> /db_xref="RefSeq_peptide:NP_034581.1"
> /db_xref="RefSeq_dna:NM_010451.1"
> /db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE"
> /db_xref="EntrezGene:15399"
> /db_xref="AgilentProbe:A_51_P501803"
> /db_xref="EMBL:AB039184"
> /db_xref="EMBL:AB039185"
> /db_xref="EMBL:AB039186"
> /db_xref="EMBL:AB039187"
> /db_xref="EMBL:AB039188"
> /db_xref="EMBL:AB039189"
> /db_xref="EMBL:AB039190"
> /db_xref="EMBL:AB039191"
> /db_xref="EMBL:AB039192"
> /db_xref="EMBL:AK134501"
> /db_xref="EMBL:M87801"
> /db_xref="EMBL:M93148"
> /db_xref="EMBL:M93292"
> /db_xref="EMBL:M95599"
> /db_xref="GO:GO:0003700"
> /db_xref="GO:GO:0005634"
> /db_xref="GO:GO:0006355"
> /db_xref="GO:GO:0007275"
> /db_xref="IPI:IPI00132242.1"
> /db_xref="UniGene:Mm.131"
> /db_xref="protein_id:AAA37827.1"
> /db_xref="protein_id:AAA37834.1"
> /db_xref="protein_id:AAA37835.1"
> /db_xref="protein_id:AAA37836.1"
> /db_xref="protein_id:BAB68708.1"
> /db_xref="protein_id:BAB68709.1"
> /db_xref="protein_id:BAB68710.1"
> /db_xref="protein_id:BAB68711.1"
> /db_xref="protein_id:BAB68712.1"
> /db_xref="protein_id:BAB68713.1"
> /db_xref="protein_id:BAB68714.1"
> /db_xref="protein_id:BAB68715.1"
> /db_xref="protein_id:BAB68716.1"
> /db_xref="protein_id:BAE22163.1"
> /db_xref="AFFY_MG_U74Av2:102643_at"
> /db_xref="AFFY_MG_U74Cv2:171063_at"
> /db_xref="AFFY_Mouse430A_2:1419602_at"
> /db_xref="AFFY_Mouse430_2:1419602_at"
> /translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH
> STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE
> KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF
> NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK
> VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN
> EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD
> ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY"
> exon complement(506..1620)
> /note="exon_id=ENSMUSE00000387033"
> exon complement(2261..2826)
> /note="exon_id=ENSMUSE00000193269"
>BASE COUNT 938 a 815 c 882 g 863 t
>ORIGIN
> 1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA ATTTTTGATA
> 61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC ACTCCACTCG
> 121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG CTTGGGCTAG
> 181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG GCCTGAGTCT
> 241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA AAAAAAAAAA
> 301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT TTGTTGCAGG
> 361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG TGACCAGACT
> 421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT TGAGAAAGAG
> 481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA CCAAAAATAC
> 541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG ACAATTTATG
> 601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA AGCTTGTTGG
> 661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC TTTAAAACTG
> 721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG GGTAGATCAA
> 781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA CCTGGTCAAA
> 841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC AGATGCTGTA
> 901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG ATATCTACAG
> 961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC AGGCAGGAAT
> 1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG GGACTGTCAT
> 1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA ACAGTGGGTG
> 1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA ACTGGGAAAG
> 1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA TTTTGCTGAA
> 1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC TCAAAGAGTG
> 1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA AATTTCCCTT
> 1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC CGGTTCTGAA
> 1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC ACCCTGCGGG
> 1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC TGAGTGTTGG
> 1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT TCCAGGGATT
> 1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG GGTCCGAGCA
> 1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA AATGGCCGCC
> 1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG GGAAGCCCAG
> 1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC ATCCGGGAGC
> 1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT AGCTGAGCAA
> 1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA CTAGACAAGA
> 1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG AAAGTGCCCC
> 2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC CCTCCACCAA
> 2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC TCTCTCCCCC
> 2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC CGGAGGGGGA
> 2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC GAGGCAGGCA
> 2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT CTTCTCCTTC
> 2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC GCGACTGCCC
> 2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG ACTGCCCGGG
> 2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG TGAAAGCGTC
> 2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA TGTCAGGCAC
> 2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC GTAATTCATG
> 2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG CTTTGGGGGG
> 2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG AAGATCGCTG
> 2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG CTACTATTAA
> 2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA CATGATTGCT
> 2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT GATTGATCCA
> 2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC ACTTTTTTTC
> 3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC GTGGGGGGCG
> 3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA GTGTGTGTGT
> 3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG CCTCCCCCGC
> 3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA AATCATTTAA
> 3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT CAAAGTTTTG
> 3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG AAAGGAGCAG
> 3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA GAGAGAGAGA
> 3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC TCTTCCTCCT
> 3481 CTTTTTCCAA AATCAGTT
>//
>
>
>
>
>mark.schreiber at novartis.com wrote:
>
>
>
>>Hi Morgane -
>>
>>I have to say that doesn't look much like Genbank : )
>>
>>The biojavax parsers are possibly a bit brittle due to their use of regexps
>>to recognize key elements. It should be fixable; I think the problem is
>>that the parser expects a word after LOCUS, not a number. This may not be
>>the only problem though. Could you post the entire file? Or, if it is large,
>>then a representative file of smaller size.
>>
>>- Mark
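
One way to make LOCUS handling less brittle than a single rigid regexp is to split the line on whitespace and validate each field separately, so that a numeric name such as "6" is accepted like any other token. The sketch below is an assumption-laden illustration, not biojavax code: the class `LocusLine` and its fields are hypothetical, it only interprets the name, length and unit, and it leaves the remaining tokens (molecule type, division, date and so on) uninterpreted.

```java
import java.util.Arrays;

// Hypothetical helper, not part of biojavax: tokenizes a LOCUS line on
// whitespace instead of matching it against one regular expression.
class LocusLine {
    final String name;    // any non-whitespace token, e.g. "6" or "AB039184"
    final int length;     // sequence length
    final String unit;    // "bp" or "aa"
    final String[] rest;  // remaining tokens, left uninterpreted here

    private LocusLine(String name, int length, String unit, String[] rest) {
        this.name = name;
        this.length = length;
        this.unit = unit;
        this.rest = rest;
    }

    static LocusLine parse(String line) {
        String[] t = line.trim().split("\\s+");
        if (t.length < 4 || !t[0].equals("LOCUS")) {
            throw new IllegalArgumentException("Bad locus line found: " + line);
        }
        int length;
        try {
            length = Integer.parseInt(t[2]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Bad sequence length in locus line: " + line, e);
        }
        if (!t[3].equals("bp") && !t[3].equals("aa")) {
            throw new IllegalArgumentException("Bad length unit in locus line: " + line);
        }
        return new LocusLine(t[1], length, t[3], Arrays.copyOfRange(t, 4, t.length));
    }

    public static void main(String[] args) {
        LocusLine locus = parse("LOCUS 6 489671 bp DNA HTG 13-FEB-2006");
        System.out.println(locus.name + " " + locus.length + " " + locus.unit);
    }
}
```

The trade-off is that a tokenizer like this cannot handle names containing spaces, which a stricter column-based reading of the format would also reject.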
>>
>>
>>
>>
>>
>>Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>>Sent by: biojava-l-bounces at portal.open-bio.org
>>02/14/2006 04:36 AM
>>
>>
>> To: biojava-l at biojava.org
>> cc: (bcc: Mark Schreiber/GP/Novartis)
>> Subject: [Biojava-l] Genbank parser error [biojavax]
>>
>>
>>Hello,
>>
>>I tried biojavax today with a view to using the Genbank file parser.
>>
>>My test file is a Genbank-formatted file produced by the Ensembl
>>export system.
>>
>>The head of the file is as follows:
>>
>>LOCUS 6 489671 bp DNA HTG 13-FEB-2006
>>DEFINITION Mus musculus chromosome 6 NCBIM34 partial sequence
>> 52296503..52786173 reannotated via EnsEMBL
>>ACCESSION chromosome:NCBIM34:6:52296503:52786173:1
>>VERSION chromosome:NCBIM34:6:52296503:52786173:1
>>
>>I used the code provided in the biojavax docbook to parse this file.
>>I get the following error:
>>
>>Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
>>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
>>     at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
>>Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
>>     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
>>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
>>     ... 1 more
>>
>>I had a look at GenbankFormat.java, and I guess the problem comes from
>>the regular expression, which does not recognize this LOCUS line as a
>>standard Genbank LOCUS tag.
>>
>>Am I wrong? Has the biojavax Genbank parser been tested on Ensembl
>>exported files?
>>
>>Morgane.
>>
--
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student (mthomasc at vub.ac.be)
Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium
Tel : +32 2 629 15 22
**********************************************************
Stop Using Internet Explorer, choose FIREFOX !
http://emmanuel.clement.free.fr/navigateurs/comparatif.htm