[Biojava-l] Genbank parser error [biojavax]
Morgane THOMAS-CHOLLIER
mthomasc at vub.ac.be
Wed Feb 15 03:56:53 EST 2006
Hello again,
I have continued using the Genbank parser, but this time with Genbank
files coming from NCBI :)
I really appreciate the example from the documentation that converts a
Genbank file into an EMBL file. I have to say, it is really easy to use.
I nevertheless have a question concerning the Organism and Source tags.
Indeed, it is clear in the documentation that they are ignored by the
parser.
But I do not really understand why.
When I used the Genbank files for the accession numbers AC147788 and
DQ158013, I was unable to get the common name of the organism or use
getNameHierarchy(), but I could get the taxon ID for both.
Is there a way to get the common name of the organism without making a
remote call to NCBI with the taxon ID?
Thanks for your help,
Morgane.
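
One possible workaround for the question above, while the parser ignores the SOURCE and ORGANISM tags, is to read those header lines directly from the flat file. The following is only a minimal standalone sketch and not part of the biojavax API: the class name GenbankOrganismSketch and its methods are hypothetical, and it assumes the file still has the usual GenBank column layout (top-level keywords in column 1, lineage continuation lines indented), which the whitespace-collapsed copy quoted below no longer shows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not biojavax: extracts SOURCE / ORGANISM text from a GenBank header. */
final class GenbankOrganismSketch {

    /** The three pieces of information carried by the SOURCE block. */
    static final class Organism {
        final String commonName;      // free-form text after the SOURCE keyword
        final String scientificName;  // text after the ORGANISM keyword
        final List<String> lineage;   // taxonomic lineage, outermost rank first

        Organism(String commonName, String scientificName, List<String> lineage) {
            this.commonName = commonName;
            this.scientificName = scientificName;
            this.lineage = lineage;
        }
    }

    /** Scans header lines; assumes indented lines after ORGANISM hold the lineage. */
    static Organism parse(List<String> lines) {
        String common = null;
        String scientific = null;
        StringBuilder joined = new StringBuilder();
        boolean inOrganism = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (raw.startsWith("SOURCE")) {
                common = line.substring("SOURCE".length()).trim();
                inOrganism = false;
            } else if (line.startsWith("ORGANISM")) {
                scientific = line.substring("ORGANISM".length()).trim();
                inOrganism = true;
            } else if (inOrganism && !raw.isEmpty() && Character.isWhitespace(raw.charAt(0))) {
                // Indented continuation line: part of the lineage.
                if (joined.length() > 0) {
                    joined.append(' ');
                }
                joined.append(line);
            } else {
                // A new top-level keyword (or a blank line) ends the ORGANISM block.
                inOrganism = false;
            }
        }
        return new Organism(common, scientific, splitLineage(joined.toString()));
    }

    /** Splits "A; B; C." into [A, B, C], dropping the terminating period. */
    static List<String> splitLineage(String text) {
        List<String> ranks = new ArrayList<>();
        String cleaned = text.endsWith(".") ? text.substring(0, text.length() - 1) : text;
        for (String part : cleaned.split(";")) {
            String rank = part.trim();
            if (!rank.isEmpty()) {
                ranks.add(rank);
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Header values taken from the record quoted in this thread; indentation is assumed.
        List<String> header = Arrays.asList(
            "SOURCE      House mouse",
            "  ORGANISM  Mus musculus",
            "            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;",
            "            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;",
            "            Sciurognathi; Muridae; Murinae; Mus.",
            "COMMENT     This sequence was annotated by the Ensembl system.");
        Organism o = parse(header);
        System.out.println(o.commonName + " / " + o.scientificName + " / " + o.lineage);
    }
}
```

The SOURCE text is free-form, so what this returns as a "common name" depends entirely on how the file's producer filled in that line.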
Morgane THOMAS-CHOLLIER wrote:
> Hello Mark,
>
> My file is indeed too large to be posted.
> So I have exported a smaller sequence from Ensembl that I tested with
> the parser. The behavior is the same.
> You will find below this "Genbank" formatted file enclosed.
>
> Thanks for your help,
>
> Morgane.
>
> LOCUS 6 3498 bp DNA HTG 14-FEB-2006
> DEFINITION Mus musculus chromosome 6 NCBIM34 partial sequence
> 52305503..52309000 reannotated via EnsEMBL
> ACCESSION chromosome:NCBIM34:6:52305503:52309000:1
> VERSION chromosome:NCBIM34:6:52305503:52309000:1
> KEYWORDS .
> SOURCE House mouse
> ORGANISM Mus musculus
> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
> Sciurognathi; Muridae; Murinae; Mus.
> COMMENT This sequence was annotated by the Ensembl system. Please visit the
> Ensembl web site, http://www.ensembl.org/ for more information.
> COMMENT All feature locations are relative to the first (5') base of the
> sequence in this file. The sequence presented is always the
> forward strand of the assembly. Features that lie outside of the
> sequence contained in this file have clonal location coordinates in
> the format: .:..
> COMMENT The /gene indicates a unique id for a gene,
> /note="transcript_id=..." a unique id for a transcript, /protein_id
> a unique id for a peptide and note="exon_id=..." a unique id for an
> exon. These ids are maintained wherever possible between versions.
> COMMENT All the exons and transcripts in Ensembl are confirmed by
> similarity to either protein or cDNA sequences.
> FEATURES Location/Qualifiers
> source 1..3498
> /organism="Mus musculus"
> /db_xref="taxon:10090"
> gene complement(506..2826)
> /gene=ENSMUSG00000014704
> mRNA join(complement(2261..2826),complement(506..1620))
> /gene="ENSMUSG00000014704"
> /note="transcript_id=ENSMUST00000014848"
> CDS join(complement(2261..2639),complement(881..1620))
> /gene="ENSMUSG00000014704"
> /protein_id="ENSMUSP00000014848"
> /note="transcript_id=ENSMUST00000014848"
> /db_xref="MarkerSymbol:Hoxa2"
> /db_xref="Uniprot/SWISSPROT:HXA2_MOUSE"
> /db_xref="RefSeq_peptide:NP_034581.1"
> /db_xref="RefSeq_dna:NM_010451.1"
> /db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE"
> /db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE"
> /db_xref="EntrezGene:15399"
> /db_xref="AgilentProbe:A_51_P501803"
> /db_xref="EMBL:AB039184"
> /db_xref="EMBL:AB039185"
> /db_xref="EMBL:AB039186"
> /db_xref="EMBL:AB039187"
> /db_xref="EMBL:AB039188"
> /db_xref="EMBL:AB039189"
> /db_xref="EMBL:AB039190"
> /db_xref="EMBL:AB039191"
> /db_xref="EMBL:AB039192"
> /db_xref="EMBL:AK134501"
> /db_xref="EMBL:M87801"
> /db_xref="EMBL:M93148"
> /db_xref="EMBL:M93292"
> /db_xref="EMBL:M95599"
> /db_xref="GO:GO:0003700"
> /db_xref="GO:GO:0005634"
> /db_xref="GO:GO:0006355"
> /db_xref="GO:GO:0007275"
> /db_xref="IPI:IPI00132242.1"
> /db_xref="UniGene:Mm.131"
> /db_xref="protein_id:AAA37827.1"
> /db_xref="protein_id:AAA37834.1"
> /db_xref="protein_id:AAA37835.1"
> /db_xref="protein_id:AAA37836.1"
> /db_xref="protein_id:BAB68708.1"
> /db_xref="protein_id:BAB68709.1"
> /db_xref="protein_id:BAB68710.1"
> /db_xref="protein_id:BAB68711.1"
> /db_xref="protein_id:BAB68712.1"
> /db_xref="protein_id:BAB68713.1"
> /db_xref="protein_id:BAB68714.1"
> /db_xref="protein_id:BAB68715.1"
> /db_xref="protein_id:BAB68716.1"
> /db_xref="protein_id:BAE22163.1"
> /db_xref="AFFY_MG_U74Av2:102643_at"
> /db_xref="AFFY_MG_U74Cv2:171063_at"
> /db_xref="AFFY_Mouse430A_2:1419602_at"
> /db_xref="AFFY_Mouse430_2:1419602_at"
> /translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH
> STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE
> KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF
> NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK
> VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN
> EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD
> ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY"
> exon complement(506..1620)
> /note="exon_id=ENSMUSE00000387033"
> exon complement(2261..2826)
> /note="exon_id=ENSMUSE00000193269"
> BASE COUNT 938 a 815 c 882 g 863 t
> ORIGIN
> 1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA
> ATTTTTGATA
> 61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC
> ACTCCACTCG
> 121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG
> CTTGGGCTAG
> 181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG
> GCCTGAGTCT
> 241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA
> AAAAAAAAAA
> 301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT
> TTGTTGCAGG
> 361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG
> TGACCAGACT
> 421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT
> TGAGAAAGAG
> 481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA
> CCAAAAATAC
> 541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG
> ACAATTTATG
> 601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA
> AGCTTGTTGG
> 661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC
> TTTAAAACTG
> 721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG
> GGTAGATCAA
> 781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA
> CCTGGTCAAA
> 841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC
> AGATGCTGTA
> 901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG
> ATATCTACAG
> 961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC
> AGGCAGGAAT
> 1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG
> GGACTGTCAT
> 1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA
> ACAGTGGGTG
> 1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA
> ACTGGGAAAG
> 1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA
> TTTTGCTGAA
> 1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC
> TCAAAGAGTG
> 1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA
> AATTTCCCTT
> 1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC
> CGGTTCTGAA
> 1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC
> ACCCTGCGGG
> 1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC
> TGAGTGTTGG
> 1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT
> TCCAGGGATT
> 1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG
> GGTCCGAGCA
> 1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA
> AATGGCCGCC
> 1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG
> GGAAGCCCAG
> 1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC
> ATCCGGGAGC
> 1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT
> AGCTGAGCAA
> 1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA
> CTAGACAAGA
> 1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG
> AAAGTGCCCC
> 2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC
> CCTCCACCAA
> 2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC
> TCTCTCCCCC
> 2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC
> CGGAGGGGGA
> 2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC
> GAGGCAGGCA
> 2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT
> CTTCTCCTTC
> 2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC
> GCGACTGCCC
> 2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG
> ACTGCCCGGG
> 2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG
> TGAAAGCGTC
> 2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA
> TGTCAGGCAC
> 2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC
> GTAATTCATG
> 2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG
> CTTTGGGGGG
> 2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG
> AAGATCGCTG
> 2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG
> CTACTATTAA
> 2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA
> CATGATTGCT
> 2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT
> GATTGATCCA
> 2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC
> ACTTTTTTTC
> 3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC
> GTGGGGGGCG
> 3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA
> GTGTGTGTGT
> 3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG
> CCTCCCCCGC
> 3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA
> AATCATTTAA
> 3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT
> CAAAGTTTTG
> 3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG
> AAAGGAGCAG
> 3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA
> GAGAGAGAGA
> 3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC
> TCTTCCTCCT
> 3481 CTTTTTCCAA AATCAGTT
> //
>
>
>
>
> mark.schreiber at novartis.com wrote:
>
>> Hi Morgane -
>>
>> I have to say that doesn't look much like Genbank : )
>>
>> The biojavax parsers are possibly a bit brittle due to their use of
>> regexps to recognize key elements. It should be fixable; I think the
>> problem is that the parser expects a word after LOCUS, not a number.
>> This may not be the only problem though. Could you post the entire
>> file? Or if it is large then a representative file of smaller size.
>>
>> - Mark
>>
>>
>>
>>
>>
>> Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>> Sent by: biojava-l-bounces at portal.open-bio.org
>> 02/14/2006 04:36 AM
>>
>>
>> To: biojava-l at biojava.org
>> cc: (bcc: Mark Schreiber/GP/Novartis)
>> Subject: [Biojava-l] Genbank parser error [biojavax]
>>
>>
>> Hello,
>>
>> I have tried biojavax today with a view to use the Genbank file parser.
>>
>> My test file is a Genbank formatted file which has been produced by
>> Ensembl export system.
>>
>> The head of the file is as follows:
>>
>> LOCUS 6 489671 bp DNA HTG 13-FEB-2006
>> DEFINITION Mus musculus chromosome 6 NCBIM34 partial sequence
>> 52296503..52786173 reannotated via EnsEMBL
>> ACCESSION chromosome:NCBIM34:6:52296503:52786173:1
>> VERSION chromosome:NCBIM34:6:52296503:52786173:1
>>
>> I used the code provided in biojavax docbook to parse this file.
>> I get the following error :
>>
>> Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
>>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
>>     at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
>> Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
>>     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
>>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
>>     ... 1 more
>>
>> I had a look at GenbankFormat.java, and I guess the problem comes
>> from the regular expression, which does not recognize the LOCUS line
>> as a standard Genbank LOCUS tag.
>>
>> Am I wrong? Has the biojavax Genbank parser been tested on Ensembl
>> exported files?
>>
>> Morgane.
>>
>>
>>
>
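
The LOCUS failure discussed in the quoted thread can be illustrated with a small standalone sketch. The two patterns below are hypothetical and are not the regular expression actually used in GenbankFormat.java; they only show how a pattern that requires the locus name to start with a letter would reject the Ensembl line, while one accepting any non-whitespace token would not. The class name LocusLineSketch and the "NCBI-style" sample line are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the pattern used by biojavax. */
final class LocusLineSketch {

    // Assumed strict form: the locus name must begin with a letter.
    static final Pattern STRICT =
        Pattern.compile("^LOCUS\\s+([A-Za-z]\\S*)\\s+(\\d+)\\s+bp\\s+.*$");

    // Lenient form: any run of non-whitespace characters is accepted as the name.
    static final Pattern LENIENT =
        Pattern.compile("^LOCUS\\s+(\\S+)\\s+(\\d+)\\s+bp\\s+.*$");

    /** Returns the locus name if the pattern accepts the line, otherwise null. */
    static String locusName(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // LOCUS line from the Ensembl export quoted in this thread.
        String ensembl = "LOCUS 6 489671 bp DNA HTG 13-FEB-2006";
        // Made-up line whose locus name starts with a letter.
        String lettered = "LOCUS ABC123 100 bp DNA PRI 01-JAN-2000";

        System.out.println("strict,  Ensembl:  " + locusName(STRICT, ensembl));
        System.out.println("lenient, Ensembl:  " + locusName(LENIENT, ensembl));
        System.out.println("strict,  lettered: " + locusName(STRICT, lettered));
    }
}
```

Whether loosening the name token is the right fix for the real parser depends on what else its pattern captures from the LOCUS line, which this sketch does not attempt to reproduce.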
--
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student (mthomasc at vub.ac.be)
Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium