From richard.holland at ebi.ac.uk Thu Jun 1 11:26:12 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Thu, 01 Jun 2006 16:26:12 +0100
Subject: [Biojava-l] Error loading ontology terms
In-Reply-To:
References:
Message-ID: <1149175573.3948.78.camel@texas.ebi.ac.uk>

Hi there.

I looked through your stack trace, and the line numbers don't match up
with the current code. I have a strong feeling you may have an
out-of-date version of biojava. Could you double-check that you have the
latest biojava-1.4 version, or are using the biojava-live version built
from CVS? If you can confirm that you are using the latest 1.4 or
biojava-live then it'd be easier to solve this.

Alternatively, you could have an out-of-date version of the BioSQL
schema.

I suspect that your BioSQL or BioJava is out of date because, in the
last stack trace you mention, this exception arises:

java.sql.SQLException: Unknown column 'name' in 'field list'

This shows that BioJava expected to find a column called 'name' in some
table in BioSQL, but that column is not there. This would only happen if
your BioSQL version did not match the version of BioSQL that your
version of BioJava was expecting.

cheers,
Richard

On Thu, 2006-06-01 at 21:32 +0800, Yi-Feng Chang wrote:
> Leif, this looks more like a biojava or biojava-x related problem, so
> I'm resending it to the Biojava list. -hilmar
> ==========================================================================
> Dear All,
> I've checked the biosql archives and found a similar thread
> (http://lists.open-bio.org/pipermail/biojava-l/2005-November/005151.html).
> However, it did not give a specific solution, so I am posting here again
> in the hope that someone can help me.
> I'm using JDK1.5.0_05, Biojava 1.4, Biosql 1.41, and Mysql 5.0 with
> My_connectJ 3.1.
> I was following the demo source provided by biojava-in-anger, except
> for the database connection.
> The exceptions are listed below.
> On the first connection there is a connection error:
> *** Importing a core ontology -- hope this is okay
> *** Importing terms
> Exception in thread "main" org.biojava.bio.BioException: Error connecting to BioSQL database: Connection is closed.
>     at org.biojava.bio.seq.db.biosql.BioSQLSequenceDB.initDb(BioSQLSequenceDB.java:276)
>     at org.biojava.bio.seq.db.biosql.BioSQLSequenceDB.<init>(BioSQLSequenceDB.java:194)
>     at genevote.BioSQLTest.loadSeq(BioSQLTest.java:31)
>     at genevote.BioSQLTest.main(BioSQLTest.java:70)
> Caused by: java.sql.SQLException: Connection is closed.
>     at org.apache.commons.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.checkOpen(PoolingDataSource.java:219)
>     at org.apache.commons.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.createStatement(PoolingDataSource.java:248)
>     at org.biojava.bio.seq.db.biosql.MySQLDBHelper.getInsertID(MySQLDBHelper.java:68)
>     at org.biojava.bio.seq.db.biosql.BioSQLSequenceDB.initDb(BioSQLSequenceDB.java:268)
>     ... 3 more
> Then I tried again and it worked, and I loaded all the sequences in
> GenBank format into the BioSQL db without error.
> But when I tried to extract sequences, an exception occurred again:
> org.biojava.bio.BioException: Error loading ontology terms
>     at org.biojava.bio.seq.db.biosql.OntologySQL.loadOntology(OntologySQL.java:444)
>     at org.biojava.bio.seq.db.biosql.OntologySQL.getOntology(OntologySQL.java:116)
>     at org.biojava.bio.seq.db.biosql.OntologySQL.<init>(OntologySQL.java:413)
>     at org.biojava.bio.seq.db.biosql.OntologySQL.getOntologySQL(OntologySQL.java:72)
>     at org.biojava.bio.seq.db.biosql.BioSQLSequenceDB.initDb(BioSQLSequenceDB.java:240)
>     at org.biojava.bio.seq.db.biosql.BioSQLSequenceDB.<init>(BioSQLSequenceDB.java:194)
>     at genevote.test.loadSeq(test.java:25)
>     at genevote.test.main(test.java:76)
> Caused by: java.sql.SQLException: Unknown column 'name' in 'field list'
>     at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2851)
>     at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1534)
>     at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1625)
>     at com.mysql.jdbc.Connection.execSQL(Connection.java:2297)
>     at com.mysql.jdbc.Connection.execSQL(Connection.java:2226)
>     at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1812)
>     at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1657)
>     at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:205)
>     at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:205)
>     at org.biojava.bio.seq.db.biosql.OntologySQL.loadTerms(OntologySQL.java:339)
>     at org.biojava.bio.seq.db.biosql.OntologySQL.loadOntology(OntologySQL.java:441)
>     ... 7 more
>
> yi-feng chang
>
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From johnson.biotech at gmail.com Thu Jun 1 18:03:43 2006
From: johnson.biotech at gmail.com (Seth Johnson)
Date: Thu, 1 Jun 2006 18:03:43 -0400
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
Message-ID:

Hi All,

I'm a newbie to the whole BioJava(X) API and was hoping to get some
clarification on several issues that I'm having.
I am developing a parser that would take as input "NCBI Incremental
ASN.1 Sequence Updates to Genbank" files
( ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ), gunzip them, and use the
ASN2GB converter
( ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
the resulting sequences to a format parsable by BioJava(X)
( http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
my problems start.

ISSUE 1:
I've tried to parse all of the formats that ASN2GB outputs (GenBank
(default), EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
tiny seq (XML)) using either the BioJava or the BioJavaX API. Only the
GenBank format is recognized, by the
"RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function, with
some exceptions that I'll describe in issue #2. This is the code that
I'm using to parse, for example, the EMBL output:

BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
        SimpleNamespace.class, new Object[]{"gbSpace"});
RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        // Further processing of the RichSequence object from here
    } catch (BioException be) {
        be.printStackTrace();
    }
}

The multi-sequence EMBL file looks like this:
---------------------------------------------------------------------------------
ID   DQ472184   standard; DNA; INV; 546 BP.
XX
AC   DQ472184;
XX
SV   DQ472184.1
DT   15-MAY-2006
XX
DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
DE   complete cds.
XX
KW   .
XX
OS   Trypanosoma cruzi strain CL Brener
OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC   Schizotrypanum.
XX
RN   [1]
RP   1-546
RA   De Melo L.D.B.;
RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL   Unpublished.
XX
RN   [2]
RP   1-546
RA   De Melo L.D.B.;
RT   ;
RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL   21949-900, Brazil
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..546
FT                   /organism="Trypanosoma cruzi strain CL Brener"
FT                   /mol_type="genomic DNA"
FT                   /strain="CL Brener"
FT                   /db_xref="taxon:353153"
FT   gene            <1..>546
FT                   /gene="ARC21"
FT                   /note="TcARC21"
FT   mRNA            <1..>546
FT                   /gene="ARC21"
FT                   /product="actin-related protein 3"
FT   CDS             1..546
FT                   /gene="ARC21"
FT                   /note="actin-binding protein; ARPC3 21 kDa; putative
FT                   member of Arp2/3 complex"
FT                   /codon_start=1
FT                   /product="actin-related protein 3"
FT                   /protein_id="ABF13401.1"
FT                   /db_xref="GI:93360014"
FT                   /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
FT                   EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
FT                   SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
FT                   FPEKDGTGNKFWMAFAKRPFLASS"
     atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg        60
     cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt       120
     gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc       180
     cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg       240
     acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat       300
     tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg       360
     tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca       420
     aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag       480
     aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct       540
     agttag                                                                  546
//
ID   DQ472185   standard; DNA; INV; 543 BP.
XX
AC   DQ472185;
XX
SV   DQ472185.1
DT   15-MAY-2006
XX
DE   Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
DE   complete cds.
XX
KW   .
XX
OS   Trypanosoma cruzi strain CL Brener
OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC   Schizotrypanum.
XX
RN   [1]
RP   1-543
RA   De Melo L.D.B.;
RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL   Unpublished.
XX
RN   [2]
RP   1-543
RA   De Melo L.D.B.;
RT   ;
RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL   21949-900, Brazil
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..543
FT                   /organism="Trypanosoma cruzi strain CL Brener"
FT                   /mol_type="genomic DNA"
FT                   /strain="CL Brener"
FT                   /db_xref="taxon:353153"
FT   gene            <1..>543
FT                   /gene="ARC20"
FT                   /note="TcARC20"
FT   mRNA            <1..>543
FT                   /gene="ARC20"
FT                   /product="actin-related protein 4"
FT   CDS             1..543
FT                   /gene="ARC20"
FT                   /note="actin-binding protein; ARPC4 20 kDa; putative
FT                   member of Arp2/3 complex"
FT                   /codon_start=1
FT                   /product="actin-related protein 4"
FT                   /protein_id="ABF13402.1"
FT                   /db_xref="GI:93360016"
FT                   /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
FT                   LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
FT                   GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
FT                   MKLNVNQRARRAAMEFFLALNFT"
     atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg        60
     tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt       120
     gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata       180
     cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc       240
     atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt       300
     ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga       360
     tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt       420
     attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg       480
     aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca       540
     tga                                                                     543
//
-----------------------------------------------------------------------
I get an exception message "Could Not Read Sequence".
The same thing happens if I use the readINSDSetDNA reader instead of the
readEMBLDNA one with the following INSDSet file (beginning of the file):

[XML tags stripped by the list archive; only the element values remain]
DQ022078
16729
DNA
linear
ENV
15-MAY-2006
15-MAY-2006
Uncultured bacterium WWRS-2005 putative aminoglycoside
phosphotransferase (a3.001), putative oxidoreductase (a3.002), putative
oxidoreductase (a3.003), putative beta-lactamase class C (estA3),
putative permease (a3.005), putative transmembrane signal peptide
(a3.006), thiol-disulfide isomerase (a3.007), histone acetyltransferase
HPA2 (a3.008), putative enzyme (a3.009), putative asparaginase (a3.010),
hypothetical protein (a3.011), hypothetical protein (a3.012), putative
membrane protease subunit (a3.013), putative haloalkane dehalogenase
(a3.014), putative transcriptional regulator (a3.015), putative
peptidyl-dipeptidase Dcp (a3.016), and hypothetical protein (a3.017)
genes, complete cds
DQ022078
gb|DQ022078.1|
gi|71842722
ENV
?
1..16729
Schmeisser,C.
Elend,C.
Streit,W.R.
Isolation and biochemical characterization of two novel metagenome
derived esterases
Appl. Environ. Microbiol. 0:0-0 (2006)
?
1..16729
Schmeisser,C.
Elend,C.
Streit,W.R.
Submitted (29-APR-2005) to the EMBL/GenBank/DDBJ databases. Molekulare
Enzymtechnologie, University Duisburg-Essen, Lotharstrasse 1, Duisburg
D-47057, Germany

So my question is whether ASN2GB produces output that is incompatible
with the BioJava parsers, whether there is a problem with the sequences
themselves, or whether the problem lies with most of the parsers.
Could it be that I'm using the API wrongly for the above formats? The
GenBank parser works as advertised, with the exception described below.

ISSUE #2:
When I try to parse GenBank files using the following code:

BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
        SimpleNamespace.class, new Object[]{"gbSpace"});
RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        // Further processing of the RichSequence object from here
    } catch (BioException be) {
        be.printStackTrace();
    }
}

Genbank file in question:

LOCUS       BC074905                 838 bp    mRNA    linear   PRI 15-APR-2006
DEFINITION  Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
            IMAGE:30915482), complete cds.
ACCESSION   BC074905
VERSION     BC074905.2  GI:50959825
KEYWORDS    MGC.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 838)
  AUTHORS   Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
            Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
            Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
            Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
            Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
            Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
            Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
            Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
            Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
            McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
            Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
            Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
            Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
            Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
            Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
            Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
            Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
            Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
  CONSRTM   Mammalian Gene Collection Program Team
  TITLE     Generation and initial analysis of more than 15,000 full-length
            human and mouse cDNA sequences
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
   PUBMED   12477932
REFERENCE   2  (bases 1 to 838)
  CONSRTM   NIH MGC Project
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2004) National Institutes of Health, Mammalian
            Gene Collection (MGC), Bethesda, MD 20892-2590, USA
  REMARK    NIH-MGC Project URL: http://mgc.nci.nih.gov
COMMENT     On Aug 4, 2004 this sequence version replaced gi:49901832.
            Contact: MGC help desk
            Email: cgapbs-r at mail.nih.gov
            Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
            Center
            cDNA Library Preparation: British Columbia Cancer Research Center
            cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
            DNA Sequencing by: Genome Sequence Centre,
            BC Cancer Agency, Vancouver, BC, Canada
            info at bcgsc.bc.ca
            Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
            Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
            Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
            Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
            Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
            Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
            Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR
            Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
            Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.

            Clone distribution: MGC clone distribution information can be found
            through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
            Series: IRBU Plate: 4 Row: C Column: 3.

            Differences found between this sequence and the human reference
            genome (build 36) are described in misc_difference features below.
FEATURES             Location/Qualifiers
     source          1..838
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /clone="MGC:104038 IMAGE:30915482"
                     /tissue_type="Lung, PCR rescued clones"
                     /clone_lib="NIH_MGC_273"
                     /lab_host="DH10B"
                     /note="Vector: pCR4 Topo TA with reversed insert"
     gene            1..838
                     /gene="KLK14"
                     /note="synonym: KLK-L6"
                     /db_xref="GeneID:43847"
                     /db_xref="HGNC:6362"
                     /db_xref="IMGT/GENE-DB:6362"
                     /db_xref="MIM:606135"
     CDS             49..804
                     /gene="KLK14"
                     /codon_start=1
                     /product="KLK14 protein"
                     /protein_id="AAH74905.1"
                     /db_xref="GI:50959826"
                     /db_xref="GeneID:43847"
                     /db_xref="HGNC:6362"
                     /db_xref="IMGT/GENE-DB:6362"
                     /db_xref="MIM:606135"
                     /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
                     GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
                     YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
                     SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
                     SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
     misc_difference 98
                     /gene="KLK14"
                     /note="'G' in cDNA is 'A' in the human genome; amino acid
                     difference: 'R' in cDNA, 'Q' in the human genome."
     misc_difference 133
                     /gene="KLK14"
                     /note="'T' in cDNA is 'C' in the human genome; amino acid
                     difference: 'Y' in cDNA, 'H' in the human genome."
ORIGIN
        1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
       61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
      121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
      181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
      241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
      301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
      361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
      421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
      481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
      541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
      601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
      661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
      721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
      781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
//

I get the following exception:

java.lang.IllegalArgumentException: Authors string cannot be null
org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
    at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
    at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
    at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
    at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
-----------------------------------------------------------------------
I'm trying to see what could be the problem with this particular
sequence. It looks to me like the AUTHORS portion is not getting parsed
correctly. Any ideas would be greatly appreciated!

--
Best Regards,

Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From andreas.draeger at uni-tuebingen.de Fri Jun 2 01:57:22 2006
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Fri, 02 Jun 2006 07:57:22 +0200
Subject: [Biojava-l] Error loading ontology terms
In-Reply-To: <1149175573.3948.78.camel@texas.ebi.ac.uk>
References: <1149175573.3948.78.camel@texas.ebi.ac.uk>
Message-ID: <447FD342.4090806@uni-tuebingen.de>

Hello,

You can solve this problem just by renaming the column "synonym" in
table "term_synonym" to "name". The reason for changing the name of this
column is that in some database systems the term "synonym" is a reserved
word. So the older version that you are currently using might cause
problems with some database systems. Once you have renamed this column,
BioJava will work fine.

Andreas Dräger

> java.sql.SQLException: Unknown column 'name' in 'field list'
>
>

--
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From richard.holland at ebi.ac.uk Fri Jun 2 05:01:39 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Fri, 02 Jun 2006 10:01:39 +0100
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To:
References:
Message-ID: <1149238900.3948.87.camel@texas.ebi.ac.uk>

Hi Seth.
Your second point, about the authors string not being read correctly in
Genbank format, has been fixed (or should have been if I got the code
right!). Could you check the latest version of biojava-live out of CVS
and give it another go? Basically the parser did not recognise the
CONSRTM tag, as it is not mentioned in the sample record provided by
NCBI, which is what I based the parser on. I've set it up now so that it
reads the CONSRTM tag, but the value is merged with the authors tag with
(consortium) appended. There will still be problems if the consortium
value has commas in it - not sure how to fix this yet.

Your first point is harder to solve because you did not provide a
complete stack trace for the exceptions you are getting. The complete
stack trace would enable me to identify exactly where things are going
wrong and give me a better chance of fixing them. Could you send the
stack trace, and I'll see what I can do.

cheers,
Richard

On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote:
> Hi All,
>
> I'm a newbie to the whole BioJava(X) API and was hoping to get some
> clarification on several issues that I'm having.
> I am developing a parser that would take as input "NCBI Incremental
> ASN.1 Sequence Updates to Genbank" files (
> ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the
> ASN2GB converter (
> ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
> resulting sequences to a format parsable by BioJava(X) (
> http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
> my problems start.
>
> ISSUE 1:
> I've tried to parse all of the formats that ASN2GB outputs ( GenBank
> (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
> tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank
> format is recognized by the
> "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
> some exceptions that I'll describe in issue #2.
> ORIGIN > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > // > > I get the following exception: > > java.lang.IllegalArgumentException: Authors string cannot be null > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ----------------------------------------------------------------------- > > I'm trying to see what could be the problem with this particular > sequence. 
Looks to me like the AUTHORS portion is not getting parsed
> correctly. Any ideas would be greatly appreciated!
>
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From johnson.biotech at gmail.com Fri Jun 2 13:04:59 2006
From: johnson.biotech at gmail.com (Seth Johnson)
Date: Fri, 2 Jun 2006 13:04:59 -0400
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To: <1149238900.3948.87.camel@texas.ebi.ac.uk>
References: <1149238900.3948.87.camel@texas.ebi.ac.uk>
Message-ID:

Hi Richard,

I made sure I have the latest source code from CVS compiled (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy to report that the GenBank issue is solved!

As for EMBL parsing, I apologize for not providing the stack dump for ISSUE #1. Here's the dump of the exception:
--------------------------------------------------------
org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
        at exonhit.parsers.GenBankParser.main(GenBankParser.java:359)
Caused by: java.lang.NumberFormatException: null
        at java.lang.Integer.parseInt(Integer.java:415)
        at java.lang.Integer.parseInt(Integer.java:497)
        at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
        ... 1 more
Java Result: -1
-------------------------------------------------------
Here, again, is the code that I'm using to parse:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BufferedReader gbBR = null;
try {
    gbBR = new BufferedReader(new FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb"));
} catch (FileNotFoundException fnfe) {
    fnfe.printStackTrace();
    System.exit(-1);
}
Namespace gbNspace = (Namespace)
    RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"gbSpace"} );
RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        NCBITaxon myTaxon = rs.getTaxon();
    } catch (BioException be) {
        be.printStackTrace();
        System.exit(-1);
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~
And here's the EMBL file that I'm trying to parse:
+++++++++++++++++++++++++
ID DQ472184 standard; DNA; INV; 546 BP.
XX
AC DQ472184;
XX
SV DQ472184.1
DT 15-MAY-2006
XX
DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
DE complete cds.
XX
KW .
XX
OS Trypanosoma cruzi strain CL Brener
OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC Schizotrypanum.
XX
RN [1]
RP 1-546
RA De Melo L.D.B.;
RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL Unpublished.
XX
RN [2]
RP 1-546
RA De Melo L.D.B.;
RT ;
RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ RL 21949-900, Brazil XX FH Key Location/Qualifiers FH FT source 1..546 FT /organism="Trypanosoma cruzi strain CL Brener" FT /mol_type="genomic DNA" FT /strain="CL Brener" FT /db_xref="taxon:353153" FT gene <1..>546 FT /gene="ARC21" FT /note="TcARC21" FT mRNA <1..>546 FT /gene="ARC21" FT /product="actin-related protein 3" FT CDS 1..546 FT /gene="ARC21" FT /note="actin-binding protein; ARPC3 21 kDa; putative FT member of Arp2/3 complex" FT /codon_start=1 FT /product="actin-related protein 3" FT /protein_id="ABF13401.1" FT /db_xref="GI:93360014" FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL FT FPEKDGTGNKFWMAFAKRPFLASS" atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 agttag 546 // ID DQ472185 standard; DNA; INV; 543 BP. XX AC DQ472185; XX SV DQ472185.1 DT 15-MAY-2006 XX DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, DE complete cds. XX KW . XX OS Trypanosoma cruzi strain CL Brener OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; OC Schizotrypanum. 
XX RN [1] RP 1-543 RA De Melo L.D.B.; RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; RL Unpublished. XX RN [2] RP 1-543 RA De Melo L.D.B.; RT ; RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ RL 21949-900, Brazil XX FH Key Location/Qualifiers FH FT source 1..543 FT /organism="Trypanosoma cruzi strain CL Brener" FT /mol_type="genomic DNA" FT /strain="CL Brener" FT /db_xref="taxon:353153" FT gene <1..>543 FT /gene="ARC20" FT /note="TcARC20" FT mRNA <1..>543 FT /gene="ARC20" FT /product="actin-related protein 4" FT CDS 1..543 FT /gene="ARC20" FT /note="actin-binding protein; ARPC4 20 kDa; putative FT member of Arp2/3 complex" FT /codon_start=1 FT /product="actin-related protein 4" FT /protein_id="ABF13402.1" FT /db_xref="GI:93360016" FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA FT MKLNVNQRARRAAMEFFLALNFT" atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 tga 543 // +++++++++++++++++++++++++++++++++ It looks to me like there's some kind of problem with parsing the sequence version number. I even tried the sequence from test directory (AY069118.em) with same outcome. 
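The "NumberFormatException: null" in the trace above is consistent with Integer.parseInt being handed a null string, which would fit the guess that no version number was extracted from the SV line. Below is a minimal sketch of a tolerant split of that line, in plain Java with no BioJava calls. It is not the EMBLFormat code, and what EMBLFormat.java:299 actually parses is not confirmed here. The class name SvLineSketch and its parse method are hypothetical, and the sketch assumes the SV value has the form accession[.version] as in the two records above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of BioJava.
final class SvLineSketch {
    // "SV", whitespace, an accession without dots, then an optional ".<digits>".
    private static final Pattern SV_LINE =
            Pattern.compile("^SV\\s+([^.\\s]+)(?:\\.(\\d+))?\\s*$");

    final String accession;
    final Integer version; // null when the line carries no version

    private SvLineSketch(String accession, Integer version) {
        this.accession = accession;
        this.version = version;
    }

    static SvLineSketch parse(String line) {
        Matcher m = SV_LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an SV line: " + line);
        }
        String v = m.group(2);
        // Only convert when a version was captured, so a missing
        // version yields null instead of a NumberFormatException.
        return new SvLineSketch(m.group(1), v == null ? null : Integer.valueOf(v));
    }

    public static void main(String[] args) {
        SvLineSketch sv = parse("SV DQ472184.1");
        System.out.println(sv.accession + " version=" + sv.version);
    }
}
```

The point of the design is simply that the numeric conversion happens only after the version group is known to be present, so a record without a version degrades to a null value that the caller can handle.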
Regards, Seth On 6/2/06, Richard Holland wrote: > Hi Seth. > > Your second point, about the authors string not being read correctly in > Genbank format, has been fixed (or should have been if I got the code > right!). Could you check the latest version of biojava-live out of CVS > and give it another go? Basically the parser did not recognise the > CONSRTM tag, as it is not mentioned in the sample record provided by > NCBI, which is what I based the parser on. > > I've set it up now so that it reads the CONSRTM tag, but the value is > merged with the authors tag with (consortium) appended. There will still > be problems if the consortium value has commas in it - not sure how to > fix this yet. > > Your first point is harder to solve because you did not provide a > complete stack trace for the exceptions you are getting. The complete > stack trace would enable me to identify exactly where things are going > wrong and give me a better chance of fixing them. Could you send the > stack trace, and I'll see what I can do. > > cheers, > Richard > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote: > > Hi All, > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > > clarification on several issues that I'm having. > > I am developing a parser that would take as input "NCBI Incremental > > ASN.1 Sequence Updates to Genbank" files ( > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > ASN2GB converter ( > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > resulting sequences to a format parsable by BioJava(X) ( > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > my problems start. > > > > ISSUE 1: > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > tiny seq (XML) ) using either BioJava or BioJavaX API. 
Only GenBank > > format is recognized by the > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > some exceptions that I'll describe in issue #2. This is the code that > > I'm using to parse, for example, the EMBL output: > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > Namespace gbNspace = (Namespace) > > RichObjectFactory.getObject(SimpleNamespace.class, new > > Object[]{"gbSpace"} ); > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > while (gbSeqs.hasNext()) { > > try { > > RichSequence rs = gbSeqs.nextRichSequence(); > > // Further processing or RichSequence object from here > > > > } catch (BioException be){ > > be.printStackTrace(); > > } > > } > > > > The multi-sequence EMBL file looks like this: > > --------------------------------------------------------------------------------- > > ID DQ472184 standard; DNA; INV; 546 BP. > > XX > > AC DQ472184; > > XX > > SV DQ472184.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > DE complete cds. > > XX > > KW . > > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-546 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. > > XX > > RN [2] > > RP 1-546 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..546 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>546 > > FT /gene="ARC21" > > FT /note="TcARC21" > > FT mRNA <1..>546 > > FT /gene="ARC21" > > FT /product="actin-related protein 3" > > FT CDS 1..546 > > FT /gene="ARC21" > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 3" > > FT /protein_id="ABF13401.1" > > FT /db_xref="GI:93360014" > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > agttag 546 > > // > > ID DQ472185 standard; DNA; INV; 543 BP. > > XX > > AC DQ472185; > > XX > > SV DQ472185.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > DE complete cds. > > XX > > KW . 
> > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-543 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. > > XX > > RN [2] > > RP 1-543 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..543 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>543 > > FT /gene="ARC20" > > FT /note="TcARC20" > > FT mRNA <1..>543 > > FT /gene="ARC20" > > FT /product="actin-related protein 4" > > FT CDS 1..543 > > FT /gene="ARC20" > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 4" > > FT /protein_id="ABF13402.1" > > FT /db_xref="GI:93360016" > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > FT MKLNVNQRARRAAMEFFLALNFT" > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > attcaattta taattacttt 
cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > tga 543 > > // > > ----------------------------------------------------------------------- > > I get an exception message "Could Not Read Sequence". Same thing > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > with the following INSDset file (beginning of the file): > > > > > > > > > > DQ022078 > > 16729 > > DNA > > linear > > ENV > > 15-MAY-2006 > > 15-MAY-2006 > > Uncultured bacterium WWRS-2005 putative > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > class C (estA3), putative permease (a3.005), putative transmembrane > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > protein (a3.012), putative membrane protease subunit (a3.013), > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > hypothetical protein (a3.017) genes, complete cds > > DQ022078 > > > > gb|DQ022078.1| > > gi|71842722 > > > > > > ENV > > > > > > > > ? > > 1..16729 > > > > Schmeisser,C. > > Elend,C. > > Streit,W.R. > > > > Isolation and biochemical characterization > > of two novel metagenome derived esterases > > Appl. Environ. Microbiol. 0:0-0 > > (2006) > > > > > > ? > > 1..16729 > > > > Schmeisser,C. > > Elend,C. > > Streit,W.R. > > > > Submitted (29-APR-2005) to the > > EMBL/GenBank/DDBJ databases. 
Molekulare Enzymtechnologie, University > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > Germany > > > > > > > > So my question is wether the ASN2GB produces output that's > > incompatible with BioJava parsers or is there a problem with the > > sequence themselves or the problems with the majority of parsers??? > > Could it be that I'm using the API wrongly for the above formats, > > although GenBank parser works as advertised with some exceptions > > below: > > > > ISSUE #2: > > When I try to parse GenBank files using the following code: > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > Namespace gbNspace = (Namespace) > > RichObjectFactory.getObject(SimpleNamespace.class, new > > Object[]{"gbSpace"} ); > > RichSequenceIterator gbSeqs = > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > while (gbSeqs.hasNext()) { > > try { > > RichSequence rs = gbSeqs.nextRichSequence(); > > // Further processing or RichSequence object from here > > > > } catch (BioException be){ > > be.printStackTrace(); > > } > > } > > > > Genbank file in question: > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > IMAGE:30915482), complete cds. > > ACCESSION BC074905 > > VERSION BC074905.2 GI:50959825 > > KEYWORDS MGC. > > SOURCE Homo sapiens (human) > > ORGANISM Homo sapiens > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > Catarrhini; Hominidae; Homo. 
> > REFERENCE 1 (bases 1 to 838) > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > CONSRTM Mammalian Gene Collection Program Team > > TITLE Generation and initial analysis of more than 15,000 full-length > > human and mouse cDNA sequences > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > > PUBMED 12477932 > > REFERENCE 2 (bases 1 to 838) > > CONSRTM NIH MGC Project > > TITLE Direct Submission > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. 
> > Contact: MGC help desk > > Email: cgapbs-r at mail.nih.gov > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > Center > > cDNA Library Preparation: British Columbia Cancer Research Center > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > DNA Sequencing by: Genome Sequence Centre, > > BC Cancer Agency, Vancouver, BC, Canada > > info at bcgsc.bc.ca > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > Clone distribution: MGC clone distribution information can be found > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > Differences found between this sequence and the human reference > > genome (build 36) are described in misc_difference features below. 
> > FEATURES Location/Qualifiers > > source 1..838 > > /organism="Homo sapiens" > > /mol_type="mRNA" > > /db_xref="taxon:9606" > > /clone="MGC:104038 IMAGE:30915482" > > /tissue_type="Lung, PCR rescued clones" > > /clone_lib="NIH_MGC_273" > > /lab_host="DH10B" > > /note="Vector: pCR4 Topo TA with reversed insert" > > gene 1..838 > > /gene="KLK14" > > /note="synonym: KLK-L6" > > /db_xref="GeneID:43847" > > /db_xref="HGNC:6362" > > /db_xref="IMGT/GENE-DB:6362" > > /db_xref="MIM:606135" > > CDS 49..804 > > /gene="KLK14" > > /codon_start=1 > > /product="KLK14 protein" > > /protein_id="AAH74905.1" > > /db_xref="GI:50959826" > > /db_xref="GeneID:43847" > > /db_xref="HGNC:6362" > > /db_xref="IMGT/GENE-DB:6362" > > /db_xref="MIM:606135" > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > misc_difference 98 > > /gene="KLK14" > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > difference: 'R' in cDNA, 'Q' in the human genome." > > misc_difference 133 > > /gene="KLK14" > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > ORIGIN > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > // > > > > I get the following exception: > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > org.biojava.bio.BioException: Could not read sequence > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > ----------------------------------------------------------------------- > > > > I'm trying to see what could be the problem with 
this particular
> > sequence. Looks to me like the AUTHORS portion is not getting parsed
> > correctly. Any ideas would be greatly appreciated!
> >
> --
> Richard Holland (BioMart Team)
> EMBL-EBI
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> UNITED KINGDOM
> Tel: +44-(0)1223-494416
>
>

--
Best Regards,

Seth Johnson
Senior Bioinformatics Associate
Ph: (202) 470-0900
Fx: (775) 251-0358

From johnson.biotech at gmail.com Fri Jun 2 14:46:26 2006
From: johnson.biotech at gmail.com (Seth Johnson)
Date: Fri, 2 Jun 2006 14:46:26 -0400
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To:
References:
Message-ID:

Hi Mark,

Thank you for your suggestions. I followed them and seem to have found a bug that causes an exception in the readINSDseqDNA parser.

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=94481355

The problem in the above sequence in INSDseq format was caused by the presence of tags without the corresponding tags: environmental_sample

I have not checked whether it's handled correctly by other parsers when it is converted from the original NCBI ASN.1 format. Could the code be adjusted so that if there are no tags it assumes the value to be 'null'?

Regards,

Seth

On 6/1/06, mark.schreiber at novartis.com wrote:
> Hi Seth -
>
> The BioJavaX parsers are still quite new and have not been heavily tested
> so your experiences can help us quite a lot. The parsers were initially
> designed to be quite strict and follow the GenBank etc specifications.
> However, there are often minor variations to those specs which cause
> things to break.
>
> To help us find the bugs can you make sure you are using the very latest
> version of biojava from CVS, for example I was under the impression that
> the author = null problem had been solved. In each case an example file
> and the full stack trace is very useful as well.
In some cases you have > provided these so we have a starting point. > > Also, if you have ideas on ways to fix the problems your suggestions would > be greatly appreciated. We only have a very small team of active > developers many of whom are unfortunately very busy just now. > > Hopefully we can get to this soon. > > - Mark > > > > > > "Seth Johnson" > Sent by: biojava-l-bounces at lists.open-bio.org > 06/02/2006 06:03 AM > > > To: biojava-l at lists.open-bio.org > cc: (bcc: Mark Schreiber/GP/Novartis) > Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 > daily update files > > > Hi All, > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > clarification on several issues that I'm having. > I am developing a parser that would take as input "NCBI Incremental > ASN.1 Sequence Updates to Genbank" files ( > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > ASN2GB converter ( > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > resulting sequences to a format parsable by BioJava(X) ( > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > my problems start. > > ISSUE 1: > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > format is recognized by the > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > some exceptions that I'll describe in issue #2. 
This is the code that > I'm using to parse, for example, the EMBL output: > > BufferedReader inBuf = new BufferedReader(new > FileReader("embl_output.emb")); > Namespace gbNspace = (Namespace) > RichObjectFactory.getObject(SimpleNamespace.class, new > Object[]{"gbSpace"} ); > RichSequenceIterator gbSeqs = > RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > while (gbSeqs.hasNext()) { > try { > RichSequence rs = gbSeqs.nextRichSequence(); > // Further processing or RichSequence object from here > > } catch (BioException be){ > be.printStackTrace(); > } > } > > The multi-sequence EMBL file looks like this: > --------------------------------------------------------------------------------- > ID DQ472184 standard; DNA; INV; 546 BP. > XX > AC DQ472184; > XX > SV DQ472184.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) > gene, > DE complete cds. > XX > KW . > XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum. > XX > RN [1] > RP 1-546 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-546 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do > Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, > RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..546 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>546 > FT /gene="ARC21" > FT /note="TcARC21" > FT mRNA <1..>546 > FT /gene="ARC21" > FT /product="actin-related protein 3" > FT CDS 1..546 > FT /gene="ARC21" > FT /note="actin-binding protein; ARPC3 21 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 3" > FT /protein_id="ABF13401.1" > FT /db_xref="GI:93360014" > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > FT FPEKDGTGNKFWMAFAKRPFLASS" > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt > 120 > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc > 180 > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg > 240 > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat > 300 > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg > 360 > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca > 420 > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag > 480 > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct > 540 > agttag 546 > // > ID DQ472185 standard; DNA; INV; 543 BP. > XX > AC DQ472185; > XX > SV DQ472185.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) > gene, > DE complete cds. > XX > KW . 
> XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum. > XX > RN [1] > RP 1-543 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-543 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do > Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, > RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..543 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>543 > FT /gene="ARC20" > FT /note="TcARC20" > FT mRNA <1..>543 > FT /gene="ARC20" > FT /product="actin-related protein 4" > FT CDS 1..543 > FT /gene="ARC20" > FT /note="actin-binding protein; ARPC4 20 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 4" > FT /protein_id="ABF13402.1" > FT /db_xref="GI:93360016" > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > FT MKLNVNQRARRAAMEFFLALNFT" > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt > 120 > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata > 180 > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc > 240 > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt > 300 > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga > 360 > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt > 420 > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg > 480 > aatgtgaatc aacgtgcacg tcgagcagcg 
atggaattct ttcttgcatt gaatttcaca 540
> tga 543
> //
> -----------------------------------------------------------------------
> I get an exception message "Could Not Read Sequence". The same thing
> happens if I use the readINSDSetDNA reader instead of the readEMBLDNA one
> with the following INSDSet file (beginning of the file):
>
> DQ022078
> 16729
> DNA
> linear
> ENV
> 15-MAY-2006
> 15-MAY-2006
> Uncultured bacterium WWRS-2005 putative
> aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
> (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
> class C (estA3), putative permease (a3.005), putative transmembrane
> signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
> acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
> asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
> protein (a3.012), putative membrane protease subunit (a3.013),
> putative haloalkane dehalogenase (a3.014), putative transcriptional
> regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
> hypothetical protein (a3.017) genes, complete cds
> DQ022078
>
> gb|DQ022078.1|
> gi|71842722
>
> ENV
>
> ?
> 1..16729
>
> Schmeisser,C.
> Elend,C.
> Streit,W.R.
>
> Isolation and biochemical characterization
> of two novel metagenome derived esterases
> Appl. Environ. Microbiol. 0:0-0
> (2006)
>
> ?
> 1..16729
>
> Schmeisser,C.
> Elend,C.
> Streit,W.R.
>
> Submitted (29-APR-2005) to the
> EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
> Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
> Germany
>
> So my question is whether ASN2GB produces output that's
> incompatible with the BioJava parsers, whether there is a problem with
> the sequences themselves, or whether the problem lies with the majority
> of the parsers.
> Could it be that I'm using the API incorrectly for the above formats?
> The GenBank parser works as advertised, with some exceptions described
> below:
>
> ISSUE #2:
> When I try to parse GenBank files using the following code:
>
> BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
> Namespace gbNspace = (Namespace)
>     RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"gbSpace"});
> RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
> while (gbSeqs.hasNext()) {
>     try {
>         RichSequence rs = gbSeqs.nextRichSequence();
>         // Further processing of the RichSequence object from here
>     } catch (BioException be) {
>         be.printStackTrace();
>     }
> }
>
> GenBank file in question:
>
> LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006
> DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
> IMAGE:30915482), complete cds.
> ACCESSION BC074905
> VERSION BC074905.2 GI:50959825
> KEYWORDS MGC.
> SOURCE Homo sapiens (human)
> ORGANISM Homo sapiens
> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
> Catarrhini; Hominidae; Homo.
> REFERENCE 1 (bases 1 to 838) > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., > Schuler,G.D., > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., > Bhat,N.K., > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., > Hsieh,F., > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., > Peters,G.J., > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., > Myers,R.M., > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > CONSRTM Mammalian Gene Collection Program Team > TITLE Generation and initial analysis of more than 15,000 > full-length > human and mouse cDNA sequences > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > PUBMED 12477932 > REFERENCE 2 (bases 1 to 838) > CONSRTM NIH MGC Project > TITLE Direct Submission > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, > Mammalian > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. > Contact: MGC help desk > Email: cgapbs-r at mail.nih.gov > Tissue Procurement: Genome Sequence Centre, British Columbia > Cancer > Center > cDNA Library Preparation: British Columbia Cancer Research > Center > cDNA Library Arrayed by: The I.M.A.G.E. 
Consortium (LLNL) > DNA Sequencing by: Genome Sequence Centre, > BC Cancer Agency, Vancouver, BC, Canada > info at bcgsc.bc.ca > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, > Ruth > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy > Liao, > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco > Marra. > > Clone distribution: MGC clone distribution information can be > found > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > Series: IRBU Plate: 4 Row: C Column: 3. > > Differences found between this sequence and the human > reference > genome (build 36) are described in misc_difference features > below. 
> FEATURES Location/Qualifiers > source 1..838 > /organism="Homo sapiens" > /mol_type="mRNA" > /db_xref="taxon:9606" > /clone="MGC:104038 IMAGE:30915482" > /tissue_type="Lung, PCR rescued clones" > /clone_lib="NIH_MGC_273" > /lab_host="DH10B" > /note="Vector: pCR4 Topo TA with reversed insert" > gene 1..838 > /gene="KLK14" > /note="synonym: KLK-L6" > /db_xref="GeneID:43847" > /db_xref="HGNC:6362" > /db_xref="IMGT/GENE-DB:6362" > /db_xref="MIM:606135" > CDS 49..804 > /gene="KLK14" > /codon_start=1 > /product="KLK14 protein" > /protein_id="AAH74905.1" > /db_xref="GI:50959826" > /db_xref="GeneID:43847" > /db_xref="HGNC:6362" > /db_xref="IMGT/GENE-DB:6362" > /db_xref="MIM:606135" > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > misc_difference 98 > /gene="KLK14" > /note="'G' in cDNA is 'A' in the human genome; amino > acid > difference: 'R' in cDNA, 'Q' in the human genome." > misc_difference 133 > /gene="KLK14" > /note="'T' in cDNA is 'C' in the human genome; amino > acid > difference: 'Y' in cDNA, 'H' in the human genome." 
> ORIGIN > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat > gttcctcctg > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga > tgagaacaag > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc > cctgctggcg > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg > ggtcatcact > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa > cctgaggagg > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc > caactacaac > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc > acggatcggg > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac > ctcctgccga > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc > tctgcaatgc > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag > aaccatcacg > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca > gggtgactct > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg > aatggagcgc > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag > aagctggatt > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > // > > I get the following exception: > > java.lang.IllegalArgumentException: Authors string cannot be null > org.biojava.bio.BioException: Could not read sequence > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at > exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > at > exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > Caused by: java.lang.IllegalArgumentException: Authors string cannot be > null > at > org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > at > org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ----------------------------------------------------------------------- > > I'm trying to see what could be the problem with this particular > 
sequence. Looks to me like the AUTHORS portion is not getting parsed
> correctly. Any ideas would be greatly appreciated!
>
> --
> Best Regards,
>
> Seth Johnson
> Senior Bioinformatics Associate
>
> Ph: (202) 470-0900
> Fx: (775) 251-0358

--
Best Regards,

Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358

From mark.schreiber at novartis.com Sun Jun 4 22:57:35 2006
From: mark.schreiber at novartis.com (mark.schreiber at novartis.com)
Date: Mon, 5 Jun 2006 10:57:35 +0800
Subject: [Biojava-l] en.wikipedia.org/wiki/BioJava
Message-ID:

Hi all -

This page looks pretty sad and sparse
(http://en.wikipedia.org/wiki/BioJava). Does anyone feel like updating
the information in it?

- Mark

From richard.holland at ebi.ac.uk Mon Jun 5 04:44:26 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Mon, 05 Jun 2006 09:44:26 +0100
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To:
References: <1149238900.3948.87.camel@texas.ebi.ac.uk>
Message-ID: <1149497066.3947.12.camel@texas.ebi.ac.uk>

This one should be fixed in CVS now. Typo on my part - I put in code to
make it work with both 87+ and pre-87 versions of EMBL, then got the
regexes the wrong way round!!

Could you send the full stack trace for the INSDseq format problem you're
having? (The one where you say you've tracked it down to the qualifier
value being missing.) I can't see anything wrong there, so I need the
stack trace in order to know which exact sequence of events is throwing
the exception.

cheers,
Richard

On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote:
> Hi Richard,
>
> I made sure I have the latest source code from CVS compiled
> (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy
> to report that the GenBank issue is solved!
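The "87+ versus pre-87" remark in the reply above refers to the two layouts of the EMBL ID line: from release 87 on, the sequence version sits on the ID line itself, while older records (like the DQ472184 one in this thread) carry it on a separate SV line. What follows is only a minimal sketch of version-dependent ID-line handling, assuming two hypothetical regular expressions; they are not the patterns used in EMBLFormat.java, and the class and method names are made up for illustration. The 87+ sample line is the standard X56734 example from the EMBL user manual. The sketch tries the 87+ pattern first and falls back to the pre-87 one, so a null capture group never reaches Integer.parseInt, which is what produces a NumberFormatException with the message "null".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmblIdLineSketch {

    // Hypothetical pattern for pre-87 ID lines, e.g.
    //   ID   DQ472184   standard; DNA; INV; 546 BP.
    static final Pattern PRE_87 = Pattern.compile(
        "^ID\\s+(\\S+)\\s+\\S+;\\s+([^;]+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    // Hypothetical pattern for release 87+ ID lines, e.g.
    //   ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
    static final Pattern POST_87 = Pattern.compile(
        "^ID\\s+(\\S+);\\s+SV\\s+(\\d+);\\s+[^;]+;\\s+([^;]+);\\s+\\S+;\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    /**
     * Returns the sequence version found on the ID line, or -1 when the
     * line is in the pre-87 layout (where the version is on the SV line).
     */
    static int versionOf(String idLine) {
        Matcher post = POST_87.matcher(idLine);
        if (post.matches()) {
            // Group 2 is guaranteed non-null here because the pattern matched.
            return Integer.parseInt(post.group(2));
        }
        Matcher pre = PRE_87.matcher(idLine);
        if (pre.matches()) {
            return -1; // no SV field on a pre-87 ID line
        }
        throw new IllegalArgumentException("Unrecognised ID line: " + idLine);
    }

    public static void main(String[] args) {
        // Pre-87 layout, as in the record quoted in this thread: prints -1
        System.out.println(versionOf("ID   DQ472184   standard; DNA; INV; 546 BP."));
        // Release 87+ layout: prints 1
        System.out.println(versionOf("ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."));
    }
}
```

Checking that a pattern matched before reading its groups is the design point: if the two patterns are applied to the wrong layout, the failure shows up as an explicit "unrecognised line" error rather than a null passed to Integer.parseInt.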
> As for EMBL parsing, I apologize for not providing the stack dump
> for ISSUE #1. Here's the dump of the exception:
> --------------------------------------------------------
> org.biojava.bio.BioException: Could not read sequence
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
>     at exonhit.parsers.GenBankParser.main(GenBankParser.java:359)
> Caused by: java.lang.NumberFormatException: null
>     at java.lang.Integer.parseInt(Integer.java:415)
>     at java.lang.Integer.parseInt(Integer.java:497)
>     at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299)
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
>     ... 1 more
> Java Result: -1
> -------------------------------------------------------
> Here, again, is the code that I'm using to parse:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> BufferedReader gbBR = null;
> try {
>     gbBR = new BufferedReader(new FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb"));
> } catch (FileNotFoundException fnfe) {
>     fnfe.printStackTrace();
>     System.exit(-1);
> }
> Namespace gbNspace = (Namespace)
>     RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"gbSpace"});
> RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(gbBR, gbNspace);
> while (gbSeqs.hasNext()) {
>     try {
>         RichSequence rs = gbSeqs.nextRichSequence();
>         NCBITaxon myTaxon = rs.getTaxon();
>     } catch (BioException be) {
>         be.printStackTrace();
>         System.exit(-1);
>     }
> }
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> And here's the EMBL file that I'm trying to parse:
> +++++++++++++++++++++++++
> ID DQ472184 standard; DNA; INV; 546 BP.
> XX
> AC DQ472184;
> XX
> SV DQ472184.1
> DT 15-MAY-2006
> XX
> DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> DE complete cds.
> XX
> KW .
> XX
> OS Trypanosoma cruzi strain CL Brener
> OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> OC Schizotrypanum.
> XX > RN [1] > RP 1-546 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-546 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..546 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>546 > FT /gene="ARC21" > FT /note="TcARC21" > FT mRNA <1..>546 > FT /gene="ARC21" > FT /product="actin-related protein 3" > FT CDS 1..546 > FT /gene="ARC21" > FT /note="actin-binding protein; ARPC3 21 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 3" > FT /protein_id="ABF13401.1" > FT /db_xref="GI:93360014" > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > FT FPEKDGTGNKFWMAFAKRPFLASS" > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > agttag 546 > // > ID DQ472185 standard; DNA; INV; 543 BP. 
> XX > AC DQ472185; > XX > SV DQ472185.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > DE complete cds. > XX > KW . > XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum. > XX > RN [1] > RP 1-543 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-543 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..543 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>543 > FT /gene="ARC20" > FT /note="TcARC20" > FT mRNA <1..>543 > FT /gene="ARC20" > FT /product="actin-related protein 4" > FT CDS 1..543 > FT /gene="ARC20" > FT /note="actin-binding protein; ARPC4 20 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 4" > FT /protein_id="ABF13402.1" > FT /db_xref="GI:93360016" > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > FT MKLNVNQRARRAAMEFFLALNFT" > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > tatgatataa gttttttgat ttctcacgag 
gaagtagaaa caatgcatag gaataggatt 420
> attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480
> aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
> tga 543
> //
> +++++++++++++++++++++++++++++++++
>
> It looks to me like there's some kind of problem with parsing the
> sequence version number. I even tried the sequence from the test
> directory (AY069118.em) with the same outcome.
>
> Regards,
>
> Seth
>
> On 6/2/06, Richard Holland wrote:
> > Hi Seth.
> >
> > Your second point, about the authors string not being read correctly in
> > Genbank format, has been fixed (or should have been if I got the code
> > right!). Could you check the latest version of biojava-live out of CVS
> > and give it another go? Basically the parser did not recognise the
> > CONSRTM tag, as it is not mentioned in the sample record provided by
> > NCBI, which is what I based the parser on.
> >
> > I've set it up now so that it reads the CONSRTM tag, but the value is
> > merged with the authors tag with (consortium) appended. There will still
> > be problems if the consortium value has commas in it - not sure how to
> > fix this yet.
> >
> > Your first point is harder to solve because you did not provide a
> > complete stack trace for the exceptions you are getting. The complete
> > stack trace would enable me to identify exactly where things are going
> > wrong and give me a better chance of fixing them. Could you send the
> > stack trace, and I'll see what I can do.
> >
> > cheers,
> > Richard
> >
> >
> > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote:
> > > Hi All,
> > >
> > > I'm a newbie to the whole BioJava(X) API and was hoping to get some
> > > clarification on several issues that I'm having.
> > > I am developing a parser that would take as input "NCBI Incremental > > > ASN.1 Sequence Updates to Genbank" files ( > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > > ASN2GB converter ( > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > > resulting sequences to a format parsable by BioJava(X) ( > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > > my problems start. > > > > > > ISSUE 1: > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > > > format is recognized by the > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > > some exceptions that I'll describe in issue #2. This is the code that > > > I'm using to parse, for example, the EMBL output: > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > > Namespace gbNspace = (Namespace) > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > Object[]{"gbSpace"} ); > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > > while (gbSeqs.hasNext()) { > > > try { > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > // Further processing or RichSequence object from here > > > > > > } catch (BioException be){ > > > be.printStackTrace(); > > > } > > > } > > > > > > The multi-sequence EMBL file looks like this: > > > --------------------------------------------------------------------------------- > > > ID DQ472184 standard; DNA; INV; 546 BP. > > > XX > > > AC DQ472184; > > > XX > > > SV DQ472184.1 > > > DT 15-MAY-2006 > > > XX > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > DE complete cds. > > > XX > > > KW . 
> > > XX > > > OS Trypanosoma cruzi strain CL Brener > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > OC Schizotrypanum. > > > XX > > > RN [1] > > > RP 1-546 > > > RA De Melo L.D.B.; > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > RL Unpublished. > > > XX > > > RN [2] > > > RP 1-546 > > > RA De Melo L.D.B.; > > > RT ; > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > RL 21949-900, Brazil > > > XX > > > FH Key Location/Qualifiers > > > FH > > > FT source 1..546 > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > FT /mol_type="genomic DNA" > > > FT /strain="CL Brener" > > > FT /db_xref="taxon:353153" > > > FT gene <1..>546 > > > FT /gene="ARC21" > > > FT /note="TcARC21" > > > FT mRNA <1..>546 > > > FT /gene="ARC21" > > > FT /product="actin-related protein 3" > > > FT CDS 1..546 > > > FT /gene="ARC21" > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > FT member of Arp2/3 complex" > > > FT /codon_start=1 > > > FT /product="actin-related protein 3" > > > FT /protein_id="ABF13401.1" > > > FT /db_xref="GI:93360014" > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > 
> > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > agttag 546 > > > // > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > XX > > > AC DQ472185; > > > XX > > > SV DQ472185.1 > > > DT 15-MAY-2006 > > > XX > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > DE complete cds. > > > XX > > > KW . > > > XX > > > OS Trypanosoma cruzi strain CL Brener > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > OC Schizotrypanum. > > > XX > > > RN [1] > > > RP 1-543 > > > RA De Melo L.D.B.; > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > RL Unpublished. > > > XX > > > RN [2] > > > RP 1-543 > > > RA De Melo L.D.B.; > > > RT ; > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > RL 21949-900, Brazil > > > XX > > > FH Key Location/Qualifiers > > > FH > > > FT source 1..543 > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > FT /mol_type="genomic DNA" > > > FT /strain="CL Brener" > > > FT /db_xref="taxon:353153" > > > FT gene <1..>543 > > > FT /gene="ARC20" > > > FT /note="TcARC20" > > > FT mRNA <1..>543 > > > FT /gene="ARC20" > > > FT /product="actin-related protein 4" > > > FT CDS 1..543 > > > FT /gene="ARC20" > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > FT member of Arp2/3 complex" > > > FT /codon_start=1 > > > FT /product="actin-related protein 4" > > > FT /protein_id="ABF13402.1" > > > FT /db_xref="GI:93360016" > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > FT 
GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > tga 543 > > > // > > > ----------------------------------------------------------------------- > > > I get an exception message "Could Not Read Sequence". Same thing > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > > with the following INSDset file (beginning of the file): > > > > > > > > > > > > > > > DQ022078 > > > 16729 > > > DNA > > > linear > > > ENV > > > 15-MAY-2006 > > > 15-MAY-2006 > > > Uncultured bacterium WWRS-2005 putative > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > > class C (estA3), putative permease (a3.005), putative transmembrane > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > > protein (a3.012), putative membrane protease subunit (a3.013), > > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > > hypothetical protein (a3.017) genes, complete cds > > > DQ022078 > > > > > > 
gb|DQ022078.1| > > > gi|71842722 > > > > > > > > > ENV > > > > > > > > > > > > ? > > > 1..16729 > > > > > > Schmeisser,C. > > > Elend,C. > > > Streit,W.R. > > > > > > Isolation and biochemical characterization > > > of two novel metagenome derived esterases > > > Appl. Environ. Microbiol. 0:0-0 > > > (2006) > > > > > > > > > ? > > > 1..16729 > > > > > > Schmeisser,C. > > > Elend,C. > > > Streit,W.R. > > > > > > Submitted (29-APR-2005) to the > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > > Germany > > > > > > > > > > > > So my question is wether the ASN2GB produces output that's > > > incompatible with BioJava parsers or is there a problem with the > > > sequence themselves or the problems with the majority of parsers??? > > > Could it be that I'm using the API wrongly for the above formats, > > > although GenBank parser works as advertised with some exceptions > > > below: > > > > > > ISSUE #2: > > > When I try to parse GenBank files using the following code: > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > > Namespace gbNspace = (Namespace) > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > Object[]{"gbSpace"} ); > > > RichSequenceIterator gbSeqs = > > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > > while (gbSeqs.hasNext()) { > > > try { > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > // Further processing or RichSequence object from here > > > > > > } catch (BioException be){ > > > be.printStackTrace(); > > > } > > > } > > > > > > Genbank file in question: > > > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > > IMAGE:30915482), complete cds. > > > ACCESSION BC074905 > > > VERSION BC074905.2 GI:50959825 > > > KEYWORDS MGC. 
> > > SOURCE Homo sapiens (human) > > > ORGANISM Homo sapiens > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > > Catarrhini; Hominidae; Homo. > > > REFERENCE 1 (bases 1 to 838) > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > > CONSRTM Mammalian Gene Collection Program Team > > > TITLE Generation and initial analysis of more than 15,000 full-length > > > human and mouse cDNA sequences > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 
99 (26), 16899-16903 (2002) > > > PUBMED 12477932 > > > REFERENCE 2 (bases 1 to 838) > > > CONSRTM NIH MGC Project > > > TITLE Direct Submission > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. > > > Contact: MGC help desk > > > Email: cgapbs-r at mail.nih.gov > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > > Center > > > cDNA Library Preparation: British Columbia Cancer Research Center > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > > DNA Sequencing by: Genome Sequence Centre, > > > BC Cancer Agency, Vancouver, BC, Canada > > > info at bcgsc.bc.ca > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > > > Clone distribution: MGC clone distribution information can be found > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > > > Differences found between this sequence and the human reference > > > genome (build 36) are described in misc_difference features below. 
> > > FEATURES Location/Qualifiers > > > source 1..838 > > > /organism="Homo sapiens" > > > /mol_type="mRNA" > > > /db_xref="taxon:9606" > > > /clone="MGC:104038 IMAGE:30915482" > > > /tissue_type="Lung, PCR rescued clones" > > > /clone_lib="NIH_MGC_273" > > > /lab_host="DH10B" > > > /note="Vector: pCR4 Topo TA with reversed insert" > > > gene 1..838 > > > /gene="KLK14" > > > /note="synonym: KLK-L6" > > > /db_xref="GeneID:43847" > > > /db_xref="HGNC:6362" > > > /db_xref="IMGT/GENE-DB:6362" > > > /db_xref="MIM:606135" > > > CDS 49..804 > > > /gene="KLK14" > > > /codon_start=1 > > > /product="KLK14 protein" > > > /protein_id="AAH74905.1" > > > /db_xref="GI:50959826" > > > /db_xref="GeneID:43847" > > > /db_xref="HGNC:6362" > > > /db_xref="IMGT/GENE-DB:6362" > > > /db_xref="MIM:606135" > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > > misc_difference 98 > > > /gene="KLK14" > > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > > difference: 'R' in cDNA, 'Q' in the human genome." > > > misc_difference 133 > > > /gene="KLK14" > > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > > ORIGIN > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > > // > > > > > > I get the following exception: > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > > org.biojava.bio.BioException: Could not read sequence > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > 
----------------------------------------------------------------------- > > > > > > I'm trying to see what could be the problem with this particular > > > sequence. Looks to me like the AUTHORS portion is not getting parsed > > > correctly. Any ideas would be greatly appreciated! > > > > > -- > > Richard Holland (BioMart Team) > > EMBL-EBI > > Wellcome Trust Genome Campus > > Hinxton > > Cambridge CB10 1SD > > UNITED KINGDOM > > Tel: +44-(0)1223-494416 > > > > > > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From richard.holland at ebi.ac.uk Mon Jun 5 04:47:33 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Mon, 05 Jun 2006 09:47:33 +0100 Subject: [Biojava-l] viterbi training in biojava In-Reply-To: <1149065381.3948.2.camel@texas.ebi.ac.uk> References:

<1149065381.3948.2.camel@texas.ebi.ac.uk> Message-ID: <1149497253.3947.13.camel@texas.ebi.ac.uk> I just got a bounce response for this message. So I'm trying again in case you didn't get it the first time... cheers, Richard On Wed, 2006-05-31 at 09:49 +0100, Richard Holland wrote: > I've modified BaumWelchSampler in CVS so that it accepts alternative > score types as an additional parameter to singleSequenceIterator(). > > cheers, > Richard. > > > On Tue, 2006-05-30 at 16:43 +0100, wendy wong wrote: > > thanks! i only need one head so BaumWelchSampler works fine with me. > > The default SCORETYPE is probability and when I tried it the score > > goes back and forth, like + for one time and - for the next time. I > > then changed it to LOGODDS and recompiled biojava and now that the > > score is steadily increasing. I was wondering if the SCORETYPE could > > be passed in as an argument in the next version of biojava? > > > > thanks, > > wendy > > > > > > On 30 May 2006 12:19:15 +0100, David Huen wrote: > > > On May 30 2006, wendy wong wrote: > > > > > > >Hi, > > > > > > > >I was wondering if viterbi training is implemented in biojava, or if > > > >there's any open source version implemented using biojava? > > > > > > > There is one-head viterbi training already I think. The training framework > > > doesn't work for two-head - I wrote a viterbi training API that works for > > > two head but it is not fully compatible with the existing API so I never > > > put it into CVS, plus it doesn't have Baum-Welch implemented either. > > > > > > If it is any use to you you can have it. 
> > > > > > Regards, > > > David > > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From mark.schreiber at novartis.com Mon Jun 5 05:43:14 2006 From: mark.schreiber at novartis.com (mark.schreiber at novartis.com) Date: Mon, 5 Jun 2006 17:43:14 +0800 Subject: [Biojava-l] where is biojava used Message-ID: Hello - I have added a page to the biojava site that talks about the use of biojava in projects and publications. Please feel free to add your own URLs and citations. - Mark From johnson.biotech at gmail.com Mon Jun 5 10:37:31 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Mon, 5 Jun 2006 10:37:31 -0400 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: <1149238900.3948.87.camel@texas.ebi.ac.uk> References: <1149238900.3948.87.camel@texas.ebi.ac.uk> Message-ID: Hello again Richard, No sooner had I reported that the last parsing exception was fixed than another one came up with the Genbank format: -------------------------------------- org.biojava.bio.seq.io.ParseException: DQ431065 org.biojava.bio.BioException: Could not read sequence at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) at exonhit.parsers.GenBankParser.main(GenBankParser.java:326) Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) ...
3 more org.biojava.bio.seq.io.ParseException: org.biojava.bio.symbol.IllegalSymbolException: This tokenization doesn't contain character: 't' ---------------------------------------- The Genbank file that caused it is as follows: ========================================= LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006 DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial sequence; mitochondrial. ACCESSION DQ431065 VERSION DQ431065.1 GI:90102206 KEYWORDS . SOURCE Vaccinium corymbosum ORGANISM Vaccinium corymbosum Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae; Vaccinium. ? REFERENCE 2 (bases 1 to 425) AUTHORS Naik,L.D. and Rowland,L.J. TITLE Expressed Sequence Tags of cDNA clones from subtracted library of Vaccinium corymbosum JOURNAL Unpublished (2005) FEATURES Location/Qualifiers source 1..425 /organism="Vaccinium corymbosum" /mol_type="genomic DNA" /cultivar="Bluecrop" /db_xref="taxon:69266" /tissue_type="Flower buds" /clone_lib="Subtracted cDNA library of Vaccinium corymbosum" /dev_stage="399 hour chill unit exposure" /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I" rRNA <1..>425 /product="16S ribosomal RNA" ORIGIN 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag 421 cgtaa // ================================== I think it's the presence of the '?' at the beginning of the line?!?! 
I'm not sure whether the information that was supposed to be present instead of those question marks is absent from the original ASN.1 batch file or it's a bug in the NCBI ASN2GB software. It looks to me as though the former is the case, since the file from the NCBI website contains much more information than the batch file. Just bringing this to everyone's attention. -- Best Regards, Seth Johnson Senior Bioinformatics Associate Ph: (202) 470-0900 Fx: (775) 251-0358 On 6/2/06, Richard Holland wrote: > Hi Seth. > > Your second point, about the authors string not being read correctly in > Genbank format, has been fixed (or should have been if I got the code > right!). Could you check the latest version of biojava-live out of CVS > and give it another go? Basically the parser did not recognise the > CONSRTM tag, as it is not mentioned in the sample record provided by > NCBI, which is what I based the parser on. ... > > cheers, > Richard > > From richard.holland at ebi.ac.uk Mon Jun 5 11:11:07 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Mon, 05 Jun 2006 16:11:07 +0100 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: References: <1149238900.3948.87.camel@texas.ebi.ac.uk> Message-ID: <1149520267.3947.36.camel@texas.ebi.ac.uk> Hi again. Could you remove the offending question mark from the GenBank file and try it again to see if that fixes it? The parser should just ignore it, but apparently it does not. The error looks weird to me because the tokenization for a DNA GenBank file _does_ contain the letter 't'! Not sure what's going on here. With regard to your INSDseqXML problems, the stacktrace pointed to a bug in SimpleRichSequenceBuilder that would actually cause these problems for any file containing no qualifier value for a feature, regardless of format. I think I have fixed this now. Could you test it? (It's in CVS already).
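The suggestion above, removing the stray question-mark line before parsing, can be tried without editing the file by hand. What follows is only a minimal sketch, assuming the placeholder always sits alone on its own line as in the DQ431065 record; the class name `PlaceholderLineFilter` and the method `stripQuestionMarkLines` are hypothetical helpers and not part of BioJava, and nothing in this thread confirms that the step cures the IllegalSymbolException.

```java
public class PlaceholderLineFilter {

    /**
     * Returns the record text with every line that contains only a '?'
     * (ignoring surrounding whitespace) removed. Lines that merely contain
     * a question mark somewhere in their text are kept unchanged.
     */
    public static String stripQuestionMarkLines(String record) {
        StringBuilder cleaned = new StringBuilder();
        for (String line : record.split("\\r?\\n")) {
            if (line.trim().equals("?")) {
                continue; // drop the bare placeholder line
            }
            cleaned.append(line).append('\n');
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the header of the record quoted in this thread.
        String record = "KEYWORDS    .\n"
                + "?\n"
                + "REFERENCE   2  (bases 1 to 425)\n";
        System.out.print(stripQuestionMarkLines(record));
    }
}
```

The cleaned text could then be wrapped in `new BufferedReader(new StringReader(cleaned))` and handed to `RichSequence.IOTools.readGenbankDNA(...)` in the same way as the reader in the code quoted earlier in this thread.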
cheers, Richard On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote: > Hell again Richard, > > No sooner I've said about the fix of the last parsing exception than > another one came up with Genbank format: > -------------------------------------- > org.biojava.bio.seq.io.ParseException: DQ431065 > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:326) > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > ... 3 more > org.biojava.bio.seq.io.ParseException: > org.biojava.bio.symbol.IllegalSymbolException: This tokenization > doesn't contain character: 't' > ---------------------------------------- > The Genbank file that caused it is as follows: > ========================================= > LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006 > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial > sequence; mitochondrial. > ACCESSION DQ431065 > VERSION DQ431065.1 GI:90102206 > KEYWORDS . > SOURCE Vaccinium corymbosum > ORGANISM Vaccinium corymbosum > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; > Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; > asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae; > Vaccinium. > ? > REFERENCE 2 (bases 1 to 425) > AUTHORS Naik,L.D. and Rowland,L.J. 
> TITLE Expressed Sequence Tags of cDNA clones from subtracted library of > Vaccinium corymbosum > JOURNAL Unpublished (2005) > FEATURES Location/Qualifiers > source 1..425 > /organism="Vaccinium corymbosum" > /mol_type="genomic DNA" > /cultivar="Bluecrop" > /db_xref="taxon:69266" > /tissue_type="Flower buds" > /clone_lib="Subtracted cDNA library of Vaccinium > corymbosum" > /dev_stage="399 hour chill unit exposure" > /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I" > rRNA <1..>425 > /product="16S ribosomal RNA" > ORIGIN > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta > 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag > 421 cgtaa > // > ================================== > I think it's the presence of the '?' at the beginning of the line?!?! > I'm not sure wether the information that was supposed to be present > instead of those question marks is absent from the original ASN.1 > batch file or it's a bug in the NCBI ASN2GO software. It looks to me > that the former is the case since the file from NCBI website contains > much more information than the batch file. Just bringing this to > everyone's attention. > > > -- > Best Regards, > > > Seth Johnson > Senior Bioinformatics Associate > > Ph: (202) 470-0900 > Fx: (775) 251-0358 > > On 6/2/06, Richard Holland wrote: > > Hi Seth. > > > > Your second point, about the authors string not being read correctly in > > Genbank format, has been fixed (or should have been if I got the code > > right!). Could you check the latest version of biojava-live out of CVS > > and give it another go? 
Basically the parser did not recognise the > > CONSRTM tag, as it is not mentioned in the sample record provided by > > NCBI, which is what I based the parser on. > ... > > > > cheers, > > Richard > > > > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From richard.holland at ebi.ac.uk Mon Jun 5 11:16:37 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Mon, 05 Jun 2006 16:16:37 +0100 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149497066.3947.12.camel@texas.ebi.ac.uk> Message-ID: <1149520598.3947.38.camel@texas.ebi.ac.uk> Doh! I am in desperate need of coffee methinks... that's the second error in EMBLFormat directly related to me being stupid when I cut-and-pasted the stuff for the new 87+ ID line format... Should be fixed now in CVS (as of about 30 seconds ago). cheers, Richard On Mon, 2006-06-05 at 11:05 -0400, Seth Johnson wrote: > Hi Richard, > > I got another exception on EMBL format: > ============================= > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:347) > Caused by: java.lang.IllegalStateException: No match found > at java.util.regex.Matcher.group(Matcher.java:461) > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:311) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > ... 1 more > Java Result: -1 > ============================= > I used the same file from the test directory (AY069118.em) > > > Seth > > On 6/5/06, Richard Holland wrote: > > This one should be fixed in CVS now.
Typo on my behalf - I put in code > > to make it work with both 87+ and pre-87 version of EMBL, then got the > > regexes the wrong way round!! > > > ... > > > > cheers, > > Richard > > > > > > On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote: > > > Hi Richard, > > > > > > I made sure I have the latest source code from CVS compiled > > > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy > > > to report that GenBank issue is solved!!!! > > > As far as EMBL parsing, I apologize for not providing the stack dump > > > for ISSUE #1. Here's the dump of the exception: > > > -------------------------------------------------------- > > > org.biojava.bio.BioException: Could not read sequence > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:359) > > > Caused by: java.lang.NumberFormatException: null > > > at java.lang.Integer.parseInt(Integer.java:415) > > > at java.lang.Integer.parseInt(Integer.java:497) > > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299) > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > ... 
1 more > > > Java Result: -1 > > > ------------------------------------------------------- > > > Here, again, is the code that I'm using to to parse: > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > BufferedReader gbBR = null; > > > try { > > > gbBR = new BufferedReader(new > > > FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb")); > > > } catch (FileNotFoundException fnfe) { > > > fnfe.printStackTrace(); > > > System.exit(-1); > > > } > > > Namespace gbNspace = (Namespace) > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > Object[]{"gbSpace"} ); > > > RichSequenceIterator gbSeqs = > > > RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace); > > > while (gbSeqs.hasNext()) { > > > try { > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > NCBITaxon myTaxon = rs.getTaxon(); > > > }catch (BioException be){ > > > be.printStackTrace(); > > > System.exit(-1); > > > } > > > } > > > ~~~~~~~~~~~~~~~~~~~~~~~~~ > > > And here's the EMBL file that I'm trying to parse: > > > +++++++++++++++++++++++++ > > > ID DQ472184 standard; DNA; INV; 546 BP. > > > XX > > > AC DQ472184; > > > XX > > > SV DQ472184.1 > > > DT 15-MAY-2006 > > > XX > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > DE complete cds. > > > XX > > > KW . > > > XX > > > OS Trypanosoma cruzi strain CL Brener > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > OC Schizotrypanum. > > > XX > > > RN [1] > > > RP 1-546 > > > RA De Melo L.D.B.; > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > RL Unpublished. > > > XX > > > RN [2] > > > RP 1-546 > > > RA De Melo L.D.B.; > > > RT ; > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > RL 21949-900, Brazil > > > XX > > > FH Key Location/Qualifiers > > > FH > > > FT source 1..546 > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > FT /mol_type="genomic DNA" > > > FT /strain="CL Brener" > > > FT /db_xref="taxon:353153" > > > FT gene <1..>546 > > > FT /gene="ARC21" > > > FT /note="TcARC21" > > > FT mRNA <1..>546 > > > FT /gene="ARC21" > > > FT /product="actin-related protein 3" > > > FT CDS 1..546 > > > FT /gene="ARC21" > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > FT member of Arp2/3 complex" > > > FT /codon_start=1 > > > FT /product="actin-related protein 3" > > > FT /protein_id="ABF13401.1" > > > FT /db_xref="GI:93360014" > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > agttag 546 > > > // > > > ID DQ472185 standard; DNA; INV; 543 BP. 
> > > XX > > > AC DQ472185; > > > XX > > > SV DQ472185.1 > > > DT 15-MAY-2006 > > > XX > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > DE complete cds. > > > XX > > > KW . > > > XX > > > OS Trypanosoma cruzi strain CL Brener > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > OC Schizotrypanum. > > > XX > > > RN [1] > > > RP 1-543 > > > RA De Melo L.D.B.; > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > RL Unpublished. > > > XX > > > RN [2] > > > RP 1-543 > > > RA De Melo L.D.B.; > > > RT ; > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > RL 21949-900, Brazil > > > XX > > > FH Key Location/Qualifiers > > > FH > > > FT source 1..543 > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > FT /mol_type="genomic DNA" > > > FT /strain="CL Brener" > > > FT /db_xref="taxon:353153" > > > FT gene <1..>543 > > > FT /gene="ARC20" > > > FT /note="TcARC20" > > > FT mRNA <1..>543 > > > FT /gene="ARC20" > > > FT /product="actin-related protein 4" > > > FT CDS 1..543 > > > FT /gene="ARC20" > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > FT member of Arp2/3 complex" > > > FT /codon_start=1 > > > FT /product="actin-related protein 4" > > > FT /protein_id="ABF13402.1" > > > FT /db_xref="GI:93360016" > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > cgcattgtgc 
gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > tga 543 > > > // > > > +++++++++++++++++++++++++++++++++ > > > > > > It looks to me like there's some kind of problem with parsing the > > > sequence version number. I even tried the sequence from test directory > > > (AY069118.em) with same outcome. > > > > > > Regards, > > > > > > Seth > > > > > > On 6/2/06, Richard Holland wrote: > > > > Hi Seth. > > > > > > > > Your second point, about the authors string not being read correctly in > > > > Genbank format, has been fixed (or should have been if I got the code > > > > right!). Could you check the latest version of biojava-live out of CVS > > > > and give it another go? Basically the parser did not recognise the > > > > CONSRTM tag, as it is not mentioned in the sample record provided by > > > > NCBI, which is what I based the parser on. > > > > > > > > I've set it up now so that it reads the CONSRTM tag, but the value is > > > > merged with the authors tag with (consortium) appended. There will still > > > > be problems if the consortium value has commas in it - not sure how to > > > > fix this yet. > > > > > > > > Your first point is harder to solve because you did not provide a > > > > complete stack trace for the exceptions you are getting. The complete > > > > stack trace would enable me to identify exactly where things are going > > > > wrong and give me a better chance of fixing them. Could you send the > > > > stack trace, and I'll see what I can do. 
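The CONSRTM handling described in the quoted reply above (the consortium value merged into the authors value with "(consortium)" appended) can be pictured roughly as below. This is a sketch under stated assumptions, not the code in GenbankFormat: the class `ReferenceAuthors`, the method `merge`, and the ", " separator are all hypothetical, and the comma caveat raised in the reply applies here as well.

```java
import java.util.ArrayList;
import java.util.List;

public class ReferenceAuthors {

    /**
     * Combines the AUTHORS and CONSRTM values of one REFERENCE block into a
     * single author string. Either argument may be null or blank, which is
     * the case for a reference that carries only a CONSRTM line.
     */
    public static String merge(String authors, String consortium) {
        List<String> parts = new ArrayList<>();
        if (authors != null && !authors.trim().isEmpty()) {
            parts.add(authors.trim());
        }
        if (consortium != null && !consortium.trim().isEmpty()) {
            // Mark the consortium so it can be told apart from personal names.
            parts.add(consortium.trim() + " (consortium)");
        }
        // Assumed separator; a consortium name that itself contains commas
        // would still be ambiguous once merged.
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        // REFERENCE 2 of BC074905 has a CONSRTM line but no AUTHORS line.
        System.out.println(merge(null, "NIH MGC Project"));
    }
}
```

With this shape, a reference that has no AUTHORS line still yields a non-null author string, which is the situation that raised "Authors string cannot be null" for BC074905.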
> > > > > > > > cheers, > > > > Richard > > > > > > > > > > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote: > > > > > Hi All, > > > > > > > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > > > > > clarification on several issues that I'm having. > > > > > I am developing a parser that would take as input "NCBI Incremental > > > > > ASN.1 Sequence Updates to Genbank" files ( > > > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > > > > ASN2GB converter ( > > > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > > > > resulting sequences to a format parsable by BioJava(X) ( > > > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > > > > my problems start. > > > > > > > > > > ISSUE 1: > > > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > > > > > format is recognized by the > > > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > > > > some exceptions that I'll describe in issue #2. 
This is the code that > > > > > I'm using to parse, for example, the EMBL output: > > > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > > > > Namespace gbNspace = (Namespace) > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > Object[]{"gbSpace"} ); > > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > > > > while (gbSeqs.hasNext()) { > > > > > try { > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > // Further processing or RichSequence object from here > > > > > > > > > > } catch (BioException be){ > > > > > be.printStackTrace(); > > > > > } > > > > > } > > > > > > > > > > The multi-sequence EMBL file looks like this: > > > > > --------------------------------------------------------------------------------- > > > > > ID DQ472184 standard; DNA; INV; 546 BP. > > > > > XX > > > > > AC DQ472184; > > > > > XX > > > > > SV DQ472184.1 > > > > > DT 15-MAY-2006 > > > > > XX > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > > > DE complete cds. > > > > > XX > > > > > KW . > > > > > XX > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > OC Schizotrypanum. > > > > > XX > > > > > RN [1] > > > > > RP 1-546 > > > > > RA De Melo L.D.B.; > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > RL Unpublished. > > > > > XX > > > > > RN [2] > > > > > RP 1-546 > > > > > RA De Melo L.D.B.; > > > > > RT ; > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > RL 21949-900, Brazil > > > > > XX > > > > > FH Key Location/Qualifiers > > > > > FH > > > > > FT source 1..546 > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > FT /mol_type="genomic DNA" > > > > > FT /strain="CL Brener" > > > > > FT /db_xref="taxon:353153" > > > > > FT gene <1..>546 > > > > > FT /gene="ARC21" > > > > > FT /note="TcARC21" > > > > > FT mRNA <1..>546 > > > > > FT /gene="ARC21" > > > > > FT /product="actin-related protein 3" > > > > > FT CDS 1..546 > > > > > FT /gene="ARC21" > > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > > FT member of Arp2/3 complex" > > > > > FT /codon_start=1 > > > > > FT /product="actin-related protein 3" > > > > > FT /protein_id="ABF13401.1" > > > > > FT /db_xref="GI:93360014" > > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > > agttag 546 > > > > > // > > > > > ID DQ472185 standard; DNA; INV; 
543 BP. > > > > > XX > > > > > AC DQ472185; > > > > > XX > > > > > SV DQ472185.1 > > > > > DT 15-MAY-2006 > > > > > XX > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > > DE complete cds. > > > > > XX > > > > > KW . > > > > > XX > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > OC Schizotrypanum. > > > > > XX > > > > > RN [1] > > > > > RP 1-543 > > > > > RA De Melo L.D.B.; > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > RL Unpublished. > > > > > XX > > > > > RN [2] > > > > > RP 1-543 > > > > > RA De Melo L.D.B.; > > > > > RT ; > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > RL 21949-900, Brazil > > > > > XX > > > > > FH Key Location/Qualifiers > > > > > FH > > > > > FT source 1..543 > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > FT /mol_type="genomic DNA" > > > > > FT /strain="CL Brener" > > > > > FT /db_xref="taxon:353153" > > > > > FT gene <1..>543 > > > > > FT /gene="ARC20" > > > > > FT /note="TcARC20" > > > > > FT mRNA <1..>543 > > > > > FT /gene="ARC20" > > > > > FT /product="actin-related protein 4" > > > > > FT CDS 1..543 > > > > > FT /gene="ARC20" > > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > > FT member of Arp2/3 complex" > > > > > FT /codon_start=1 > > > > > FT /product="actin-related protein 4" > > > > > FT /protein_id="ABF13402.1" > > > > > FT /db_xref="GI:93360016" > > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > > 
atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > > > tga 543 > > > > > // > > > > > ----------------------------------------------------------------------- > > > > > I get an exception message "Could Not Read Sequence". Same thing > > > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > > > > with the following INSDset file (beginning of the file): > > > > > > > > > > > > > > > > > > > > > > > > > DQ022078 > > > > > 16729 > > > > > DNA > > > > > linear > > > > > ENV > > > > > 15-MAY-2006 > > > > > 15-MAY-2006 > > > > > Uncultured bacterium WWRS-2005 putative > > > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > > > > class C (estA3), putative permease (a3.005), putative transmembrane > > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > > > > protein (a3.012), putative membrane protease subunit (a3.013), > > > > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > > > > hypothetical protein (a3.017) 
genes, complete cds > > > > > DQ022078 > > > > > > > > > > gb|DQ022078.1| > > > > > gi|71842722 > > > > > > > > > > > > > > > ENV > > > > > > > > > > > > > > > > > > > > ? > > > > > 1..16729 > > > > > > > > > > Schmeisser,C. > > > > > Elend,C. > > > > > Streit,W.R. > > > > > > > > > > Isolation and biochemical characterization > > > > > of two novel metagenome derived esterases > > > > > Appl. Environ. Microbiol. 0:0-0 > > > > > (2006) > > > > > > > > > > > > > > > ? > > > > > 1..16729 > > > > > > > > > > Schmeisser,C. > > > > > Elend,C. > > > > > Streit,W.R. > > > > > > > > > > Submitted (29-APR-2005) to the > > > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University > > > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > > > > Germany > > > > > > > > > > > > > > > > > > > > So my question is whether ASN2GB produces output that's > > > > > incompatible with the BioJava parsers, whether there is a problem with the > > > > > sequences themselves, or whether the problem lies with the majority of the parsers?
> > > > > Could it be that I'm using the API wrongly for the above formats, > > > > > although GenBank parser works as advertised with some exceptions > > > > > below: > > > > > > > > > > ISSUE #2: > > > > > When I try to parse GenBank files using the following code: > > > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > > > > Namespace gbNspace = (Namespace) > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > Object[]{"gbSpace"} ); > > > > > RichSequenceIterator gbSeqs = > > > > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > > > > while (gbSeqs.hasNext()) { > > > > > try { > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > // Further processing or RichSequence object from here > > > > > > > > > > } catch (BioException be){ > > > > > be.printStackTrace(); > > > > > } > > > > > } > > > > > > > > > > Genbank file in question: > > > > > > > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > > > > IMAGE:30915482), complete cds. > > > > > ACCESSION BC074905 > > > > > VERSION BC074905.2 GI:50959825 > > > > > KEYWORDS MGC. > > > > > SOURCE Homo sapiens (human) > > > > > ORGANISM Homo sapiens > > > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > > > > Catarrhini; Hominidae; Homo. 
> > > > > REFERENCE 1 (bases 1 to 838) > > > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > > > > CONSRTM Mammalian Gene Collection Program Team > > > > > TITLE Generation and initial analysis of more than 15,000 full-length > > > > > human and mouse cDNA sequences > > > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > > > > > PUBMED 12477932 > > > > > REFERENCE 2 (bases 1 to 838) > > > > > CONSRTM NIH MGC Project > > > > > TITLE Direct Submission > > > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. 
> > > > > Contact: MGC help desk > > > > > Email: cgapbs-r at mail.nih.gov > > > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > > > > Center > > > > > cDNA Library Preparation: British Columbia Cancer Research Center > > > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > > > > DNA Sequencing by: Genome Sequence Centre, > > > > > BC Cancer Agency, Vancouver, BC, Canada > > > > > info at bcgsc.bc.ca > > > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > > > > > > > Clone distribution: MGC clone distribution information can be found > > > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > > > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > > > > > > > Differences found between this sequence and the human reference > > > > > genome (build 36) are described in misc_difference features below. 
> > > > > FEATURES Location/Qualifiers > > > > > source 1..838 > > > > > /organism="Homo sapiens" > > > > > /mol_type="mRNA" > > > > > /db_xref="taxon:9606" > > > > > /clone="MGC:104038 IMAGE:30915482" > > > > > /tissue_type="Lung, PCR rescued clones" > > > > > /clone_lib="NIH_MGC_273" > > > > > /lab_host="DH10B" > > > > > /note="Vector: pCR4 Topo TA with reversed insert" > > > > > gene 1..838 > > > > > /gene="KLK14" > > > > > /note="synonym: KLK-L6" > > > > > /db_xref="GeneID:43847" > > > > > /db_xref="HGNC:6362" > > > > > /db_xref="IMGT/GENE-DB:6362" > > > > > /db_xref="MIM:606135" > > > > > CDS 49..804 > > > > > /gene="KLK14" > > > > > /codon_start=1 > > > > > /product="KLK14 protein" > > > > > /protein_id="AAH74905.1" > > > > > /db_xref="GI:50959826" > > > > > /db_xref="GeneID:43847" > > > > > /db_xref="HGNC:6362" > > > > > /db_xref="IMGT/GENE-DB:6362" > > > > > /db_xref="MIM:606135" > > > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > > > > misc_difference 98 > > > > > /gene="KLK14" > > > > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > > > > difference: 'R' in cDNA, 'Q' in the human genome." > > > > > misc_difference 133 > > > > > /gene="KLK14" > > > > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > > > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > > > > ORIGIN > > > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > > > > // > > > > > > > > > > I get the following exception: > > > > > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > > > > org.biojava.bio.BioException: Could not read sequence > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > > > > at 
org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > > > > > ----------------------------------------------------------------------- > > > > > > > > > > I'm trying to see what could be the problem with this particular > > > > > sequence. Looks to me like the AUTHORS portion is not getting parsed > > > > > correctly. Any ideas would be greatly appreciated! > > > > > > > > > -- > > > > Richard Holland (BioMart Team) > > > > EMBL-EBI > > > > Wellcome Trust Genome Campus > > > > Hinxton > > > > Cambridge CB10 1SD > > > > UNITED KINGDOM > > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > > > > > > -- > > Richard Holland (BioMart Team) > > EMBL-EBI > > Wellcome Trust Genome Campus > > Hinxton > > Cambridge CB10 1SD > > UNITED KINGDOM > > Tel: +44-(0)1223-494416 > > > > > > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From johnson.biotech at gmail.com Mon Jun 5 11:05:21 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Mon, 5 Jun 2006 11:05:21 -0400 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: <1149497066.3947.12.camel@texas.ebi.ac.uk> References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149497066.3947.12.camel@texas.ebi.ac.uk> Message-ID: Hi Richard, I go another exception on EMBL format: ============================= org.biojava.bio.BioException: Could not read sequence at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) at exonhit.parsers.GenBankParser.main(GenBankParser.java:347) Caused by: java.lang.IllegalStateException: No match found at java.util.regex.Matcher.group(Matcher.java:461) at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:311) at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) ... 
1 more Java Result: -1 ============================= I used the same file from the test directory (AY069118.em). Seth On 6/5/06, Richard Holland wrote: > This one should be fixed in CVS now. Typo on my part - I put in code > to make it work with both 87+ and pre-87 versions of EMBL, then got the > regexes the wrong way round!! > ... > > cheers, > Richard > > > On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote: > > Hi Richard, > > > > I made sure I have the latest source code from CVS compiled > > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy > > to report that the GenBank issue is solved!!!! > > As for EMBL parsing, I apologize for not providing the stack dump > > for ISSUE #1. Here's the dump of the exception: > > -------------------------------------------------------- > > org.biojava.bio.BioException: Could not read sequence > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:359) > > Caused by: java.lang.NumberFormatException: null > > at java.lang.Integer.parseInt(Integer.java:415) > > at java.lang.Integer.parseInt(Integer.java:497) > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299) > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ...
1 more > > Java Result: -1 > > ------------------------------------------------------- > > Here, again, is the code that I'm using to parse: > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > BufferedReader gbBR = null; > > try { > > gbBR = new BufferedReader(new > > FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb")); > > } catch (FileNotFoundException fnfe) { > > fnfe.printStackTrace(); > > System.exit(-1); > > } > > Namespace gbNspace = (Namespace) > > RichObjectFactory.getObject(SimpleNamespace.class, new > > Object[]{"gbSpace"} ); > > RichSequenceIterator gbSeqs = > > RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace); > > while (gbSeqs.hasNext()) { > > try { > > RichSequence rs = gbSeqs.nextRichSequence(); > > NCBITaxon myTaxon = rs.getTaxon(); > > } catch (BioException be){ > > be.printStackTrace(); > > System.exit(-1); > > } > > } > > ~~~~~~~~~~~~~~~~~~~~~~~~~ > > And here's the EMBL file that I'm trying to parse: > > +++++++++++++++++++++++++ > > ID DQ472184 standard; DNA; INV; 546 BP. > > XX > > AC DQ472184; > > XX > > SV DQ472184.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > DE complete cds. > > XX > > KW . > > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-546 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. > > XX > > RN [2] > > RP 1-546 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..546 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>546 > > FT /gene="ARC21" > > FT /note="TcARC21" > > FT mRNA <1..>546 > > FT /gene="ARC21" > > FT /product="actin-related protein 3" > > FT CDS 1..546 > > FT /gene="ARC21" > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 3" > > FT /protein_id="ABF13401.1" > > FT /db_xref="GI:93360014" > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > agttag 546 > > // > > ID DQ472185 standard; DNA; INV; 543 BP. > > XX > > AC DQ472185; > > XX > > SV DQ472185.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > DE complete cds. > > XX > > KW . 
> > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-543 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. > > XX > > RN [2] > > RP 1-543 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..543 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>543 > > FT /gene="ARC20" > > FT /note="TcARC20" > > FT mRNA <1..>543 > > FT /gene="ARC20" > > FT /product="actin-related protein 4" > > FT CDS 1..543 > > FT /gene="ARC20" > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 4" > > FT /protein_id="ABF13402.1" > > FT /db_xref="GI:93360016" > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > FT MKLNVNQRARRAAMEFFLALNFT" > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > attcaattta taattacttt 
cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > tga 543 > > // > > +++++++++++++++++++++++++++++++++ > > > > It looks to me like there's some kind of problem with parsing the > > sequence version number. I even tried the sequence from test directory > > (AY069118.em) with same outcome. > > > > Regards, > > > > Seth > > > > On 6/2/06, Richard Holland wrote: > > > Hi Seth. > > > > > > Your second point, about the authors string not being read correctly in > > > Genbank format, has been fixed (or should have been if I got the code > > > right!). Could you check the latest version of biojava-live out of CVS > > > and give it another go? Basically the parser did not recognise the > > > CONSRTM tag, as it is not mentioned in the sample record provided by > > > NCBI, which is what I based the parser on. > > > > > > I've set it up now so that it reads the CONSRTM tag, but the value is > > > merged with the authors tag with (consortium) appended. There will still > > > be problems if the consortium value has commas in it - not sure how to > > > fix this yet. > > > > > > Your first point is harder to solve because you did not provide a > > > complete stack trace for the exceptions you are getting. The complete > > > stack trace would enable me to identify exactly where things are going > > > wrong and give me a better chance of fixing them. Could you send the > > > stack trace, and I'll see what I can do. > > > > > > cheers, > > > Richard > > > > > > > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote: > > > > Hi All, > > > > > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > > > > clarification on several issues that I'm having. 
> > > > I am developing a parser that would take as input "NCBI Incremental > > > > ASN.1 Sequence Updates to Genbank" files ( > > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > > > ASN2GB converter ( > > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > > > resulting sequences to a format parsable by BioJava(X) ( > > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > > > my problems start. > > > > > > > > ISSUE 1: > > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > > > > format is recognized by the > > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > > > some exceptions that I'll describe in issue #2. This is the code that > > > > I'm using to parse, for example, the EMBL output: > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > > > Namespace gbNspace = (Namespace) > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > Object[]{"gbSpace"} ); > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > > > while (gbSeqs.hasNext()) { > > > > try { > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > // Further processing or RichSequence object from here > > > > > > > > } catch (BioException be){ > > > > be.printStackTrace(); > > > > } > > > > } > > > > > > > > The multi-sequence EMBL file looks like this: > > > > --------------------------------------------------------------------------------- > > > > ID DQ472184 standard; DNA; INV; 546 BP. > > > > XX > > > > AC DQ472184; > > > > XX > > > > SV DQ472184.1 > > > > DT 15-MAY-2006 > > > > XX > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > > DE complete cds. > > > > XX > > > > KW . 
> > > > XX > > > > OS Trypanosoma cruzi strain CL Brener > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > OC Schizotrypanum. > > > > XX > > > > RN [1] > > > > RP 1-546 > > > > RA De Melo L.D.B.; > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > RL Unpublished. > > > > XX > > > > RN [2] > > > > RP 1-546 > > > > RA De Melo L.D.B.; > > > > RT ; > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > RL 21949-900, Brazil > > > > XX > > > > FH Key Location/Qualifiers > > > > FH > > > > FT source 1..546 > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > FT /mol_type="genomic DNA" > > > > FT /strain="CL Brener" > > > > FT /db_xref="taxon:353153" > > > > FT gene <1..>546 > > > > FT /gene="ARC21" > > > > FT /note="TcARC21" > > > > FT mRNA <1..>546 > > > > FT /gene="ARC21" > > > > FT /product="actin-related protein 3" > > > > FT CDS 1..546 > > > > FT /gene="ARC21" > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > FT member of Arp2/3 complex" > > > > FT /codon_start=1 > > > > FT /product="actin-related protein 3" > > > > FT /protein_id="ABF13401.1" > > > > FT /db_xref="GI:93360014" > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt 
tgaagcgtga agaggcccat 300 > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > agttag 546 > > > > // > > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > > XX > > > > AC DQ472185; > > > > XX > > > > SV DQ472185.1 > > > > DT 15-MAY-2006 > > > > XX > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > DE complete cds. > > > > XX > > > > KW . > > > > XX > > > > OS Trypanosoma cruzi strain CL Brener > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > OC Schizotrypanum. > > > > XX > > > > RN [1] > > > > RP 1-543 > > > > RA De Melo L.D.B.; > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > RL Unpublished. > > > > XX > > > > RN [2] > > > > RP 1-543 > > > > RA De Melo L.D.B.; > > > > RT ; > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > RL 21949-900, Brazil > > > > XX > > > > FH Key Location/Qualifiers > > > > FH > > > > FT source 1..543 > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > FT /mol_type="genomic DNA" > > > > FT /strain="CL Brener" > > > > FT /db_xref="taxon:353153" > > > > FT gene <1..>543 > > > > FT /gene="ARC20" > > > > FT /note="TcARC20" > > > > FT mRNA <1..>543 > > > > FT /gene="ARC20" > > > > FT /product="actin-related protein 4" > > > > FT CDS 1..543 > > > > FT /gene="ARC20" > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > FT member of Arp2/3 complex" > > > > FT /codon_start=1 > > > > FT /product="actin-related protein 4" > > > > FT /protein_id="ABF13402.1" > > > > FT /db_xref="GI:93360016" > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > > tga 543 > > > > // > > > > ----------------------------------------------------------------------- > > > > I get an exception message "Could Not 
Read Sequence". Same thing > > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > > > with the following INSDset file (beginning of the file): > > > > > > > > > > > > > > > > > > > > DQ022078 > > > > 16729 > > > > DNA > > > > linear > > > > ENV > > > > 15-MAY-2006 > > > > 15-MAY-2006 > > > > Uncultured bacterium WWRS-2005 putative > > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > > > class C (estA3), putative permease (a3.005), putative transmembrane > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > > > protein (a3.012), putative membrane protease subunit (a3.013), > > > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > > > hypothetical protein (a3.017) genes, complete cds > > > > DQ022078 > > > > > > > > gb|DQ022078.1| > > > > gi|71842722 > > > > > > > > > > > > ENV > > > > > > > > > > > > > > > > ? > > > > 1..16729 > > > > > > > > Schmeisser,C. > > > > Elend,C. > > > > Streit,W.R. > > > > > > > > Isolation and biochemical characterization > > > > of two novel metagenome derived esterases > > > > Appl. Environ. Microbiol. 0:0-0 > > > > (2006) > > > > > > > > > > > > ? > > > > 1..16729 > > > > > > > > Schmeisser,C. > > > > Elend,C. > > > > Streit,W.R. > > > > > > > > Submitted (29-APR-2005) to the > > > > EMBL/GenBank/DDBJ databases. 
Molekulare Enzymtechnologie, University > > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > > > Germany > > > > > > > > > > > > > > > > So my question is wether the ASN2GB produces output that's > > > > incompatible with BioJava parsers or is there a problem with the > > > > sequence themselves or the problems with the majority of parsers??? > > > > Could it be that I'm using the API wrongly for the above formats, > > > > although GenBank parser works as advertised with some exceptions > > > > below: > > > > > > > > ISSUE #2: > > > > When I try to parse GenBank files using the following code: > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > > > Namespace gbNspace = (Namespace) > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > Object[]{"gbSpace"} ); > > > > RichSequenceIterator gbSeqs = > > > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > > > while (gbSeqs.hasNext()) { > > > > try { > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > // Further processing or RichSequence object from here > > > > > > > > } catch (BioException be){ > > > > be.printStackTrace(); > > > > } > > > > } > > > > > > > > Genbank file in question: > > > > > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > > > IMAGE:30915482), complete cds. > > > > ACCESSION BC074905 > > > > VERSION BC074905.2 GI:50959825 > > > > KEYWORDS MGC. > > > > SOURCE Homo sapiens (human) > > > > ORGANISM Homo sapiens > > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > > > Catarrhini; Hominidae; Homo. 
> > > > REFERENCE 1 (bases 1 to 838) > > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > > > CONSRTM Mammalian Gene Collection Program Team > > > > TITLE Generation and initial analysis of more than 15,000 full-length > > > > human and mouse cDNA sequences > > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > > > > PUBMED 12477932 > > > > REFERENCE 2 (bases 1 to 838) > > > > CONSRTM NIH MGC Project > > > > TITLE Direct Submission > > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. 
> > > > Contact: MGC help desk > > > > Email: cgapbs-r at mail.nih.gov > > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > > > Center > > > > cDNA Library Preparation: British Columbia Cancer Research Center > > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > > > DNA Sequencing by: Genome Sequence Centre, > > > > BC Cancer Agency, Vancouver, BC, Canada > > > > info at bcgsc.bc.ca > > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > > > > > Clone distribution: MGC clone distribution information can be found > > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > > > > > Differences found between this sequence and the human reference > > > > genome (build 36) are described in misc_difference features below. 
> > > > FEATURES Location/Qualifiers > > > > source 1..838 > > > > /organism="Homo sapiens" > > > > /mol_type="mRNA" > > > > /db_xref="taxon:9606" > > > > /clone="MGC:104038 IMAGE:30915482" > > > > /tissue_type="Lung, PCR rescued clones" > > > > /clone_lib="NIH_MGC_273" > > > > /lab_host="DH10B" > > > > /note="Vector: pCR4 Topo TA with reversed insert" > > > > gene 1..838 > > > > /gene="KLK14" > > > > /note="synonym: KLK-L6" > > > > /db_xref="GeneID:43847" > > > > /db_xref="HGNC:6362" > > > > /db_xref="IMGT/GENE-DB:6362" > > > > /db_xref="MIM:606135" > > > > CDS 49..804 > > > > /gene="KLK14" > > > > /codon_start=1 > > > > /product="KLK14 protein" > > > > /protein_id="AAH74905.1" > > > > /db_xref="GI:50959826" > > > > /db_xref="GeneID:43847" > > > > /db_xref="HGNC:6362" > > > > /db_xref="IMGT/GENE-DB:6362" > > > > /db_xref="MIM:606135" > > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > > > misc_difference 98 > > > > /gene="KLK14" > > > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > > > difference: 'R' in cDNA, 'Q' in the human genome." > > > > misc_difference 133 > > > > /gene="KLK14" > > > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > > > ORIGIN > > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > > > // > > > > > > > > I get the following exception: > > > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > > > org.biojava.bio.BioException: Could not read sequence > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > > > 
-----------------------------------------------------------------------
> > > >
> > > > I'm trying to see what could be the problem with this particular
> > > > sequence. Looks to me like the AUTHORS portion is not getting parsed
> > > > correctly. Any ideas would be greatly appreciated!
> > > >
> > > --
> > > Richard Holland (BioMart Team)
> > > EMBL-EBI
> > > Wellcome Trust Genome Campus
> > > Hinxton
> > > Cambridge CB10 1SD
> > > UNITED KINGDOM
> > > Tel: +44-(0)1223-494416
> > >
> > >
> >
> >
> --
> Richard Holland (BioMart Team)
> EMBL-EBI
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> UNITED KINGDOM
> Tel: +44-(0)1223-494416
>
>

--
Best Regards,

Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358

From richard.holland at ebi.ac.uk Mon Jun 5 11:45:13 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Mon, 05 Jun 2006 16:45:13 +0100
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To:
References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149520267.3947.36.camel@texas.ebi.ac.uk>
Message-ID: <1149522313.3947.48.camel@texas.ebi.ac.uk>

Hmmm... interesting. I _could_ put in a special case that ignores the
question marks, but that wouldn't be 'nice' really - this is more of a
problem with the program that is producing the Genbank files than a
problem with the parser trying to read them. '?' is not a valid tag in
the official Genbank format, and has no meaning attached to it that I
can work out, so I'm reluctant to make the parser recognise it.

I'd suggest you contact the people who write the software you are using
to produce the Genbank files and ask them if they could stick to the
rules! In the meantime you could work around the problem by stripping
the question marks in some kind of pre-processor before passing it onto
BioJavaX for parsing.

cheers,
Richard

On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote:
> Removing '?'
(or several of them in my case) avoids the following exception: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957 > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > ... 1 more > Java Result: -1 > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > I don't know where that previous tokenization problem came from since > I can no longer reproduce it. This time it's more or less straight > forward. > Here's the original file with question marks: > ============================ > LOCUS DQ415957 1437 bp mRNA linear VRT 01-JUN-2006 > DEFINITION Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA, > complete cds. > ACCESSION DQ415957 > VERSION DQ415957.1 GI:89513612 > KEYWORDS . > SOURCE Unknown. > ORGANISM Unknown. > Unclassified. > ? > ? > FEATURES Location/Qualifiers > ? 
> gene 1..1437 > /gene="cmg2a" > CDS 1..1437 > /gene="cmg2a" > /note="cell surface receptor; similar to anthrax toxin > receptor 2 (ANTXR2, ATR2, CMG2)" > /codon_start=1 > /product="capillary morphogenesis protein 2A" > /protein_id="ABD74633.1" > /db_xref="GI:89513613" > /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS > GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS > EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY > GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS > SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT > VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC > CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS > TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL > RRQYDRVSVMRPTSADKGRCMNFSRTQH" > ORIGIN > 1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc > 61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg > 121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat > 181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga > 241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc > 301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact > 361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga > 421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat > 481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg > 541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc > 601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc > 661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa > 721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc > 781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa > 841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc > 901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc > 961 ctgttggctt tggctctgat gtggtggttc tggcctctat 
gctgcactgt cgttattaaa > 1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc > 1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc > 1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag > 1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag > 1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga > 1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg > 1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa > // > > ============================ > > > On 6/5/06, Richard Holland wrote: > > Hi again. > > > > Could you remove the offending question mark from the GenBank file and > > try it again to see if that fixes it? The parser should just ignore it > > but apparently not. The error looks weird to me because the tokenization > > for a DNA GenBank file _does_ contain the letter 't'! Not sure what's > > going on here. > ... > > > > cheers, > > Richard > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote: > > > Hell again Richard, > > > > > > No sooner I've said about the fix of the last parsing exception than > > > another one came up with Genbank format: > > > -------------------------------------- > > > org.biojava.bio.seq.io.ParseException: DQ431065 > > > org.biojava.bio.BioException: Could not read sequence > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:326) > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > ... 
3 more > > > org.biojava.bio.seq.io.ParseException: > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization > > > doesn't contain character: 't' > > > ---------------------------------------- > > > The Genbank file that caused it is as follows: > > > ========================================= > > > LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006 > > > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial > > > sequence; mitochondrial. > > > ACCESSION DQ431065 > > > VERSION DQ431065.1 GI:90102206 > > > KEYWORDS . > > > SOURCE Vaccinium corymbosum > > > ORGANISM Vaccinium corymbosum > > > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; > > > Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; > > > asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae; > > > Vaccinium. > > > ? > > > REFERENCE 2 (bases 1 to 425) > > > AUTHORS Naik,L.D. and Rowland,L.J. > > > TITLE Expressed Sequence Tags of cDNA clones from subtracted library of > > > Vaccinium corymbosum > > > JOURNAL Unpublished (2005) > > > FEATURES Location/Qualifiers > > > source 1..425 > > > /organism="Vaccinium corymbosum" > > > /mol_type="genomic DNA" > > > /cultivar="Bluecrop" > > > /db_xref="taxon:69266" > > > /tissue_type="Flower buds" > > > /clone_lib="Subtracted cDNA library of Vaccinium > > > corymbosum" > > > /dev_stage="399 hour chill unit exposure" > > > /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I" > > > rRNA <1..>425 > > > /product="16S ribosomal RNA" > > > ORIGIN > > > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac > > > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt > > > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt > > > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta > > > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat > > > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta > > > 361 
tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
> > > 421 cgtaa
> > > //
> > > ==================================
> > > I think it's the presence of the '?' at the beginning of the line?!?!
> > > I'm not sure whether the information that was supposed to be present
> > > instead of those question marks is absent from the original ASN.1
> > > batch file or it's a bug in the NCBI ASN2GB software. It looks to me
> > > that the former is the case since the file from NCBI website contains
> > > much more information than the batch file. Just bringing this to
> > > everyone's attention.
> > >
> > >
> > > --
> > > Best Regards,
> > >
> > >
> > > Seth Johnson
> > > Senior Bioinformatics Associate
> > >
> > > Ph: (202) 470-0900
> > > Fx: (775) 251-0358
> > >
> > > On 6/2/06, Richard Holland wrote:
> > > > Hi Seth.
> > > >
> > > > Your second point, about the authors string not being read correctly in
> > > > Genbank format, has been fixed (or should have been if I got the code
> > > > right!). Could you check the latest version of biojava-live out of CVS
> > > > and give it another go? Basically the parser did not recognise the
> > > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > > NCBI, which is what I based the parser on.
> > > ...
> > > >
> > > > cheers,
> > > > Richard
> > > >
> > > >
> > --
> > Richard Holland (BioMart Team)
> > EMBL-EBI
> > Wellcome Trust Genome Campus
> > Hinxton
> > Cambridge CB10 1SD
> > UNITED KINGDOM
> > Tel: +44-(0)1223-494416
> >
> >
> >
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From johnson.biotech at gmail.com Mon Jun 5 11:39:40 2006
From: johnson.biotech at gmail.com (Seth Johnson)
Date: Mon, 5 Jun 2006 11:39:40 -0400
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To: <1149520267.3947.36.camel@texas.ebi.ac.uk>
References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149520267.3947.36.camel@texas.ebi.ac.uk>
Message-ID:

Removing '?' (or several of them in my case) avoids the following exception:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
        at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
Caused by: org.biojava.bio.seq.io.ParseException: DQ415957
        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
        ... 1 more
Java Result: -1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I don't know where that previous tokenization problem came from since
I can no longer reproduce it. This time it's more or less
straightforward.
Here's the original file with question marks:
============================
LOCUS       DQ415957                1437 bp    mRNA    linear   VRT 01-JUN-2006
DEFINITION  Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA,
            complete cds.
ACCESSION   DQ415957
VERSION     DQ415957.1  GI:89513612
KEYWORDS    .
SOURCE      Unknown.
  ORGANISM  Unknown.
            Unclassified.
?
?
FEATURES             Location/Qualifiers
?
gene 1..1437 /gene="cmg2a" CDS 1..1437 /gene="cmg2a" /note="cell surface receptor; similar to anthrax toxin receptor 2 (ANTXR2, ATR2, CMG2)" /codon_start=1 /product="capillary morphogenesis protein 2A" /protein_id="ABD74633.1" /db_xref="GI:89513613" /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL RRQYDRVSVMRPTSADKGRCMNFSRTQH" ORIGIN 1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc 61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg 121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat 181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga 241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc 301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact 361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga 421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat 481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg 541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc 601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc 661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa 721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc 781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa 841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc 901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc 961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa 1021 gacccacctc cacaaagacc tcctccacct ccacctaagc 
tagagccaga cccggaaccc 1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc 1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag 1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag 1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga 1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg 1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa // ============================ On 6/5/06, Richard Holland wrote: > Hi again. > > Could you remove the offending question mark from the GenBank file and > try it again to see if that fixes it? The parser should just ignore it > but apparently not. The error looks weird to me because the tokenization > for a DNA GenBank file _does_ contain the letter 't'! Not sure what's > going on here. ... > > cheers, > Richard > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote: > > Hell again Richard, > > > > No sooner I've said about the fix of the last parsing exception than > > another one came up with Genbank format: > > -------------------------------------- > > org.biojava.bio.seq.io.ParseException: DQ431065 > > org.biojava.bio.BioException: Could not read sequence > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:326) > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ... 
3 more > > org.biojava.bio.seq.io.ParseException: > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization > > doesn't contain character: 't' > > ---------------------------------------- > > The Genbank file that caused it is as follows: > > ========================================= > > LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006 > > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial > > sequence; mitochondrial. > > ACCESSION DQ431065 > > VERSION DQ431065.1 GI:90102206 > > KEYWORDS . > > SOURCE Vaccinium corymbosum > > ORGANISM Vaccinium corymbosum > > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; > > Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; > > asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae; > > Vaccinium. > > ? > > REFERENCE 2 (bases 1 to 425) > > AUTHORS Naik,L.D. and Rowland,L.J. > > TITLE Expressed Sequence Tags of cDNA clones from subtracted library of > > Vaccinium corymbosum > > JOURNAL Unpublished (2005) > > FEATURES Location/Qualifiers > > source 1..425 > > /organism="Vaccinium corymbosum" > > /mol_type="genomic DNA" > > /cultivar="Bluecrop" > > /db_xref="taxon:69266" > > /tissue_type="Flower buds" > > /clone_lib="Subtracted cDNA library of Vaccinium > > corymbosum" > > /dev_stage="399 hour chill unit exposure" > > /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I" > > rRNA <1..>425 > > /product="16S ribosomal RNA" > > ORIGIN > > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac > > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt > > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt > > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta > > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat > > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta > > 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag > > 421 cgtaa > > // > 
> ==================================
> > I think it's the presence of the '?' at the beginning of the line?!?!
> > I'm not sure whether the information that was supposed to be present
> > instead of those question marks is absent from the original ASN.1
> > batch file or it's a bug in the NCBI ASN2GB software. It looks to me
> > that the former is the case since the file from NCBI website contains
> > much more information than the batch file. Just bringing this to
> > everyone's attention.
> >
> >
> > --
> > Best Regards,
> >
> >
> > Seth Johnson
> > Senior Bioinformatics Associate
> >
> > Ph: (202) 470-0900
> > Fx: (775) 251-0358
> >
> > On 6/2/06, Richard Holland wrote:
> > > Hi Seth.
> > >
> > > Your second point, about the authors string not being read correctly in
> > > Genbank format, has been fixed (or should have been if I got the code
> > > right!). Could you check the latest version of biojava-live out of CVS
> > > and give it another go? Basically the parser did not recognise the
> > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > NCBI, which is what I based the parser on.
> > ...
> > >
> > > cheers,
> > > Richard
> > >
> > >
> --
> Richard Holland (BioMart Team)
> EMBL-EBI
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> UNITED KINGDOM
> Tel: +44-(0)1223-494416
>
>

--
Best Regards,

Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358

From johnson.biotech at gmail.com Mon Jun 5 10:22:57 2006
From: johnson.biotech at gmail.com (Seth Johnson)
Date: Mon, 5 Jun 2006 10:22:57 -0400
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To: <1149497066.3947.12.camel@texas.ebi.ac.uk>
References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149497066.3947.12.camel@texas.ebi.ac.uk>
Message-ID:

I apologize again for not posting the stacktrace.
Here it is: ========================== org.biojava.bio.BioException: Could not read sequence at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) at exonhit.parsers.GenBankParser.main(GenBankParser.java:347) Caused by: java.lang.NullPointerException at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addFeatureProperty(SimpleRichSequenceBuilder.java:356) at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:853) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242) at javax.xml.parsers.SAXParser.parse(SAXParser.java:375) at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97) at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246) at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) ... 1 more Java Result: -1 ============================ Here's the XML that causes that exception (taken out of a bigger file of several hundred sequences): ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ DQ485973 1356 DNA linear ENV

08-MAY-2006

Uncultured Mollicutes bacterium clone P7 16S ribosomal RNA gene, partial sequence

DQ485973

DQ485973.1

gb|DQ485973.1| gi|94482885

ENV uncultured Mollicutes bacterium uncultured Mollicutes bacterium Bacteria; Firmicutes; Mollicutes; environmental samples 1 (bases 1 to 1356) 1..1356 Kostanjsek,R. Strus,J. Avgustin,G. A novel lineage of Mollicutes associated with the hindgut wall of the terrestrial isopod Porcellio scaber (Crustacea: Isopoda) Unpublished 2 (bases 1 to 1356) 1..1356 Kostanjsek,R. Strus,J. Avgustin,G. Direct Submission Submitted (07-APR-2006) Department of Biology, Biotechnical Faculty, University of Ljubljana, Vecna Pot 111, Ljubljana 1000, Slovenia

source 1..1356 1 1356 DQ485973.1 organism uncultured Mollicutes bacterium mol_type genomic DNA isolation_source isopod gut specific_host Porcellio scaber db_xref taxon:220137 clone P7 environmental_sample rRNA <1..>1356 1 1356 DQ485973.1 product 16S ribosomal RNA

AACGCTGGCGGCATGCCTAATACATGCAAGTCGAACGAACTGCCCCTGAACTAAAAGAAGTGCTTGCACGGAAGTTAGGGACGGAATTTGCAGTTAGTGGCGAACGGGTGAGTAACACGTGGGTAACCTACCATAGAGATTGGGATAACTGTTGGAAACGACAGCTAAAACCGAATAAGATTAATTCTACAAAGAGGAATAATTTAAATAGGCGTTTGCCTAGCTTTATGATGGGCCCGCGGTGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCGACGATGCATAGCCGGACTGAGAGGTTGAACGGCCACATTGGGACTGAGACACGGCCCAGACAACTACGGTTGGCAGCAGTAGGGAATTTTTCGCAATGGACGAAAGTCTGACGGAGCAATGCCGCGTGAGTGAAGACGGTTTTCGGATTGTAAAACTCTGTTGTGTGGGGGGAACACCTATATGAGAGGAATTGCTCATTAATTGACGCCACCACACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCGAGCGTTTTCCGGAATTATTGGGCGTAAAGAGCGTGTAGGCGGGTATGAATAAGTCTGGTGTGAAATCTAAGTGGCTCAACCACTTAAATTGCATTGGAAACTGCCAAACTAGAATACGGAGGGGTAAGTGGAATTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAGGGACACCAATGGCGAAGGCAGCTTAATGGACCCGAGATTGACGCTGAGACGCGAAAGCTTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTTAAACGATGAGTGCTAGGTATTGGATTAATTTCAGTGCCCGGAGTTAACGCATTAAGCCCTCCGCCTGAGGAGTACGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGTGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCAAAACTTGACATCCCCTGCGAAGCTATAGAAGTATAGTGGAGGTTATCAGGGTGACAGATGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTAGGTTAAGTCCTGCAACGAGCGCAACCCCTGTCTGCAGTTGCTACCATTAAGTTGAGGACTCTGCAGAGACTGCTAGTGTAAGCTAGAGGAAGGTGGGGATGACGTCAAATCATCATGCCTCTTACGTTTTGGGCTACACACGTGCTACAATGGCTGATACAAAGGGCTGCGAACTCGCGAGAGTAAGCGAATCCCAAAAAGTCAGTCTAAGTTCGGATTGAAGTTCTGCAACTCGACTTTCATGAAGTCGGAATGCNCTAGTAATACG ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ On 6/5/06, Richard Holland wrote: > This one should be fixed in CVS now. Typo on my behalf - I put in code > to make it work with both 87+ and pre-87 version of EMBL, then got the > regexes the wrong way round!! > > Could you send the full stacktrace for the INSDseq format problem you're > having? (The one where you say you've tracked it down to the qualifier > value being missing). I can't see anything wrong there, so I need the > stacktrace in order to know which exact sequence of events is throwing > the exception. 
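The "87+ and pre-87" remark above refers to the two layouts of the EMBL ID line: the older one (as in the DQ472184 record in this thread) carries no sequence version, while the release 87 layout adds an "SV" field. The following is a minimal sketch of how a reader might accept both, trying the newer pattern first and falling back to the older one. It is not BioJava's EMBLFormat code; the class name EmblIdLine, its fields, and both regular expressions are assumptions made for illustration. One way a NumberFormatException like the one reported can arise is when an unmatched (null) regex group is handed to Integer.parseInt, so the sketch keeps the version as a nullable Integer instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of BioJava): reads an EMBL ID line in either layout. */
final class EmblIdLine {
    // Release 87+ layout, e.g. "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."
    private static final Pattern RELEASE_87_PLUS = Pattern.compile(
            "^ID\\s+(\\S+);\\s+SV\\s+(\\d+);\\s+\\w+;\\s+([^;]+);\\s+\\w+;\\s+(\\w+);\\s+(\\d+)\\s+BP\\.\\s*$");
    // Pre-87 layout, e.g. "ID   DQ472184 standard; DNA; INV; 546 BP."
    private static final Pattern PRE_87 = Pattern.compile(
            "^ID\\s+(\\S+)\\s+\\w+;\\s+([^;]+);\\s+(\\w+);\\s+(\\d+)\\s+BP\\.\\s*$");

    final String name;
    final Integer version;      // null when the ID line carries no version (pre-87)
    final String moleculeType;
    final String division;
    final int length;

    private EmblIdLine(String name, Integer version, String moleculeType,
                       String division, int length) {
        this.name = name;
        this.version = version;
        this.moleculeType = moleculeType;
        this.division = division;
        this.length = length;
    }

    /** Tries the newer layout first, then the older one; rejects anything else. */
    static EmblIdLine parse(String line) {
        Matcher m = RELEASE_87_PLUS.matcher(line);
        if (m.matches()) {
            return new EmblIdLine(m.group(1), Integer.valueOf(m.group(2)),
                    m.group(3), m.group(4), Integer.parseInt(m.group(5)));
        }
        m = PRE_87.matcher(line);
        if (m.matches()) {
            // No version field in this layout, so nothing is parsed for it.
            return new EmblIdLine(m.group(1), null,
                    m.group(2), m.group(3), Integer.parseInt(m.group(4)));
        }
        throw new IllegalArgumentException("Unrecognised EMBL ID line: " + line);
    }

    public static void main(String[] args) {
        EmblIdLine oldStyle = parse("ID   DQ472184 standard; DNA; INV; 546 BP.");
        EmblIdLine newStyle = parse("ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.");
        System.out.println(oldStyle.name + " version=" + oldStyle.version + " length=" + oldStyle.length);
        System.out.println(newStyle.name + " version=" + newStyle.version + " length=" + newStyle.length);
    }
}
```

Throwing on an unrecognised line, rather than returning partially filled fields, keeps a layout mismatch visible at the ID line instead of surfacing later as a parse error on a null value.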
>
> cheers,
> Richard
>
>
> On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote:
> > Hi Richard,
> >
> > I made sure I have the latest source code from CVS compiled
> > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy
> > to report that GenBank issue is solved!!!!
> > As far as EMBL parsing, I apologize for not providing the stack dump
> > for ISSUE #1. Here's the dump of the exception:
> > --------------------------------------------------------
> > org.biojava.bio.BioException: Could not read sequence
> >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:359)
> > Caused by: java.lang.NumberFormatException: null
> >         at java.lang.Integer.parseInt(Integer.java:415)
> >         at java.lang.Integer.parseInt(Integer.java:497)
> >         at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299)
> >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> >         ... 1 more
> > Java Result: -1
> > -------------------------------------------------------
> > Here, again, is the code that I'm using to parse:
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > BufferedReader gbBR = null;
> > try {
> >     gbBR = new BufferedReader(new
> >         FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb"));
> > } catch (FileNotFoundException fnfe) {
> >     fnfe.printStackTrace();
> >     System.exit(-1);
> > }
> > Namespace gbNspace = (Namespace)
> >     RichObjectFactory.getObject(SimpleNamespace.class, new
> >     Object[]{"gbSpace"} );
> > RichSequenceIterator gbSeqs =
> >     RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace);
> > while (gbSeqs.hasNext()) {
> >     try {
> >         RichSequence rs = gbSeqs.nextRichSequence();
> >         NCBITaxon myTaxon = rs.getTaxon();
> >     } catch (BioException be) {
> >         be.printStackTrace();
> >         System.exit(-1);
> >     }
> > }
> > ~~~~~~~~~~~~~~~~~~~~~~~~~
> > And here's the EMBL file that I'm trying to parse:
> > +++++++++++++++++++++++++
> > ID   DQ472184 standard; DNA; INV; 546 BP.
> > XX
> > AC   DQ472184;
> > XX
> > SV   DQ472184.1
> > DT   15-MAY-2006
> > XX
> > DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> > DE   complete cds.
> > XX
> > KW   .
> > XX
> > OS   Trypanosoma cruzi strain CL Brener
> > OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > OC   Schizotrypanum.
> > XX
> > RN   [1]
> > RP   1-546
> > RA   De Melo L.D.B.;
> > RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > RL   Unpublished.
> > XX
> > RN   [2]
> > RP   1-546
> > RA   De Melo L.D.B.;
> > RT   ;
> > RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..546 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>546 > > FT /gene="ARC21" > > FT /note="TcARC21" > > FT mRNA <1..>546 > > FT /gene="ARC21" > > FT /product="actin-related protein 3" > > FT CDS 1..546 > > FT /gene="ARC21" > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 3" > > FT /protein_id="ABF13401.1" > > FT /db_xref="GI:93360014" > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > agttag 546 > > // > > ID DQ472185 standard; DNA; INV; 543 BP. > > XX > > AC DQ472185; > > XX > > SV DQ472185.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > DE complete cds. > > XX > > KW . 
> > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-543 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. > > XX > > RN [2] > > RP 1-543 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..543 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>543 > > FT /gene="ARC20" > > FT /note="TcARC20" > > FT mRNA <1..>543 > > FT /gene="ARC20" > > FT /product="actin-related protein 4" > > FT CDS 1..543 > > FT /gene="ARC20" > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 4" > > FT /protein_id="ABF13402.1" > > FT /db_xref="GI:93360016" > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > FT MKLNVNQRARRAAMEFFLALNFT" > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > attcaattta taattacttt 
cttgatggat attgatgctg acattgctgc aatgaagttg 480
> >      aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
> >      tga 543
> > //
> > +++++++++++++++++++++++++++++++++
> >
> > It looks to me like there's some kind of problem with parsing the
> > sequence version number. I even tried the sequence from test directory
> > (AY069118.em) with same outcome.
> >
> > Regards,
> >
> > Seth
> >
> > On 6/2/06, Richard Holland wrote:
> > > Hi Seth.
> > >
> > > Your second point, about the authors string not being read correctly in
> > > Genbank format, has been fixed (or should have been if I got the code
> > > right!). Could you check the latest version of biojava-live out of CVS
> > > and give it another go? Basically the parser did not recognise the
> > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > NCBI, which is what I based the parser on.
> > >
> > > I've set it up now so that it reads the CONSRTM tag, but the value is
> > > merged with the authors tag with (consortium) appended. There will still
> > > be problems if the consortium value has commas in it - not sure how to
> > > fix this yet.
> > >
> > > Your first point is harder to solve because you did not provide a
> > > complete stack trace for the exceptions you are getting. The complete
> > > stack trace would enable me to identify exactly where things are going
> > > wrong and give me a better chance of fixing them. Could you send the
> > > stack trace, and I'll see what I can do.
> > >
> > > cheers,
> > > Richard
> > >
> > >
> > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote:
> > > > Hi All,
> > > >
> > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some
> > > > clarification on several issues that I'm having.
> > > > I am developing a parser that would take as input "NCBI Incremental
> > > > ASN.1 Sequence Updates to Genbank" files (
> > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ), gunzip them, and use the
> > > > ASN2GB converter (
> > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
> > > > resulting sequences to a format parsable by BioJava(X) (
> > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
> > > > my problems start.
> > > >
> > > > ISSUE 1:
> > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank
> > > > (default), EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
> > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank
> > > > format is recognized by the
> > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
> > > > some exceptions that I'll describe in issue #2. This is the code that
> > > > I'm using to parse, for example, the EMBL output:
> > > >
> > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
> > > > Namespace gbNspace = (Namespace)
> > > >     RichObjectFactory.getObject(SimpleNamespace.class, new
> > > >     Object[]{"gbSpace"} );
> > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace);
> > > > while (gbSeqs.hasNext()) {
> > > >     try {
> > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > >         // Further processing of RichSequence object from here
> > > >
> > > >     } catch (BioException be) {
> > > >         be.printStackTrace();
> > > >     }
> > > > }
> > > >
> > > > The multi-sequence EMBL file looks like this:
> > > > ---------------------------------------------------------------------------------
> > > > ID   DQ472184 standard; DNA; INV; 546 BP.
> > > > XX
> > > > AC   DQ472184;
> > > > XX
> > > > SV   DQ472184.1
> > > > DT   15-MAY-2006
> > > > XX
> > > > DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> > > > DE   complete cds.
> > > > XX
> > > > KW   .
> > > > XX > > > > OS Trypanosoma cruzi strain CL Brener > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > OC Schizotrypanum. > > > > XX > > > > RN [1] > > > > RP 1-546 > > > > RA De Melo L.D.B.; > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > RL Unpublished. > > > > XX > > > > RN [2] > > > > RP 1-546 > > > > RA De Melo L.D.B.; > > > > RT ; > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > RL 21949-900, Brazil > > > > XX > > > > FH Key Location/Qualifiers > > > > FH > > > > FT source 1..546 > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > FT /mol_type="genomic DNA" > > > > FT /strain="CL Brener" > > > > FT /db_xref="taxon:353153" > > > > FT gene <1..>546 > > > > FT /gene="ARC21" > > > > FT /note="TcARC21" > > > > FT mRNA <1..>546 > > > > FT /gene="ARC21" > > > > FT /product="actin-related protein 3" > > > > FT CDS 1..546 > > > > FT /gene="ARC21" > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > FT member of Arp2/3 complex" > > > > FT /codon_start=1 > > > > FT /product="actin-related protein 3" > > > > FT /protein_id="ABF13401.1" > > > > FT /db_xref="GI:93360014" > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt 
tgaagcgtga agaggcccat 300 > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > agttag 546 > > > > // > > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > > XX > > > > AC DQ472185; > > > > XX > > > > SV DQ472185.1 > > > > DT 15-MAY-2006 > > > > XX > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > DE complete cds. > > > > XX > > > > KW . > > > > XX > > > > OS Trypanosoma cruzi strain CL Brener > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > OC Schizotrypanum. > > > > XX > > > > RN [1] > > > > RP 1-543 > > > > RA De Melo L.D.B.; > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > RL Unpublished. > > > > XX > > > > RN [2] > > > > RP 1-543 > > > > RA De Melo L.D.B.; > > > > RT ; > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > RL 21949-900, Brazil > > > > XX > > > > FH Key Location/Qualifiers > > > > FH > > > > FT source 1..543 > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > FT /mol_type="genomic DNA" > > > > FT /strain="CL Brener" > > > > FT /db_xref="taxon:353153" > > > > FT gene <1..>543 > > > > FT /gene="ARC20" > > > > FT /note="TcARC20" > > > > FT mRNA <1..>543 > > > > FT /gene="ARC20" > > > > FT /product="actin-related protein 4" > > > > FT CDS 1..543 > > > > FT /gene="ARC20" > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > FT member of Arp2/3 complex" > > > > FT /codon_start=1 > > > > FT /product="actin-related protein 4" > > > > FT /protein_id="ABF13402.1" > > > > FT /db_xref="GI:93360016" > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > > tga 543 > > > > // > > > > ----------------------------------------------------------------------- > > > > I get an exception message "Could Not 
Read Sequence". Same thing > > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > > > with the following INSDset file (beginning of the file): > > > > > > > > > > > > > > > > > > > > DQ022078 > > > > 16729 > > > > DNA > > > > linear > > > > ENV > > > > 15-MAY-2006 > > > > 15-MAY-2006 > > > > Uncultured bacterium WWRS-2005 putative > > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > > > class C (estA3), putative permease (a3.005), putative transmembrane > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > > > protein (a3.012), putative membrane protease subunit (a3.013), > > > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > > > hypothetical protein (a3.017) genes, complete cds > > > > DQ022078 > > > > > > > > gb|DQ022078.1| > > > > gi|71842722 > > > > > > > > > > > > ENV > > > > > > > > > > > > > > > > ? > > > > 1..16729 > > > > > > > > Schmeisser,C. > > > > Elend,C. > > > > Streit,W.R. > > > > > > > > Isolation and biochemical characterization > > > > of two novel metagenome derived esterases > > > > Appl. Environ. Microbiol. 0:0-0 > > > > (2006) > > > > > > > > > > > > ? > > > > 1..16729 > > > > > > > > Schmeisser,C. > > > > Elend,C. > > > > Streit,W.R. > > > > > > > > Submitted (29-APR-2005) to the > > > > EMBL/GenBank/DDBJ databases. 
Molekulare Enzymtechnologie, University
> > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
> > > > Germany
> > > >
> > > >
> > > > So my question is whether the ASN2GB produces output that's
> > > > incompatible with BioJava parsers or is there a problem with the
> > > > sequences themselves or the problems with the majority of parsers???
> > > > Could it be that I'm using the API wrongly for the above formats,
> > > > although GenBank parser works as advertised with some exceptions
> > > > below:
> > > >
> > > > ISSUE #2:
> > > > When I try to parse GenBank files using the following code:
> > > >
> > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
> > > > Namespace gbNspace = (Namespace)
> > > >     RichObjectFactory.getObject(SimpleNamespace.class, new
> > > >     Object[]{"gbSpace"} );
> > > > RichSequenceIterator gbSeqs =
> > > >     RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace);
> > > > while (gbSeqs.hasNext()) {
> > > >     try {
> > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > >         // Further processing of RichSequence object from here
> > > >
> > > >     } catch (BioException be) {
> > > >         be.printStackTrace();
> > > >     }
> > > > }
> > > >
> > > > Genbank file in question:
> > > >
> > > > LOCUS       BC074905                 838 bp    mRNA    linear   PRI 15-APR-2006
> > > > DEFINITION  Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
> > > >             IMAGE:30915482), complete cds.
> > > > ACCESSION   BC074905
> > > > VERSION     BC074905.2  GI:50959825
> > > > KEYWORDS    MGC.
> > > > SOURCE      Homo sapiens (human)
> > > >   ORGANISM  Homo sapiens
> > > >             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> > > >             Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
> > > >             Catarrhini; Hominidae; Homo.
> > > > REFERENCE 1 (bases 1 to 838) > > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > > > CONSRTM Mammalian Gene Collection Program Team > > > > TITLE Generation and initial analysis of more than 15,000 full-length > > > > human and mouse cDNA sequences > > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > > > > PUBMED 12477932 > > > > REFERENCE 2 (bases 1 to 838) > > > > CONSRTM NIH MGC Project > > > > TITLE Direct Submission > > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. 
> > > > Contact: MGC help desk > > > > Email: cgapbs-r at mail.nih.gov > > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > > > Center > > > > cDNA Library Preparation: British Columbia Cancer Research Center > > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > > > DNA Sequencing by: Genome Sequence Centre, > > > > BC Cancer Agency, Vancouver, BC, Canada > > > > info at bcgsc.bc.ca > > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > > > > > Clone distribution: MGC clone distribution information can be found > > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > > > > > Differences found between this sequence and the human reference > > > > genome (build 36) are described in misc_difference features below. 
> > > > FEATURES Location/Qualifiers > > > > source 1..838 > > > > /organism="Homo sapiens" > > > > /mol_type="mRNA" > > > > /db_xref="taxon:9606" > > > > /clone="MGC:104038 IMAGE:30915482" > > > > /tissue_type="Lung, PCR rescued clones" > > > > /clone_lib="NIH_MGC_273" > > > > /lab_host="DH10B" > > > > /note="Vector: pCR4 Topo TA with reversed insert" > > > > gene 1..838 > > > > /gene="KLK14" > > > > /note="synonym: KLK-L6" > > > > /db_xref="GeneID:43847" > > > > /db_xref="HGNC:6362" > > > > /db_xref="IMGT/GENE-DB:6362" > > > > /db_xref="MIM:606135" > > > > CDS 49..804 > > > > /gene="KLK14" > > > > /codon_start=1 > > > > /product="KLK14 protein" > > > > /protein_id="AAH74905.1" > > > > /db_xref="GI:50959826" > > > > /db_xref="GeneID:43847" > > > > /db_xref="HGNC:6362" > > > > /db_xref="IMGT/GENE-DB:6362" > > > > /db_xref="MIM:606135" > > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > > > misc_difference 98 > > > > /gene="KLK14" > > > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > > > difference: 'R' in cDNA, 'Q' in the human genome." > > > > misc_difference 133 > > > > /gene="KLK14" > > > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > > > ORIGIN > > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > > > // > > > > > > > > I get the following exception: > > > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > > > org.biojava.bio.BioException: Could not read sequence > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > > > 
-----------------------------------------------------------------------
> > > >
> > > > I'm trying to see what could be the problem with this particular
> > > > sequence. Looks to me like the AUTHORS portion is not getting parsed
> > > > correctly. Any ideas would be greatly appreciated!
> > > >
> > > --
> > > Richard Holland (BioMart Team)
> > > EMBL-EBI
> > > Wellcome Trust Genome Campus
> > > Hinxton
> > > Cambridge CB10 1SD
> > > UNITED KINGDOM
> > > Tel: +44-(0)1223-494416
> > >
> >
> --
> Richard Holland (BioMart Team)
> EMBL-EBI
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> UNITED KINGDOM
> Tel: +44-(0)1223-494416
>

--
Best Regards,

Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358

From johnson.biotech at gmail.com Mon Jun 5 11:53:46 2006
From: johnson.biotech at gmail.com (Seth Johnson)
Date: Mon, 5 Jun 2006 11:53:46 -0400
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To: <1149520598.3947.38.camel@texas.ebi.ac.uk>
References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149497066.3947.12.camel@texas.ebi.ac.uk> <1149520598.3947.38.camel@texas.ebi.ac.uk>
Message-ID:

:) I got another one for you:
=========================
org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
        at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -3
        at java.lang.String.substring(String.java:1768)
        at java.lang.String.substring(String.java:1735)
        at org.biojavax.bio.seq.io.EMBLFormat.readSection(EMBLFormat.java:672)
        at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:281)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
        ...
1 more Java Result: -1 ========================= File used to produce the above: ~~~~~~~~~~~~~~~~~~~~~~~~~ ID DQ472184 standard; DNA; INV; 546 BP. XX AC DQ472184; XX SV DQ472184.1 DT 15-MAY-2006 XX DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, DE complete cds. XX KW . XX OS Trypanosoma cruzi strain CL Brener OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; OC Schizotrypanum. XX RN [1] RP 1-546 RA De Melo L.D.B.; RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; RL Unpublished. XX RN [2] RP 1-546 RA De Melo L.D.B.; RT ; RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ RL 21949-900, Brazil XX FH Key Location/Qualifiers FH FT source 1..546 FT /organism="Trypanosoma cruzi strain CL Brener" FT /mol_type="genomic DNA" FT /strain="CL Brener" FT /db_xref="taxon:353153" FT gene <1..>546 FT /gene="ARC21" FT /note="TcARC21" FT mRNA <1..>546 FT /gene="ARC21" FT /product="actin-related protein 3" FT CDS 1..546 FT /gene="ARC21" FT /note="actin-binding protein; ARPC3 21 kDa; putative FT member of Arp2/3 complex" FT /codon_start=1 FT /product="actin-related protein 3" FT /protein_id="ABF13401.1" FT /db_xref="GI:93360014" FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL FT FPEKDGTGNKFWMAFAKRPFLASS" atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 
tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 agttag 546 // ID DQ472185 standard; DNA; INV; 543 BP. XX AC DQ472185; XX SV DQ472185.1 DT 15-MAY-2006 XX DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, DE complete cds. XX KW . XX OS Trypanosoma cruzi strain CL Brener OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; OC Schizotrypanum. XX RN [1] RP 1-543 RA De Melo L.D.B.; RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; RL Unpublished. XX RN [2] RP 1-543 RA De Melo L.D.B.; RT ; RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ RL 21949-900, Brazil XX FH Key Location/Qualifiers FH FT source 1..543 FT /organism="Trypanosoma cruzi strain CL Brener" FT /mol_type="genomic DNA" FT /strain="CL Brener" FT /db_xref="taxon:353153" FT gene <1..>543 FT /gene="ARC20" FT /note="TcARC20" FT mRNA <1..>543 FT /gene="ARC20" FT /product="actin-related protein 4" FT CDS 1..543 FT /gene="ARC20" FT /note="actin-binding protein; ARPC4 20 kDa; putative FT member of Arp2/3 complex" FT /codon_start=1 FT /product="actin-related protein 4" FT /protein_id="ABF13402.1" FT /db_xref="GI:93360016" FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA FT MKLNVNQRARRAAMEFFLALNFT" atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 atatctgtat cgtttctcaa gagcgacgct 
attgcagaga ttattgcccg aaagtacgtt 300 ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 tga 543 // ~~~~~~~~~~~~~~~~~~~~~~~~~ On 6/5/06, Richard Holland wrote: > Doh! > > I am in desperate need of coffee methinks... that's the second error in > EMBLFormat directly related to me being stupid when I cut-and-pasted the > stuff for the new 87+ ID line format... > > Should be fixed now in CVS (as of about 30 seconds ago). > > cheers, > Richard > > On Mon, 2006-06-05 at 11:05 -0400, Seth Johnson wrote: > > Hi Richard, > > > > I got another exception on EMBL format: > > ============================= > > org.biojava.bio.BioException: Could not read sequence > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:347) > > Caused by: java.lang.IllegalStateException: No match found > > at java.util.regex.Matcher.group(Matcher.java:461) > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:311) > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ... 1 more > > Java Result: -1 > > ============================= > > I used the same file from the test directory (AY069118.em) > > > > > > Seth > > > > On 6/5/06, Richard Holland wrote: > > > This one should be fixed in CVS now. Typo on my behalf - I put in code > > > to make it work with both 87+ and pre-87 versions of EMBL, then got the > > > regexes the wrong way round!! > > > > > ... > > > > > > cheers, > > > Richard > > > > > > > > > On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote: > > > > Hi Richard, > > > > > > > > I made sure I have the latest source code from CVS compiled > > > > (EMBLFormat.java & GenbankFormat.java are from 05/24/06).
I'm happy > > > > to report that GenBank issue is solved!!!! > > > > As far as EMBL parsing, I apologize for not providing the stack dump > > > > for ISSUE #1. Here's the dump of the exception: > > > > -------------------------------------------------------- > > > > org.biojava.bio.BioException: Could not read sequence > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:359) > > > > Caused by: java.lang.NumberFormatException: null > > > > at java.lang.Integer.parseInt(Integer.java:415) > > > > at java.lang.Integer.parseInt(Integer.java:497) > > > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299) > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > ... 1 more > > > > Java Result: -1 > > > > ------------------------------------------------------- > > > > Here, again, is the code that I'm using to to parse: > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > BufferedReader gbBR = null; > > > > try { > > > > gbBR = new BufferedReader(new > > > > FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb")); > > > > } catch (FileNotFoundException fnfe) { > > > > fnfe.printStackTrace(); > > > > System.exit(-1); > > > > } > > > > Namespace gbNspace = (Namespace) > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > Object[]{"gbSpace"} ); > > > > RichSequenceIterator gbSeqs = > > > > RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace); > > > > while (gbSeqs.hasNext()) { > > > > try { > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > NCBITaxon myTaxon = rs.getTaxon(); > > > > }catch (BioException be){ > > > > be.printStackTrace(); > > > > System.exit(-1); > > > > } > > > > } > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > And here's the EMBL file that I'm trying to parse: > > > > +++++++++++++++++++++++++ > > > > ID DQ472184 standard; DNA; INV; 546 BP. 
> > > > XX > > > > AC DQ472184; > > > > XX > > > > SV DQ472184.1 > > > > DT 15-MAY-2006 > > > > XX > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > > DE complete cds. > > > > XX > > > > KW . > > > > XX > > > > OS Trypanosoma cruzi strain CL Brener > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > OC Schizotrypanum. > > > > XX > > > > RN [1] > > > > RP 1-546 > > > > RA De Melo L.D.B.; > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > RL Unpublished. > > > > XX > > > > RN [2] > > > > RP 1-546 > > > > RA De Melo L.D.B.; > > > > RT ; > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > RL 21949-900, Brazil > > > > XX > > > > FH Key Location/Qualifiers > > > > FH > > > > FT source 1..546 > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > FT /mol_type="genomic DNA" > > > > FT /strain="CL Brener" > > > > FT /db_xref="taxon:353153" > > > > FT gene <1..>546 > > > > FT /gene="ARC21" > > > > FT /note="TcARC21" > > > > FT mRNA <1..>546 > > > > FT /gene="ARC21" > > > > FT /product="actin-related protein 3" > > > > FT CDS 1..546 > > > > FT /gene="ARC21" > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > FT member of Arp2/3 complex" > > > > FT /codon_start=1 > > > > FT /product="actin-related protein 3" > > > > FT /protein_id="ABF13401.1" > > > > FT /db_xref="GI:93360014" > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac 
cgcggatgga tgaaatgatt 120 > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > agttag 546 > > > > // > > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > > XX > > > > AC DQ472185; > > > > XX > > > > SV DQ472185.1 > > > > DT 15-MAY-2006 > > > > XX > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > DE complete cds. > > > > XX > > > > KW . > > > > XX > > > > OS Trypanosoma cruzi strain CL Brener > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > OC Schizotrypanum. > > > > XX > > > > RN [1] > > > > RP 1-543 > > > > RA De Melo L.D.B.; > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > RL Unpublished. > > > > XX > > > > RN [2] > > > > RP 1-543 > > > > RA De Melo L.D.B.; > > > > RT ; > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > RL 21949-900, Brazil > > > > XX > > > > FH Key Location/Qualifiers > > > > FH > > > > FT source 1..543 > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > FT /mol_type="genomic DNA" > > > > FT /strain="CL Brener" > > > > FT /db_xref="taxon:353153" > > > > FT gene <1..>543 > > > > FT /gene="ARC20" > > > > FT /note="TcARC20" > > > > FT mRNA <1..>543 > > > > FT /gene="ARC20" > > > > FT /product="actin-related protein 4" > > > > FT CDS 1..543 > > > > FT /gene="ARC20" > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > FT member of Arp2/3 complex" > > > > FT /codon_start=1 > > > > FT /product="actin-related protein 4" > > > > FT /protein_id="ABF13402.1" > > > > FT /db_xref="GI:93360016" > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > > tga 543 > > > > // > > > > +++++++++++++++++++++++++++++++++ > > > > > > > > It looks to me like there's some kind of problem with parsing the > > 
> > sequence version number. I even tried the sequence from test directory > > > > (AY069118.em) with same outcome. > > > > > > > > Regards, > > > > > > > > Seth > > > > > > > > On 6/2/06, Richard Holland wrote: > > > > > Hi Seth. > > > > > > > > > > Your second point, about the authors string not being read correctly in > > > > > Genbank format, has been fixed (or should have been if I got the code > > > > > right!). Could you check the latest version of biojava-live out of CVS > > > > > and give it another go? Basically the parser did not recognise the > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by > > > > > NCBI, which is what I based the parser on. > > > > > > > > > > I've set it up now so that it reads the CONSRTM tag, but the value is > > > > > merged with the authors tag with (consortium) appended. There will still > > > > > be problems if the consortium value has commas in it - not sure how to > > > > > fix this yet. > > > > > > > > > > Your first point is harder to solve because you did not provide a > > > > > complete stack trace for the exceptions you are getting. The complete > > > > > stack trace would enable me to identify exactly where things are going > > > > > wrong and give me a better chance of fixing them. Could you send the > > > > > stack trace, and I'll see what I can do. > > > > > > > > > > cheers, > > > > > Richard > > > > > > > > > > > > > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote: > > > > > > Hi All, > > > > > > > > > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > > > > > > clarification on several issues that I'm having. 
> > > > > > I am developing a parser that would take as input "NCBI Incremental > > > > > > ASN.1 Sequence Updates to Genbank" files ( > > > > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > > > > > ASN2GB converter ( > > > > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > > > > > resulting sequences to a format parsable by BioJava(X) ( > > > > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > > > > > my problems start. > > > > > > > > > > > > ISSUE 1: > > > > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > > > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > > > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > > > > > > format is recognized by the > > > > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > > > > > some exceptions that I'll describe in issue #2. This is the code that > > > > > > I'm using to parse, for example, the EMBL output: > > > > > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > > > > > Namespace gbNspace = (Namespace) > > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > > Object[]{"gbSpace"} ); > > > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > > > > > while (gbSeqs.hasNext()) { > > > > > > try { > > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > > // Further processing or RichSequence object from here > > > > > > > > > > > > } catch (BioException be){ > > > > > > be.printStackTrace(); > > > > > > } > > > > > > } > > > > > > > > > > > > The multi-sequence EMBL file looks like this: > > > > > > --------------------------------------------------------------------------------- > > > > > > ID DQ472184 standard; DNA; INV; 546 BP. 
> > > > > > XX > > > > > > AC DQ472184; > > > > > > XX > > > > > > SV DQ472184.1 > > > > > > DT 15-MAY-2006 > > > > > > XX > > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > > > > DE complete cds. > > > > > > XX > > > > > > KW . > > > > > > XX > > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > > OC Schizotrypanum. > > > > > > XX > > > > > > RN [1] > > > > > > RP 1-546 > > > > > > RA De Melo L.D.B.; > > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > > RL Unpublished. > > > > > > XX > > > > > > RN [2] > > > > > > RP 1-546 > > > > > > RA De Melo L.D.B.; > > > > > > RT ; > > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > > RL 21949-900, Brazil > > > > > > XX > > > > > > FH Key Location/Qualifiers > > > > > > FH > > > > > > FT source 1..546 > > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > > FT /mol_type="genomic DNA" > > > > > > FT /strain="CL Brener" > > > > > > FT /db_xref="taxon:353153" > > > > > > FT gene <1..>546 > > > > > > FT /gene="ARC21" > > > > > > FT /note="TcARC21" > > > > > > FT mRNA <1..>546 > > > > > > FT /gene="ARC21" > > > > > > FT /product="actin-related protein 3" > > > > > > FT CDS 1..546 > > > > > > FT /gene="ARC21" > > > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > > > FT member of Arp2/3 complex" > > > > > > FT /codon_start=1 > > > > > > FT /product="actin-related protein 3" > > > > > > FT /protein_id="ABF13401.1" > > > > > > FT /db_xref="GI:93360014" > > > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > > > FT 
SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > > > agttag 546 > > > > > > // > > > > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > > > > XX > > > > > > AC DQ472185; > > > > > > XX > > > > > > SV DQ472185.1 > > > > > > DT 15-MAY-2006 > > > > > > XX > > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > > > DE complete cds. > > > > > > XX > > > > > > KW . > > > > > > XX > > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > > OC Schizotrypanum. > > > > > > XX > > > > > > RN [1] > > > > > > RP 1-543 > > > > > > RA De Melo L.D.B.; > > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > > RL Unpublished. > > > > > > XX > > > > > > RN [2] > > > > > > RP 1-543 > > > > > > RA De Melo L.D.B.; > > > > > > RT ; > > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > > RL 21949-900, Brazil > > > > > > XX > > > > > > FH Key Location/Qualifiers > > > > > > FH > > > > > > FT source 1..543 > > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > > FT /mol_type="genomic DNA" > > > > > > FT /strain="CL Brener" > > > > > > FT /db_xref="taxon:353153" > > > > > > FT gene <1..>543 > > > > > > FT /gene="ARC20" > > > > > > FT /note="TcARC20" > > > > > > FT mRNA <1..>543 > > > > > > FT /gene="ARC20" > > > > > > FT /product="actin-related protein 4" > > > > > > FT CDS 1..543 > > > > > > FT /gene="ARC20" > > > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > > > FT member of Arp2/3 complex" > > > > > > FT /codon_start=1 > > > > > > FT /product="actin-related protein 4" > > > > > > FT /protein_id="ABF13402.1" > > > > > > FT /db_xref="GI:93360016" > > > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > 
> > > > tga 543 > > > > > > // > > > > > > ----------------------------------------------------------------------- > > > > > > I get an exception message "Could Not Read Sequence". Same thing > > > > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > > > > > with the following INSDset file (beginning of the file): > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > DQ022078 > > > > > > 16729 > > > > > > DNA > > > > > > linear > > > > > > ENV > > > > > > 15-MAY-2006 > > > > > > 15-MAY-2006 > > > > > > Uncultured bacterium WWRS-2005 putative > > > > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > > > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > > > > > class C (estA3), putative permease (a3.005), putative transmembrane > > > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > > > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > > > > > protein (a3.012), putative membrane protease subunit (a3.013), > > > > > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > > > > > hypothetical protein (a3.017) genes, complete cds > > > > > > DQ022078 > > > > > > > > > > > > gb|DQ022078.1| > > > > > > gi|71842722 > > > > > > > > > > > > > > > > > > ENV > > > > > > > > > > > > > > > > > > > > > > > > ? > > > > > > 1..16729 > > > > > > > > > > > > Schmeisser,C. > > > > > > Elend,C. > > > > > > Streit,W.R. > > > > > > > > > > > > Isolation and biochemical characterization > > > > > > of two novel metagenome derived esterases > > > > > > Appl. Environ. Microbiol. 0:0-0 > > > > > > (2006) > > > > > > > > > > > > > > > > > > ? > > > > > > 1..16729 > > > > > > > > > > > > Schmeisser,C. > > > > > > Elend,C. > > > > > > Streit,W.R. 
> > > > > > > > > > > > Submitted (29-APR-2005) to the > > > > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University > > > > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > > > > > Germany > > > > > > > > > > > > > > > > > > > > > > > > So my question is whether ASN2GB produces output that's > > > > > > incompatible with the BioJava parsers, or is there a problem with the > > > > > > sequences themselves, or are there problems with the majority of parsers??? > > > > > > Could it be that I'm using the API wrongly for the above formats, > > > > > > although the GenBank parser works as advertised with some exceptions > > > > > > below: > > > > > > > > > > > > ISSUE #2: > > > > > > When I try to parse GenBank files using the following code: > > > > > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > > > > > Namespace gbNspace = (Namespace) > > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > > Object[]{"gbSpace"} ); > > > > > > RichSequenceIterator gbSeqs = > > > > > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > > > > > while (gbSeqs.hasNext()) { > > > > > > try { > > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > > // Further processing of RichSequence object from here > > > > > > > > > > > > } catch (BioException be){ > > > > > > be.printStackTrace(); > > > > > > } > > > > > > } > > > > > > > > > > > > Genbank file in question: > > > > > > > > > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > > > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > > > > > IMAGE:30915482), complete cds. > > > > > > ACCESSION BC074905 > > > > > > VERSION BC074905.2 GI:50959825 > > > > > > KEYWORDS MGC.
> > > > > > SOURCE Homo sapiens (human) > > > > > > ORGANISM Homo sapiens > > > > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > > > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > > > > > Catarrhini; Hominidae; Homo. > > > > > > REFERENCE 1 (bases 1 to 838) > > > > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > > > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > > > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > > > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > > > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > > > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > > > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > > > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > > > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > > > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > > > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > > > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > > > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > > > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > > > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > > > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > > > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > > > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > > > > > CONSRTM Mammalian Gene Collection Program Team > > > > > > TITLE Generation and initial analysis of more than 15,000 full-length > > > > > > human and mouse cDNA sequences > > > > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 
99 (26), 16899-16903 (2002) > > > > > > PUBMED 12477932 > > > > > > REFERENCE 2 (bases 1 to 838) > > > > > > CONSRTM NIH MGC Project > > > > > > TITLE Direct Submission > > > > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > > > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > > > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > > > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. > > > > > > Contact: MGC help desk > > > > > > Email: cgapbs-r at mail.nih.gov > > > > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > > > > > Center > > > > > > cDNA Library Preparation: British Columbia Cancer Research Center > > > > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > > > > > DNA Sequencing by: Genome Sequence Centre, > > > > > > BC Cancer Agency, Vancouver, BC, Canada > > > > > > info at bcgsc.bc.ca > > > > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > > > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > > > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > > > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > > > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > > > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > > > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > > > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > > > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > > > > > > > > > Clone distribution: MGC clone distribution information can be found > > > > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > > > > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > > > > > > > > > Differences found between this sequence and the human reference > > > > > > genome (build 36) are described in misc_difference features below. 
> > > > > > FEATURES Location/Qualifiers > > > > > > source 1..838 > > > > > > /organism="Homo sapiens" > > > > > > /mol_type="mRNA" > > > > > > /db_xref="taxon:9606" > > > > > > /clone="MGC:104038 IMAGE:30915482" > > > > > > /tissue_type="Lung, PCR rescued clones" > > > > > > /clone_lib="NIH_MGC_273" > > > > > > /lab_host="DH10B" > > > > > > /note="Vector: pCR4 Topo TA with reversed insert" > > > > > > gene 1..838 > > > > > > /gene="KLK14" > > > > > > /note="synonym: KLK-L6" > > > > > > /db_xref="GeneID:43847" > > > > > > /db_xref="HGNC:6362" > > > > > > /db_xref="IMGT/GENE-DB:6362" > > > > > > /db_xref="MIM:606135" > > > > > > CDS 49..804 > > > > > > /gene="KLK14" > > > > > > /codon_start=1 > > > > > > /product="KLK14 protein" > > > > > > /protein_id="AAH74905.1" > > > > > > /db_xref="GI:50959826" > > > > > > /db_xref="GeneID:43847" > > > > > > /db_xref="HGNC:6362" > > > > > > /db_xref="IMGT/GENE-DB:6362" > > > > > > /db_xref="MIM:606135" > > > > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > > > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > > > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > > > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > > > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > > > > > misc_difference 98 > > > > > > /gene="KLK14" > > > > > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > > > > > difference: 'R' in cDNA, 'Q' in the human genome." > > > > > > misc_difference 133 > > > > > > /gene="KLK14" > > > > > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > > > > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > > > > > ORIGIN > > > > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > > > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > > > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > > > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > > > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > > > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > > > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > > > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > > > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > > > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > > > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > > > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > > > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > > > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > > > > > // > > > > > > > > > > > > I get the following exception: > > > > > > > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > > > > > org.biojava.bio.BioException: Could not read sequence > > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > > > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > > > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > 
> > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > > > > > > > ----------------------------------------------------------------------- > > > > > > > > > > > > I'm trying to see what could be the problem with this particular > > > > > > sequence. Looks to me like the AUTHORS portion is not getting parsed > > > > > > correctly. Any ideas would be greatly appreciated! > > > > > > > > > > > -- > > > > > Richard Holland (BioMart Team) > > > > > EMBL-EBI > > > > > Wellcome Trust Genome Campus > > > > > Hinxton > > > > > Cambridge CB10 1SD > > > > > UNITED KINGDOM > > > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > > > > > > > > > > > -- > > > Richard Holland (BioMart Team) > > > EMBL-EBI > > > Wellcome Trust Genome Campus > > > Hinxton > > > Cambridge CB10 1SD > > > UNITED KINGDOM > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > -- > Richard Holland (BioMart Team) > EMBL-EBI > Wellcome Trust Genome Campus > Hinxton > Cambridge CB10 1SD > UNITED KINGDOM > Tel: +44-(0)1223-494416 > > -- Best Regards, Seth Johnson Senior Bioinformatics Associate Ph: (202) 470-0900 Fx: (775) 251-0358 From johnson.biotech at gmail.com Mon Jun 5 12:28:30 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Mon, 5 Jun 2006 12:28:30 -0400 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: <1149522313.3947.48.camel@texas.ebi.ac.uk> References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149520267.3947.36.camel@texas.ebi.ac.uk> <1149522313.3947.48.camel@texas.ebi.ac.uk> Message-ID: I agree with you on that one. However, the problem might be a little deeper. The same '?' characters appear in the INSDseq format, bounded by tags, and cause the following exception. This tells me that the '?' characters are actually values that are being parsed incorrectly. Further examination of the .dtd reveals that INSDseqFormat.java is tailored towards the INSDSeq v.
1.3 whereas the files I obtain are in the INSDSeq v. 1.4 (which among other things contain a new tag ). Here're links to both .dtd's: http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt I think it might be worth accommodating changes for the INSDseq format, not sure how that would affect the '?' in Genbank. Seth ====================== org.biojava.bio.BioException: Could not read sequence at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) Caused by: org.biojava.bio.seq.io.ParseException: org.biojava.bio.seq.io.ParseException: Bad reference line found: ? at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250) at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) ... 1 more Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line found: ? at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242) at 
javax.xml.parsers.SAXParser.parse(SAXParser.java:375) at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97) at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246) ... 2 more Java Result: -1 ====================== ~~~~~~~~~~~~~~~~~~~~~~ ? 1..16732 Bjornerfeldt,S. Webster,M.T. Vila,C. Relaxation of Selective Constraint on Dog Mitochondrial DNA Following Domestication Unpublished ? 1..16732 Bjornerfeldt,S. Webster,M.T. Vila,C. Submitted (06-APR-2006) to the EMBL/GenBank/DDBJ databases. Evolutionary Biology, Evolutionary Biology, Norbyvagen 18D, Uppsala 752 36, Sweden ~~~~~~~~~~~~~~~~~~~~~~ On 6/5/06, Richard Holland wrote: > Hmmm... interesting. I _could_ put in a special case that ignores the > question marks, but that wouldn't be 'nice' really - this is more of a > problem with the program that is producing the Genbank files than a > problem with the parser trying to read them. '?' is not a valid tag in > the official Genbank format, and has no meaning attached to it that I > can work out, so I'm reluctant to make the parser recognise it. > > I'd suggest you contact the people who write the software you are using > to produce the Genbank files and ask them if they could stick to the > rules! > > In the meantime you could work around the problem by stripping the > question marks in some kind of pre-processor before passing it onto > BioJavaX for parsing. > > cheers, > Richard > > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote: > > Removing '?' 
(or several of them in my case) avoids the following exception: > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > org.biojava.bio.BioException: Could not read sequence > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957 > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ... 1 more > > Java Result: -1 > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > I don't know where that previous tokenization problem came from since > > I can no longer reproduce it. This time it's more or less straight > > forward. > > Here's the original file with question marks: > > ============================ > > LOCUS DQ415957 1437 bp mRNA linear VRT 01-JUN-2006 > > DEFINITION Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA, > > complete cds. > > ACCESSION DQ415957 > > VERSION DQ415957.1 GI:89513612 > > KEYWORDS . > > SOURCE Unknown. > > ORGANISM Unknown. > > Unclassified. > > ? > > ? > > FEATURES Location/Qualifiers > > ? 
> > gene 1..1437 > > /gene="cmg2a" > > CDS 1..1437 > > /gene="cmg2a" > > /note="cell surface receptor; similar to anthrax toxin > > receptor 2 (ANTXR2, ATR2, CMG2)" > > /codon_start=1 > > /product="capillary morphogenesis protein 2A" > > /protein_id="ABD74633.1" > > /db_xref="GI:89513613" > > /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS > > GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS > > EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY > > GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS > > SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT > > VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC > > CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS > > TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL > > RRQYDRVSVMRPTSADKGRCMNFSRTQH" > > ORIGIN > > 1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc > > 61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg > > 121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat > > 181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga > > 241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc > > 301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact > > 361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga > > 421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat > > 481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg > > 541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc > > 601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc > > 661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa > > 721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc > > 781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa > > 841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc > > 901 actgcctctt catgttcgga cggcacagtg gtggccattg 
tgttcttggt gctttttctc > > 961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa > > 1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc > > 1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc > > 1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag > > 1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag > > 1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga > > 1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg > > 1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa > > // > > > > ============================ > > > > > > On 6/5/06, Richard Holland wrote: > > > Hi again. > > > > > > Could you remove the offending question mark from the GenBank file and > > > try it again to see if that fixes it? The parser should just ignore it > > > but apparently not. The error looks weird to me because the tokenization > > > for a DNA GenBank file _does_ contain the letter 't'! Not sure what's > > > going on here. > > ... 
> > >
> > > cheers,
> > > Richard
> > >
> > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote:
> > > > Hello again Richard,
> > > >
> > > > No sooner had I reported that the last parsing exception was fixed than
> > > > another one came up with the Genbank format:
> > > > --------------------------------------
> > > > org.biojava.bio.seq.io.ParseException: DQ431065
> > > > org.biojava.bio.BioException: Could not read sequence
> > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151)
> > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246)
> > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:326)
> > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065
> > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > ... 3 more
> > > > org.biojava.bio.seq.io.ParseException:
> > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization
> > > > doesn't contain character: 't'
> > > > ----------------------------------------
> > > > The Genbank file that caused it is as follows:
> > > > =========================================
> > > > LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006
> > > > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial
> > > > sequence; mitochondrial.
> > > > ACCESSION DQ431065
> > > > VERSION DQ431065.1 GI:90102206
> > > > KEYWORDS .
> > > > SOURCE Vaccinium corymbosum
> > > > ORGANISM Vaccinium corymbosum
> > > > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
> > > > Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
> > > > asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae;
> > > > Vaccinium.
> > > > ?
> > > > REFERENCE 2 (bases 1 to 425)
> > > > AUTHORS Naik,L.D. and Rowland,L.J.
> > > > TITLE Expressed Sequence Tags of cDNA clones from subtracted library of
> > > > Vaccinium corymbosum
> > > > JOURNAL Unpublished (2005)
> > > > FEATURES Location/Qualifiers
> > > > source 1..425
> > > > /organism="Vaccinium corymbosum"
> > > > /mol_type="genomic DNA"
> > > > /cultivar="Bluecrop"
> > > > /db_xref="taxon:69266"
> > > > /tissue_type="Flower buds"
> > > > /clone_lib="Subtracted cDNA library of Vaccinium
> > > > corymbosum"
> > > > /dev_stage="399 hour chill unit exposure"
> > > > /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I"
> > > > rRNA <1..>425
> > > > /product="16S ribosomal RNA"
> > > > ORIGIN
> > > > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac
> > > > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt
> > > > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt
> > > > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta
> > > > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat
> > > > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta
> > > > 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
> > > > 421 cgtaa
> > > > //
> > > > ==================================
> > > > I think it's the presence of the '?' at the beginning of the line.
> > > > I'm not sure whether the information that was supposed to be present
> > > > instead of those question marks is absent from the original ASN.1
> > > > batch file or whether it's a bug in the NCBI ASN2GB software. It looks to me
> > > > like the former is the case, since the file from the NCBI website contains
> > > > much more information than the batch file. Just bringing this to
> > > > everyone's attention.
> > > >
> > > >
> > > > --
> > > > Best Regards,
> > > >
> > > >
> > > > Seth Johnson
> > > > Senior Bioinformatics Associate
> > > >
> > > > Ph: (202) 470-0900
> > > > Fx: (775) 251-0358
> > > >
> > > > On 6/2/06, Richard Holland wrote:
> > > > > Hi Seth.
> > > > > > > > > > Your second point, about the authors string not being read correctly in > > > > > Genbank format, has been fixed (or should have been if I got the code > > > > > right!). Could you check the latest version of biojava-live out of CVS > > > > > and give it another go? Basically the parser did not recognise the > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by > > > > > NCBI, which is what I based the parser on. > > > > ... > > > > > > > > > > cheers, > > > > > Richard > > > > > > > > > > > > > -- > > > Richard Holland (BioMart Team) > > > EMBL-EBI > > > Wellcome Trust Genome Campus > > > Hinxton > > > Cambridge CB10 1SD > > > UNITED KINGDOM > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > -- > Richard Holland (BioMart Team) > EMBL-EBI > Wellcome Trust Genome Campus > Hinxton > Cambridge CB10 1SD > UNITED KINGDOM > Tel: +44-(0)1223-494416 > > -- Best Regards, Seth Johnson Senior Bioinformatics Associate Ph: (202) 470-0900 Fx: (775) 251-0358 From richard.holland at ebi.ac.uk Tue Jun 6 04:50:14 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Tue, 06 Jun 2006 09:50:14 +0100 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149497066.3947.12.camel@texas.ebi.ac.uk> <1149520598.3947.38.camel@texas.ebi.ac.uk> Message-ID: <1149583814.3947.59.camel@texas.ebi.ac.uk> The program used to generate that EMBL file is doing it incorrectly - it is missing the XX tag after the feature table, and is also missing the SQ tag before the sequence begins. If you generated it using BJX then that's my problem to fix so let me know ASAP if that is the case! 
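The two problems described above (no XX line after the feature table, no SQ line before the sequence) could be caught before a record ever reaches the parser. What follows is only a minimal sketch under stated assumptions: `EmblRecordCheck` and `findProblems` are hypothetical names, not part of BioJava or BioJavaX, and the check assumes the usual EMBL flat-file layout in which every annotation line starts with a two-letter tag and sequence data lines start with whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of BioJava or BioJavaX.
final class EmblRecordCheck {

    private EmblRecordCheck() {
    }

    // Scans one EMBL record (up to its "//" terminator) and returns a
    // description of each structural problem found.
    static List<String> findProblems(String record) {
        List<String> problems = new ArrayList<String>();
        boolean inFeatureTable = false;
        boolean seenSq = false;
        boolean reportedMissingSq = false;

        for (String line : record.split("\\r?\\n")) {
            if (line.startsWith("//")) {
                break; // end of record
            }
            if (line.trim().length() == 0) {
                continue; // ignore blank lines
            }
            String tag = line.length() >= 2 ? line.substring(0, 2) : line;

            if (tag.equals("FH") || tag.equals("FT")) {
                inFeatureTable = true;
                continue;
            }
            if (inFeatureTable) {
                // First line after the feature table: expected to be "XX".
                if (!tag.equals("XX")) {
                    problems.add("no XX line after the feature table");
                }
                inFeatureTable = false;
            }
            if (tag.equals("SQ")) {
                seenSq = true;
                continue;
            }
            // Assumption: sequence data lines start with whitespace.
            if (Character.isWhitespace(line.charAt(0)) && !seenSq && !reportedMissingSq) {
                problems.add("sequence data before any SQ line");
                reportedMissingSq = true;
            }
        }
        return problems;
    }
}
```

An empty list means neither problem was seen. The check relies on leading whitespace being intact on the sequence lines, so it should be run on the file itself rather than on a copy pasted into mail.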
cheers, Richard On Mon, 2006-06-05 at 11:53 -0400, Seth Johnson wrote: > :) I got another one for you: > ========================= > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > Caused by: java.lang.StringIndexOutOfBoundsException: String index out > of range: -3 > at java.lang.String.substring(String.java:1768) > at java.lang.String.substring(String.java:1735) > at org.biojavax.bio.seq.io.EMBLFormat.readSection(EMBLFormat.java:672) > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:281) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > ... 1 more > Java Result: -1 > ========================= > File used to produce the above: > ~~~~~~~~~~~~~~~~~~~~~~~~~ > ID DQ472184 standard; DNA; INV; 546 BP. > XX > AC DQ472184; > XX > SV DQ472184.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > DE complete cds. > XX > KW . > XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum. > XX > RN [1] > RP 1-546 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-546 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..546 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>546 > FT /gene="ARC21" > FT /note="TcARC21" > FT mRNA <1..>546 > FT /gene="ARC21" > FT /product="actin-related protein 3" > FT CDS 1..546 > FT /gene="ARC21" > FT /note="actin-binding protein; ARPC3 21 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 3" > FT /protein_id="ABF13401.1" > FT /db_xref="GI:93360014" > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > FT FPEKDGTGNKFWMAFAKRPFLASS" > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > agttag 546 > // > ID DQ472185 standard; DNA; INV; 543 BP. > XX > AC DQ472185; > XX > SV DQ472185.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > DE complete cds. > XX > KW . > XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum. 
> XX > RN [1] > RP 1-543 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-543 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..543 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>543 > FT /gene="ARC20" > FT /note="TcARC20" > FT mRNA <1..>543 > FT /gene="ARC20" > FT /product="actin-related protein 4" > FT CDS 1..543 > FT /gene="ARC20" > FT /note="actin-binding protein; ARPC4 20 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 4" > FT /protein_id="ABF13402.1" > FT /db_xref="GI:93360016" > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > FT MKLNVNQRARRAAMEFFLALNFT" > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > tga 543 > // > > ~~~~~~~~~~~~~~~~~~~~~~~~~ > > On 6/5/06, Richard Holland wrote: > > Doh! 
> >
> > I am in desperate need of coffee methinks... that's the second error in
> > EMBLFormat directly related to me being stupid when I cut-and-pasted the
> > stuff for the new 87+ ID line format...
> >
> > Should be fixed now in CVS (as of about 30 seconds ago).
> >
> > cheers,
> > Richard
> >
> > On Mon, 2006-06-05 at 11:05 -0400, Seth Johnson wrote:
> > > Hi Richard,
> > >
> > > I got another exception on the EMBL format:
> > > =============================
> > > org.biojava.bio.BioException: Could not read sequence
> > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:347)
> > > Caused by: java.lang.IllegalStateException: No match found
> > > at java.util.regex.Matcher.group(Matcher.java:461)
> > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:311)
> > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > ... 1 more
> > > Java Result: -1
> > > =============================
> > > I used the same file from the test directory (AY069118.em).
> > >
> > >
> > > Seth
> > >
> > > On 6/5/06, Richard Holland wrote:
> > > > This one should be fixed in CVS now. Typo on my behalf - I put in code
> > > > to make it work with both 87+ and pre-87 versions of EMBL, then got the
> > > > regexes the wrong way round!!
> > > >
> > > ...
> > > >
> > > > cheers,
> > > > Richard
> > > >
> > > >
> > > > On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote:
> > > > > Hi Richard,
> > > > >
> > > > > I made sure I have the latest source code from CVS compiled
> > > > > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy
> > > > > to report that the GenBank issue is solved!!!!
> > > > > As for EMBL parsing, I apologize for not providing the stack dump
> > > > > for ISSUE #1.
Here's the dump of the exception:
> > > > > --------------------------------------------------------
> > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:359)
> > > > > Caused by: java.lang.NumberFormatException: null
> > > > > at java.lang.Integer.parseInt(Integer.java:415)
> > > > > at java.lang.Integer.parseInt(Integer.java:497)
> > > > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299)
> > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > ... 1 more
> > > > > Java Result: -1
> > > > > -------------------------------------------------------
> > > > > Here, again, is the code that I'm using to parse:
> > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > BufferedReader gbBR = null;
> > > > > try {
> > > > >     gbBR = new BufferedReader(new
> > > > >         FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb"));
> > > > > } catch (FileNotFoundException fnfe) {
> > > > >     fnfe.printStackTrace();
> > > > >     System.exit(-1);
> > > > > }
> > > > > Namespace gbNspace = (Namespace)
> > > > >     RichObjectFactory.getObject(SimpleNamespace.class, new
> > > > >     Object[]{"gbSpace"} );
> > > > > RichSequenceIterator gbSeqs =
> > > > >     RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace);
> > > > > while (gbSeqs.hasNext()) {
> > > > >     try {
> > > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > > >         NCBITaxon myTaxon = rs.getTaxon();
> > > > >     } catch (BioException be) {
> > > > >         be.printStackTrace();
> > > > >         System.exit(-1);
> > > > >     }
> > > > > }
> > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > And here's the EMBL file that I'm trying to parse:
> > > > > +++++++++++++++++++++++++
> > > > > ID DQ472184 standard; DNA; INV; 546 BP.
> > > > > XX > > > > > AC DQ472184; > > > > > XX > > > > > SV DQ472184.1 > > > > > DT 15-MAY-2006 > > > > > XX > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > > > DE complete cds. > > > > > XX > > > > > KW . > > > > > XX > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > OC Schizotrypanum. > > > > > XX > > > > > RN [1] > > > > > RP 1-546 > > > > > RA De Melo L.D.B.; > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > RL Unpublished. > > > > > XX > > > > > RN [2] > > > > > RP 1-546 > > > > > RA De Melo L.D.B.; > > > > > RT ; > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > RL 21949-900, Brazil > > > > > XX > > > > > FH Key Location/Qualifiers > > > > > FH > > > > > FT source 1..546 > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > FT /mol_type="genomic DNA" > > > > > FT /strain="CL Brener" > > > > > FT /db_xref="taxon:353153" > > > > > FT gene <1..>546 > > > > > FT /gene="ARC21" > > > > > FT /note="TcARC21" > > > > > FT mRNA <1..>546 > > > > > FT /gene="ARC21" > > > > > FT /product="actin-related protein 3" > > > > > FT CDS 1..546 > > > > > FT /gene="ARC21" > > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > > FT member of Arp2/3 complex" > > > > > FT /codon_start=1 > > > > > FT /product="actin-related protein 3" > > > > > FT /protein_id="ABF13401.1" > > > > > FT /db_xref="GI:93360014" > > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > > atgcacagca 
ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > > agttag 546 > > > > > // > > > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > > > XX > > > > > AC DQ472185; > > > > > XX > > > > > SV DQ472185.1 > > > > > DT 15-MAY-2006 > > > > > XX > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > > DE complete cds. > > > > > XX > > > > > KW . > > > > > XX > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > OC Schizotrypanum. > > > > > XX > > > > > RN [1] > > > > > RP 1-543 > > > > > RA De Melo L.D.B.; > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > RL Unpublished. > > > > > XX > > > > > RN [2] > > > > > RP 1-543 > > > > > RA De Melo L.D.B.; > > > > > RT ; > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > RL 21949-900, Brazil > > > > > XX > > > > > FH Key Location/Qualifiers > > > > > FH > > > > > FT source 1..543 > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > FT /mol_type="genomic DNA" > > > > > FT /strain="CL Brener" > > > > > FT /db_xref="taxon:353153" > > > > > FT gene <1..>543 > > > > > FT /gene="ARC20" > > > > > FT /note="TcARC20" > > > > > FT mRNA <1..>543 > > > > > FT /gene="ARC20" > > > > > FT /product="actin-related protein 4" > > > > > FT CDS 1..543 > > > > > FT /gene="ARC20" > > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > > FT member of Arp2/3 complex" > > > > > FT /codon_start=1 > > > > > FT /product="actin-related protein 4" > > > > > FT /protein_id="ABF13402.1" > > > > > FT /db_xref="GI:93360016" > > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > > > tga 543 > > > > > // > > > > > +++++++++++++++++++++++++++++++++ > > 
> > > > > > > > It looks to me like there's some kind of problem with parsing the > > > > > sequence version number. I even tried the sequence from test directory > > > > > (AY069118.em) with same outcome. > > > > > > > > > > Regards, > > > > > > > > > > Seth > > > > > > > > > > On 6/2/06, Richard Holland wrote: > > > > > > Hi Seth. > > > > > > > > > > > > Your second point, about the authors string not being read correctly in > > > > > > Genbank format, has been fixed (or should have been if I got the code > > > > > > right!). Could you check the latest version of biojava-live out of CVS > > > > > > and give it another go? Basically the parser did not recognise the > > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by > > > > > > NCBI, which is what I based the parser on. > > > > > > > > > > > > I've set it up now so that it reads the CONSRTM tag, but the value is > > > > > > merged with the authors tag with (consortium) appended. There will still > > > > > > be problems if the consortium value has commas in it - not sure how to > > > > > > fix this yet. > > > > > > > > > > > > Your first point is harder to solve because you did not provide a > > > > > > complete stack trace for the exceptions you are getting. The complete > > > > > > stack trace would enable me to identify exactly where things are going > > > > > > wrong and give me a better chance of fixing them. Could you send the > > > > > > stack trace, and I'll see what I can do. > > > > > > > > > > > > cheers, > > > > > > Richard > > > > > > > > > > > > > > > > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote: > > > > > > > Hi All, > > > > > > > > > > > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > > > > > > > clarification on several issues that I'm having. 
> > > > > > > I am developing a parser that would take as input "NCBI Incremental > > > > > > > ASN.1 Sequence Updates to Genbank" files ( > > > > > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > > > > > > ASN2GB converter ( > > > > > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > > > > > > resulting sequences to a format parsable by BioJava(X) ( > > > > > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > > > > > > my problems start. > > > > > > > > > > > > > > ISSUE 1: > > > > > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > > > > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > > > > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > > > > > > > format is recognized by the > > > > > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > > > > > > some exceptions that I'll describe in issue #2. This is the code that > > > > > > > I'm using to parse, for example, the EMBL output: > > > > > > > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > > > > > > Namespace gbNspace = (Namespace) > > > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > > > Object[]{"gbSpace"} ); > > > > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > > > > > > while (gbSeqs.hasNext()) { > > > > > > > try { > > > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > > > // Further processing or RichSequence object from here > > > > > > > > > > > > > > } catch (BioException be){ > > > > > > > be.printStackTrace(); > > > > > > > } > > > > > > > } > > > > > > > > > > > > > > The multi-sequence EMBL file looks like this: > > > > > > > --------------------------------------------------------------------------------- > > > > > > > ID DQ472184 standard; DNA; INV; 546 BP. 
> > > > > > > XX > > > > > > > AC DQ472184; > > > > > > > XX > > > > > > > SV DQ472184.1 > > > > > > > DT 15-MAY-2006 > > > > > > > XX > > > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > > > > > DE complete cds. > > > > > > > XX > > > > > > > KW . > > > > > > > XX > > > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > > > OC Schizotrypanum. > > > > > > > XX > > > > > > > RN [1] > > > > > > > RP 1-546 > > > > > > > RA De Melo L.D.B.; > > > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > > > RL Unpublished. > > > > > > > XX > > > > > > > RN [2] > > > > > > > RP 1-546 > > > > > > > RA De Melo L.D.B.; > > > > > > > RT ; > > > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > > > RL 21949-900, Brazil > > > > > > > XX > > > > > > > FH Key Location/Qualifiers > > > > > > > FH > > > > > > > FT source 1..546 > > > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > > > FT /mol_type="genomic DNA" > > > > > > > FT /strain="CL Brener" > > > > > > > FT /db_xref="taxon:353153" > > > > > > > FT gene <1..>546 > > > > > > > FT /gene="ARC21" > > > > > > > FT /note="TcARC21" > > > > > > > FT mRNA <1..>546 > > > > > > > FT /gene="ARC21" > > > > > > > FT /product="actin-related protein 3" > > > > > > > FT CDS 1..546 > > > > > > > FT /gene="ARC21" > > > > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > > > > FT member of Arp2/3 complex" > > > > > > > FT /codon_start=1 > > > > > > > FT /product="actin-related protein 3" > > > > > > > FT /protein_id="ABF13401.1" > > > > > > > FT /db_xref="GI:93360014" > > > > > > > FT 
/translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > > > > agttag 546 > > > > > > > // > > > > > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > > > > > XX > > > > > > > AC DQ472185; > > > > > > > XX > > > > > > > SV DQ472185.1 > > > > > > > DT 15-MAY-2006 > > > > > > > XX > > > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > > > > DE complete cds. > > > > > > > XX > > > > > > > KW . > > > > > > > XX > > > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > > > OC Schizotrypanum. > > > > > > > XX > > > > > > > RN [1] > > > > > > > RP 1-543 > > > > > > > RA De Melo L.D.B.; > > > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > > > RL Unpublished. > > > > > > > XX > > > > > > > RN [2] > > > > > > > RP 1-543 > > > > > > > RA De Melo L.D.B.; > > > > > > > RT ; > > > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > > > RL 21949-900, Brazil > > > > > > > XX > > > > > > > FH Key Location/Qualifiers > > > > > > > FH > > > > > > > FT source 1..543 > > > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > > > FT /mol_type="genomic DNA" > > > > > > > FT /strain="CL Brener" > > > > > > > FT /db_xref="taxon:353153" > > > > > > > FT gene <1..>543 > > > > > > > FT /gene="ARC20" > > > > > > > FT /note="TcARC20" > > > > > > > FT mRNA <1..>543 > > > > > > > FT /gene="ARC20" > > > > > > > FT /product="actin-related protein 4" > > > > > > > FT CDS 1..543 > > > > > > > FT /gene="ARC20" > > > > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > > > > FT member of Arp2/3 complex" > > > > > > > FT /codon_start=1 > > > > > > > FT /product="actin-related protein 4" > > > > > > > FT /protein_id="ABF13402.1" > > > > > > > FT /db_xref="GI:93360016" > > > > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > > > 
> aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > > > > > tga 543 > > > > > > > // > > > > > > > ----------------------------------------------------------------------- > > > > > > > I get an exception message "Could Not Read Sequence". Same thing > > > > > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > > > > > > with the following INSDset file (beginning of the file): > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > DQ022078 > > > > > > > 16729 > > > > > > > DNA > > > > > > > linear > > > > > > > ENV > > > > > > > 15-MAY-2006 > > > > > > > 15-MAY-2006 > > > > > > > Uncultured bacterium WWRS-2005 putative > > > > > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > > > > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > > > > > > class C (estA3), putative permease (a3.005), putative transmembrane > > > > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > > > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > > > > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > > > > > > protein (a3.012), putative membrane protease subunit (a3.013), > > > > > > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > > > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > > > > > > hypothetical protein (a3.017) genes, complete cds > > > > > > > DQ022078 > > > > > > > > > > > > > > gb|DQ022078.1| > > > > > > > gi|71842722 > > > > > > > > > > > > > > > > > > > > > ENV > > > > > > > > > > > > > > > > > > > > > > > > > > > > ? > > > > > > > 1..16729 > > > > > > > > > > > > > > Schmeisser,C. > > > > > > > Elend,C. > > > > > > > Streit,W.R. > > > > > > > > > > > > > > Isolation and biochemical characterization > > > > > > > of two novel metagenome derived esterases > > > > > > > Appl. Environ. Microbiol. 
0:0-0 > > > > > > > (2006) > > > > > > > > > > > > > > > > > > > > > ? > > > > > > > 1..16729 > > > > > > > > > > > > > > Schmeisser,C. > > > > > > > Elend,C. > > > > > > > Streit,W.R. > > > > > > > > > > > > > > Submitted (29-APR-2005) to the > > > > > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University > > > > > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > > > > > > Germany > > > > > > > > > > > > > > > > > > > > > > > > > > > > So my question is whether ASN2GB produces output that's > > > > > > > incompatible with the BioJava parsers, or is there a problem with the > > > > > > > sequences themselves, or a problem with the majority of the parsers? > > > > > > > Could it be that I'm using the API incorrectly for the above formats, > > > > > > > although the GenBank parser works as advertised, with some exceptions > > > > > > > below: > > > > > > > > > > > > > > ISSUE #2: > > > > > > > When I try to parse GenBank files using the following code: > > > > > > > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > > > > > > Namespace gbNspace = (Namespace) > > > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > > > Object[]{"gbSpace"} ); > > > > > > > RichSequenceIterator gbSeqs = > > > > > > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > > > > > > while (gbSeqs.hasNext()) { > > > > > > > try { > > > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > > > // Further processing of RichSequence object from here > > > > > > > > > > > > > > } catch (BioException be){ > > > > > > > be.printStackTrace(); > > > > > > > } > > > > > > > } > > > > > > > > > > > > > > Genbank file in question: > > > > > > > > > > > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > > > > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > > > > > > IMAGE:30915482), complete cds.
> > > > > > > ACCESSION BC074905 > > > > > > > VERSION BC074905.2 GI:50959825 > > > > > > > KEYWORDS MGC. > > > > > > > SOURCE Homo sapiens (human) > > > > > > > ORGANISM Homo sapiens > > > > > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > > > > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > > > > > > Catarrhini; Hominidae; Homo. > > > > > > > REFERENCE 1 (bases 1 to 838) > > > > > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > > > > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > > > > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > > > > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > > > > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > > > > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > > > > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > > > > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > > > > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > > > > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > > > > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > > > > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > > > > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > > > > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > > > > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > > > > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > > > > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > > > > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > > > > > > CONSRTM Mammalian Gene Collection Program Team > > > > > > > TITLE Generation and initial analysis of more than 15,000 full-length > > > > > > > human and mouse cDNA sequences > > > > > > > JOURNAL Proc. Natl. Acad. 
Sci. U.S.A. 99 (26), 16899-16903 (2002) > > > > > > > PUBMED 12477932 > > > > > > > REFERENCE 2 (bases 1 to 838) > > > > > > > CONSRTM NIH MGC Project > > > > > > > TITLE Direct Submission > > > > > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > > > > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > > > > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > > > > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. > > > > > > > Contact: MGC help desk > > > > > > > Email: cgapbs-r at mail.nih.gov > > > > > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > > > > > > Center > > > > > > > cDNA Library Preparation: British Columbia Cancer Research Center > > > > > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > > > > > > DNA Sequencing by: Genome Sequence Centre, > > > > > > > BC Cancer Agency, Vancouver, BC, Canada > > > > > > > info at bcgsc.bc.ca > > > > > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > > > > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > > > > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > > > > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > > > > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > > > > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > > > > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > > > > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > > > > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > > > > > > > > > > > Clone distribution: MGC clone distribution information can be found > > > > > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > > > > > > Series: IRBU Plate: 4 Row: C Column: 3. 
> > > > > > > > > > > > > > Differences found between this sequence and the human reference > > > > > > > genome (build 36) are described in misc_difference features below. > > > > > > > FEATURES Location/Qualifiers > > > > > > > source 1..838 > > > > > > > /organism="Homo sapiens" > > > > > > > /mol_type="mRNA" > > > > > > > /db_xref="taxon:9606" > > > > > > > /clone="MGC:104038 IMAGE:30915482" > > > > > > > /tissue_type="Lung, PCR rescued clones" > > > > > > > /clone_lib="NIH_MGC_273" > > > > > > > /lab_host="DH10B" > > > > > > > /note="Vector: pCR4 Topo TA with reversed insert" > > > > > > > gene 1..838 > > > > > > > /gene="KLK14" > > > > > > > /note="synonym: KLK-L6" > > > > > > > /db_xref="GeneID:43847" > > > > > > > /db_xref="HGNC:6362" > > > > > > > /db_xref="IMGT/GENE-DB:6362" > > > > > > > /db_xref="MIM:606135" > > > > > > > CDS 49..804 > > > > > > > /gene="KLK14" > > > > > > > /codon_start=1 > > > > > > > /product="KLK14 protein" > > > > > > > /protein_id="AAH74905.1" > > > > > > > /db_xref="GI:50959826" > > > > > > > /db_xref="GeneID:43847" > > > > > > > /db_xref="HGNC:6362" > > > > > > > /db_xref="IMGT/GENE-DB:6362" > > > > > > > /db_xref="MIM:606135" > > > > > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > > > > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > > > > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > > > > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > > > > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > > > > > > misc_difference 98 > > > > > > > /gene="KLK14" > > > > > > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > > > > > > difference: 'R' in cDNA, 'Q' in the human genome." > > > > > > > misc_difference 133 > > > > > > > /gene="KLK14" > > > > > > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > > > > > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > > > > > > ORIGIN > > > > > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > > > > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > > > > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > > > > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > > > > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > > > > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > > > > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > > > > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > > > > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > > > > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > > > > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > > > > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > > > > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > > > > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > > > > > > // > > > > > > > > > > > > > > I get the following exception: > > > > > > > > > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > > > > > > org.biojava.bio.BioException: Could not read sequence > > > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > > > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > > > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > > > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > > > > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > > > > > > at 
org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > > > > > > > > > ----------------------------------------------------------------------- > > > > > > > > > > > > > > I'm trying to see what could be the problem with this particular > > > > > > > sequence. Looks to me like the AUTHORS portion is not getting parsed > > > > > > > correctly. Any ideas would be greatly appreciated! > > > > > > > > > > > > > -- > > > > > > Richard Holland (BioMart Team) > > > > > > EMBL-EBI > > > > > > Wellcome Trust Genome Campus > > > > > > Hinxton > > > > > > Cambridge CB10 1SD > > > > > > UNITED KINGDOM > > > > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Richard Holland (BioMart Team) > > > > EMBL-EBI > > > > Wellcome Trust Genome Campus > > > > Hinxton > > > > Cambridge CB10 1SD > > > > UNITED KINGDOM > > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > > > > > > -- > > Richard Holland (BioMart Team) > > EMBL-EBI > > Wellcome Trust Genome Campus > > Hinxton > > Cambridge CB10 1SD > > UNITED KINGDOM > > Tel: +44-(0)1223-494416 > > > > > >
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From richard.holland at ebi.ac.uk Tue Jun 6 05:30:01 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Tue, 06 Jun 2006 10:30:01 +0100
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To:
References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149520267.3947.36.camel@texas.ebi.ac.uk> <1149522313.3947.48.camel@texas.ebi.ac.uk>
Message-ID: <1149586202.3947.75.camel@texas.ebi.ac.uk>

I can't find any document detailing the differences between INSDseq XML versions 1.3 and 1.4, so I've asked the guys over in the data library section here to see if they have one
or can produce one for me. They wrote it so they should know! Once I have this I'll get the INSDseq parser up-to-date. (I could go through the DTDs by hand and work it all out manually, but that would take rather longer than I've got time for at the moment!)

It's a bit of a pain trying to keep the parsers up-to-date all the time, especially when people start wanting backwards-compatibility. Does anyone have any bright ideas as to how to manage version changes in file formats?

cheers,
Richard

On Mon, 2006-06-05 at 12:28 -0400, Seth Johnson wrote:
> I agree with you on that one. However, the problem might be a little
> deeper. The same '?' appears in the INSDseq format bounded by
> tags and causes the following exception.
> This tells me that the '?' are actually values that are being
> incorrectly parsed. Further examination of the .dtd reveals that
> INSDseqFormat.java is tailored towards the INSDSeq v. 1.3 whereas the
> files I obtain are in the INSDSeq v. 1.4 (which among other things
> contains a new tag ). Here are links to both
> .dtd's:
>
> http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt
>
> http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt
>
> I think it might be worth accommodating changes for the INSDseq
> format, not sure how that would affect the '?' in Genbank.
>
> Seth
>
> ======================
> org.biojava.bio.BioException: Could not read sequence
> at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> Caused by: org.biojava.bio.seq.io.ParseException:
> org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250)
> at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> ... 1 more
> Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901) > at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) > at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241) > at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685) > at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368) > at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834) > at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) > at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148) > at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242) > at javax.xml.parsers.SAXParser.parse(SAXParser.java:375) > at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97) > at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246) > ... 2 more > Java Result: -1 > ====================== > > ~~~~~~~~~~~~~~~~~~~~~~ > > > ? > 1..16732 > > Bjornerfeldt,S. > Webster,M.T. > Vila,C. > > Relaxation of Selective Constraint on Dog > Mitochondrial DNA Following Domestication > Unpublished > > > ? > 1..16732 > > Bjornerfeldt,S. > Webster,M.T. > Vila,C. > > Submitted (06-APR-2006) to the > EMBL/GenBank/DDBJ databases. Evolutionary Biology, Evolutionary > Biology, Norbyvagen 18D, Uppsala 752 36, > Sweden > > > ~~~~~~~~~~~~~~~~~~~~~~ > > On 6/5/06, Richard Holland wrote: > > Hmmm... interesting. 
I _could_ put in a special case that ignores the > > question marks, but that wouldn't be 'nice' really - this is more of a > > problem with the program that is producing the Genbank files than a > > problem with the parser trying to read them. '?' is not a valid tag in > > the official Genbank format, and has no meaning attached to it that I > > can work out, so I'm reluctant to make the parser recognise it. > > > > I'd suggest you contact the people who write the software you are using > > to produce the Genbank files and ask them if they could stick to the > > rules! > > > > In the meantime you could work around the problem by stripping the > > question marks in some kind of pre-processor before passing it onto > > BioJavaX for parsing. > > > > cheers, > > Richard > > > > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote: > > > Removing '?' (or several of them in my case) avoids the following exception: > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > org.biojava.bio.BioException: Could not read sequence > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957 > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > ... 1 more > > > Java Result: -1 > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > I don't know where that previous tokenization problem came from since > > > I can no longer reproduce it. This time it's more or less straight > > > forward. > > > Here's the original file with question marks: > > > ============================ > > > LOCUS DQ415957 1437 bp mRNA linear VRT 01-JUN-2006 > > > DEFINITION Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA, > > > complete cds. > > > ACCESSION DQ415957 > > > VERSION DQ415957.1 GI:89513612 > > > KEYWORDS . 
> > > SOURCE Unknown. > > > ORGANISM Unknown. > > > Unclassified. > > > ? > > > ? > > > FEATURES Location/Qualifiers > > > ? > > > gene 1..1437 > > > /gene="cmg2a" > > > CDS 1..1437 > > > /gene="cmg2a" > > > /note="cell surface receptor; similar to anthrax toxin > > > receptor 2 (ANTXR2, ATR2, CMG2)" > > > /codon_start=1 > > > /product="capillary morphogenesis protein 2A" > > > /protein_id="ABD74633.1" > > > /db_xref="GI:89513613" > > > /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS > > > GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS > > > EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY > > > GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS > > > SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT > > > VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC > > > CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS > > > TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL > > > RRQYDRVSVMRPTSADKGRCMNFSRTQH" > > > ORIGIN > > > 1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc > > > 61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg > > > 121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat > > > 181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga > > > 241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc > > > 301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact > > > 361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga > > > 421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat > > > 481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg > > > 541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc > > > 601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc > > > 661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa > > > 721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc > > > 781 
aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa > > > 841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc > > > 901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc > > > 961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa > > > 1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc > > > 1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc > > > 1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag > > > 1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag > > > 1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga > > > 1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg > > > 1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa > > > // > > > > > > ============================ > > > > > > > > > On 6/5/06, Richard Holland wrote: > > > > Hi again. > > > > > > > > Could you remove the offending question mark from the GenBank file and > > > > try it again to see if that fixes it? The parser should just ignore it > > > > but apparently not. The error looks weird to me because the tokenization > > > > for a DNA GenBank file _does_ contain the letter 't'! Not sure what's > > > > going on here. > > > ... 
> > > > > > > > cheers, > > > > Richard > > > > > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote: > > > > > Hello again Richard, > > > > > > > > > > No sooner had I reported the fix for the last parsing exception than > > > > > another one came up with the Genbank format: > > > > > -------------------------------------- > > > > > org.biojava.bio.seq.io.ParseException: DQ431065 > > > > > org.biojava.bio.BioException: Could not read sequence > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > > at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) > > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) > > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:326) > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 > > > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > ... 3 more > > > > > org.biojava.bio.seq.io.ParseException: > > > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization > > > > > doesn't contain character: 't' > > > > > ---------------------------------------- > > > > > The Genbank file that caused it is as follows: > > > > > ========================================= > > > > > LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006 > > > > > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial > > > > > sequence; mitochondrial. > > > > > ACCESSION DQ431065 > > > > > VERSION DQ431065.1 GI:90102206 > > > > > KEYWORDS . > > > > > SOURCE Vaccinium corymbosum > > > > > ORGANISM Vaccinium corymbosum > > > > > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; > > > > > Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; > > > > > asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae; > > > > > Vaccinium. > > > > > ?
> > > > > REFERENCE 2 (bases 1 to 425) > > > > > AUTHORS Naik,L.D. and Rowland,L.J. > > > > > TITLE Expressed Sequence Tags of cDNA clones from subtracted library of > > > > > Vaccinium corymbosum > > > > > JOURNAL Unpublished (2005) > > > > > FEATURES Location/Qualifiers > > > > > source 1..425 > > > > > /organism="Vaccinium corymbosum" > > > > > /mol_type="genomic DNA" > > > > > /cultivar="Bluecrop" > > > > > /db_xref="taxon:69266" > > > > > /tissue_type="Flower buds" > > > > > /clone_lib="Subtracted cDNA library of Vaccinium > > > > > corymbosum" > > > > > /dev_stage="399 hour chill unit exposure" > > > > > /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I" > > > > > rRNA <1..>425 > > > > > /product="16S ribosomal RNA" > > > > > ORIGIN > > > > > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac > > > > > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt > > > > > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt > > > > > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta > > > > > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat > > > > > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta > > > > > 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag > > > > > 421 cgtaa > > > > > // > > > > > ================================== > > > > > I think it's the presence of the '?' at the beginning of the line. > > > > > I'm not sure whether the information that was supposed to be present > > > > > instead of those question marks is absent from the original ASN.1 > > > > > batch file or whether it's a bug in the NCBI ASN2GB software. It looks to me > > > > > that the former is the case since the file from the NCBI website contains > > > > > much more information than the batch file. Just bringing this to > > > > > everyone's attention.
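
One way to work around the stray '?' lines, along the lines of the pre-processor idea raised elsewhere in this thread, is to drop them before the text reaches the parser. What follows is only a minimal sketch under stated assumptions: `QuestionMarkLineFilter` and its methods are hypothetical names (they are not part of BioJava or BioJavaX), and it assumes that a placeholder line contains nothing but '?' characters and surrounding whitespace, as in the records quoted above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/**
 * Hypothetical pre-processor (not part of BioJava/BioJavaX) that removes
 * lines consisting solely of '?' from a flat-file record.
 */
class QuestionMarkLineFilter {

    /** True if the line holds only '?' characters plus optional whitespace. */
    static boolean isPlaceholderLine(String line) {
        String trimmed = line.trim();
        if (trimmed.length() == 0) {
            return false; // blank lines are kept untouched
        }
        for (int i = 0; i < trimmed.length(); i++) {
            if (trimmed.charAt(i) != '?') {
                return false;
            }
        }
        return true;
    }

    /** Returns the text with every placeholder line removed. */
    static String strip(String text) {
        if (text.length() == 0) {
            return "";
        }
        StringBuilder kept = new StringBuilder();
        for (String line : text.split("\r?\n")) {
            if (!isPlaceholderLine(line)) {
                kept.append(line).append('\n');
            }
        }
        return kept.toString();
    }

    /**
     * Reads everything from the given reader, strips placeholder lines and
     * returns a fresh BufferedReader over the cleaned text.
     * Assumption: the whole file fits comfortably in memory.
     */
    static BufferedReader filtered(Reader in) throws IOException {
        BufferedReader lines = new BufferedReader(in);
        StringBuilder all = new StringBuilder();
        String line;
        while ((line = lines.readLine()) != null) {
            all.append(line).append('\n');
        }
        return new BufferedReader(new StringReader(strip(all.toString())));
    }
}
```

The reader returned by `filtered(new FileReader("genbank_output.gb"))` could then be passed to `RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace)` in place of the reader built directly on the file. Because the sketch buffers the whole file, a streaming variant may be preferable for large daily update files.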
> > > > > > > > > > > > > > > -- > > > > > Best Regards, > > > > > > > > > > > > > > > Seth Johnson > > > > > Senior Bioinformatics Associate > > > > > > > > > > Ph: (202) 470-0900 > > > > > Fx: (775) 251-0358 > > > > > > > > > > On 6/2/06, Richard Holland wrote: > > > > > > Hi Seth. > > > > > > > > > > > > Your second point, about the authors string not being read correctly in > > > > > > Genbank format, has been fixed (or should have been if I got the code > > > > > > right!). Could you check the latest version of biojava-live out of CVS > > > > > > and give it another go? Basically the parser did not recognise the > > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by > > > > > > NCBI, which is what I based the parser on. > > > > > ... > > > > > > > > > > > > cheers, > > > > > > Richard > > > > > > > > > > > > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From johnson.biotech at gmail.com Tue Jun 6 08:40:15 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Tue, 6 Jun 2006 08:40:15 -0400 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: <1149583814.3947.59.camel@texas.ebi.ac.uk> References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149497066.3947.12.camel@texas.ebi.ac.uk> <1149520598.3947.38.camel@texas.ebi.ac.uk> <1149583814.3947.59.camel@texas.ebi.ac.uk> Message-ID: I see now! It looks like the ASN2GB converter is taking some liberties with EMBL format. 
I'll try to experiment with command line options of that software and if all else fails get hold of the NCBI developers. On 6/6/06, Richard Holland wrote: > The program used to generate that EMBL file is doing it incorrectly - it > is missing the XX tag after the feature table, and is also missing the > SQ tag before the sequence begins. > > If you generated it using BJX then that's my problem to fix so let me > know ASAP if that is the case! > > cheers, > Richard > > On Mon, 2006-06-05 at 11:53 -0400, Seth Johnson wrote: > > :) I got another one for you: > > ========================= > > org.biojava.bio.BioException: Could not read sequence > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > > Caused by: java.lang.StringIndexOutOfBoundsException: String index out > > of range: -3 > > at java.lang.String.substring(String.java:1768) > > at java.lang.String.substring(String.java:1735) > > at org.biojavax.bio.seq.io.EMBLFormat.readSection(EMBLFormat.java:672) > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:281) > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ... 1 more > > Java Result: -1 > > ========================= > > File used to produce the above: > > ~~~~~~~~~~~~~~~~~~~~~~~~~ > > ID DQ472184 standard; DNA; INV; 546 BP. > > XX > > AC DQ472184; > > XX > > SV DQ472184.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > DE complete cds. > > XX > > KW . > > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-546 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. 
> > XX > > RN [2] > > RP 1-546 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..546 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>546 > > FT /gene="ARC21" > > FT /note="TcARC21" > > FT mRNA <1..>546 > > FT /gene="ARC21" > > FT /product="actin-related protein 3" > > FT CDS 1..546 > > FT /gene="ARC21" > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 3" > > FT /protein_id="ABF13401.1" > > FT /db_xref="GI:93360014" > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > agttag 546 > > // > > ID DQ472185 standard; DNA; INV; 543 BP. 
> > XX > > AC DQ472185; > > XX > > SV DQ472185.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > DE complete cds. > > XX > > KW . > > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-543 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. > > XX > > RN [2] > > RP 1-543 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..543 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>543 > > FT /gene="ARC20" > > FT /note="TcARC20" > > FT mRNA <1..>543 > > FT /gene="ARC20" > > FT /product="actin-related protein 4" > > FT CDS 1..543 > > FT /gene="ARC20" > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 4" > > FT /protein_id="ABF13402.1" > > FT /db_xref="GI:93360016" > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > FT MKLNVNQRARRAAMEFFLALNFT" > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg 
aaagtacgtt 300 > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > tga 543 > > // > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > On 6/5/06, Richard Holland wrote: > > > Doh! > > > > > > I am in desperate need of coffee methinks... that's the second error in > > > EMBLFormat directly related to me being stupid when I cut-and-pasted the > > > stuff for the new 87+ ID line format... > > > > > > Should be fixed now in CVS (as of about 30 seconds ago). > > > > > > cheers, > > > Richard > > > > > > On Mon, 2006-06-05 at 11:05 -0400, Seth Johnson wrote: > > > > Hi Richard, > > > > > > > > I got another exception on EMBL format: > > > > ============================= > > > > org.biojava.bio.BioException: Could not read sequence > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:347) > > > > Caused by: java.lang.IllegalStateException: No match found > > > > at java.util.regex.Matcher.group(Matcher.java:461) > > > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:311) > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > ... 1 more > > > > Java Result: -1 > > > > ============================= > > > > I used the same file from the test directory: (AY069118.em) > > > > > > > > > > > > Seth > > > > > > > > On 6/5/06, Richard Holland wrote: > > > > > This one should be fixed in CVS now. Typo on my part - I put in code > > > > > to make it work with both 87+ and pre-87 versions of EMBL, then got the > > > > > regexes the wrong way round!! > > > > > > > > > ... 
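For illustration, here is a rough sketch of telling the two ID line layouts apart with a pair of regular expressions. The patterns, the class name `EmblIdLine`, and the 87+ sample line are assumptions written for this example; they are not the code in EMBLFormat.java. Checking `matches()` before calling `group()` is what avoids the `IllegalStateException: No match found` seen in the trace above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: these patterns are assumptions, not the ones used by BioJavaX.
class EmblIdLine {

    // Release 87+ layout: accession; SV version; topology; molecule; class; division; length BP.
    static final Pattern NEW_ID = Pattern.compile(
        "ID\\s+(\\S+);\\s+SV\\s+(\\d+);\\s+(\\S+);\\s+([^;]+);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.");

    // Pre-87 layout: name dataclass; molecule; division; length BP.
    static final Pattern OLD_ID = Pattern.compile(
        "ID\\s+(\\S+)\\s+(\\S+);\\s+([^;]+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.");

    /** Returns "87+", "pre-87", or "unknown" for the given ID line. */
    static String layout(String idLine) {
        if (NEW_ID.matcher(idLine).matches()) return "87+";
        if (OLD_ID.matcher(idLine).matches()) return "pre-87";
        return "unknown";
    }

    /** Returns the sequence length from the ID line, or -1 if neither layout matches. */
    static int sequenceLength(String idLine) {
        Matcher m = NEW_ID.matcher(idLine);
        if (m.matches()) return Integer.parseInt(m.group(7));
        m = OLD_ID.matcher(idLine);
        // group() is only called after a successful matches().
        if (m.matches()) return Integer.parseInt(m.group(5));
        return -1;
    }

    public static void main(String[] args) {
        String oldLine = "ID   DQ472184 standard; DNA; INV; 546 BP.";
        // Made-up 87+ style line for the same record, for illustration only.
        String newLine = "ID   DQ472184; SV 1; linear; genomic DNA; STD; INV; 546 BP.";
        System.out.println(layout(oldLine) + " " + sequenceLength(oldLine));
        System.out.println(layout(newLine) + " " + sequenceLength(newLine));
    }
}
```

Trying the more specific 87+ pattern first keeps the two cases from being confused, and returning a sentinel for unrecognised lines lets the caller report the offending ID line instead of failing inside the regex code.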
> > > > > > > > > > cheers, > > > > > Richard > > > > > > > > > > > > > > > On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote: > > > > > > Hi Richard, > > > > > > > > > > > > I made sure I have the latest source code from CVS compiled > > > > > > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy > > > > > > to report that GenBank issue is solved!!!! > > > > > > As far as EMBL parsing, I apologize for not providing the stack dump > > > > > > for ISSUE #1. Here's the dump of the exception: > > > > > > -------------------------------------------------------- > > > > > > org.biojava.bio.BioException: Could not read sequence > > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:359) > > > > > > Caused by: java.lang.NumberFormatException: null > > > > > > at java.lang.Integer.parseInt(Integer.java:415) > > > > > > at java.lang.Integer.parseInt(Integer.java:497) > > > > > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299) > > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > ... 
1 more > > > > > > Java Result: -1 > > > > > > ------------------------------------------------------- > > > > > > Here, again, is the code that I'm using to parse: > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > BufferedReader gbBR = null; > > > > > > try { > > > > > > gbBR = new BufferedReader(new > > > > > > FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb")); > > > > > > } catch (FileNotFoundException fnfe) { > > > > > > fnfe.printStackTrace(); > > > > > > System.exit(-1); > > > > > > } > > > > > > Namespace gbNspace = (Namespace) > > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > > Object[]{"gbSpace"} ); > > > > > > RichSequenceIterator gbSeqs = > > > > > > RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace); > > > > > > while (gbSeqs.hasNext()) { > > > > > > try { > > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > > NCBITaxon myTaxon = rs.getTaxon(); > > > > > > }catch (BioException be){ > > > > > > be.printStackTrace(); > > > > > > System.exit(-1); > > > > > > } > > > > > > } > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > > And here's the EMBL file that I'm trying to parse: > > > > > > +++++++++++++++++++++++++ > > > > > > ID DQ472184 standard; DNA; INV; 546 BP. > > > > > > XX > > > > > > AC DQ472184; > > > > > > XX > > > > > > SV DQ472184.1 > > > > > > DT 15-MAY-2006 > > > > > > XX > > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > > > > DE complete cds. > > > > > > XX > > > > > > KW . > > > > > > XX > > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > > OC Schizotrypanum. > > > > > > XX > > > > > > RN [1] > > > > > > RP 1-546 > > > > > > RA De Melo L.D.B.; > > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > > RL Unpublished. 
> > > > > > XX > > > > > > RN [2] > > > > > > RP 1-546 > > > > > > RA De Melo L.D.B.; > > > > > > RT ; > > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > > RL 21949-900, Brazil > > > > > > XX > > > > > > FH Key Location/Qualifiers > > > > > > FH > > > > > > FT source 1..546 > > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > > FT /mol_type="genomic DNA" > > > > > > FT /strain="CL Brener" > > > > > > FT /db_xref="taxon:353153" > > > > > > FT gene <1..>546 > > > > > > FT /gene="ARC21" > > > > > > FT /note="TcARC21" > > > > > > FT mRNA <1..>546 > > > > > > FT /gene="ARC21" > > > > > > FT /product="actin-related protein 3" > > > > > > FT CDS 1..546 > > > > > > FT /gene="ARC21" > > > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > > > FT member of Arp2/3 complex" > > > > > > FT /codon_start=1 > > > > > > FT /product="actin-related protein 3" > > > > > > FT /protein_id="ABF13401.1" > > > > > > FT /db_xref="GI:93360014" > > > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg 
agagtatgca 420 > > > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > > > agttag 546 > > > > > > // > > > > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > > > > XX > > > > > > AC DQ472185; > > > > > > XX > > > > > > SV DQ472185.1 > > > > > > DT 15-MAY-2006 > > > > > > XX > > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > > > DE complete cds. > > > > > > XX > > > > > > KW . > > > > > > XX > > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > > OC Schizotrypanum. > > > > > > XX > > > > > > RN [1] > > > > > > RP 1-543 > > > > > > RA De Melo L.D.B.; > > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > > RL Unpublished. > > > > > > XX > > > > > > RN [2] > > > > > > RP 1-543 > > > > > > RA De Melo L.D.B.; > > > > > > RT ; > > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > > RL 21949-900, Brazil > > > > > > XX > > > > > > FH Key Location/Qualifiers > > > > > > FH > > > > > > FT source 1..543 > > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > > FT /mol_type="genomic DNA" > > > > > > FT /strain="CL Brener" > > > > > > FT /db_xref="taxon:353153" > > > > > > FT gene <1..>543 > > > > > > FT /gene="ARC20" > > > > > > FT /note="TcARC20" > > > > > > FT mRNA <1..>543 > > > > > > FT /gene="ARC20" > > > > > > FT /product="actin-related protein 4" > > > > > > FT CDS 1..543 > > > > > > FT /gene="ARC20" > > > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > > > FT member of Arp2/3 complex" > > > > > > FT /codon_start=1 > > > > > > FT /product="actin-related protein 4" > > > > > > FT /protein_id="ABF13402.1" > > > > > > FT /db_xref="GI:93360016" > > > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > 
> > > > tga 543 > > > > > > // > > > > > > +++++++++++++++++++++++++++++++++ > > > > > > > > > > > > It looks to me like there's some kind of problem with parsing the > > > > > > sequence version number. I even tried the sequence from test directory > > > > > > (AY069118.em) with same outcome. > > > > > > > > > > > > Regards, > > > > > > > > > > > > Seth > > > > > > > > > > > > On 6/2/06, Richard Holland wrote: > > > > > > > Hi Seth. > > > > > > > > > > > > > > Your second point, about the authors string not being read correctly in > > > > > > > Genbank format, has been fixed (or should have been if I got the code > > > > > > > right!). Could you check the latest version of biojava-live out of CVS > > > > > > > and give it another go? Basically the parser did not recognise the > > > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by > > > > > > > NCBI, which is what I based the parser on. > > > > > > > > > > > > > > I've set it up now so that it reads the CONSRTM tag, but the value is > > > > > > > merged with the authors tag with (consortium) appended. There will still > > > > > > > be problems if the consortium value has commas in it - not sure how to > > > > > > > fix this yet. > > > > > > > > > > > > > > Your first point is harder to solve because you did not provide a > > > > > > > complete stack trace for the exceptions you are getting. The complete > > > > > > > stack trace would enable me to identify exactly where things are going > > > > > > > wrong and give me a better chance of fixing them. Could you send the > > > > > > > stack trace, and I'll see what I can do. > > > > > > > > > > > > > > cheers, > > > > > > > Richard > > > > > > > > > > > > > > > > > > > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote: > > > > > > > > Hi All, > > > > > > > > > > > > > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > > > > > > > > clarification on several issues that I'm having. 
> > > > > > > > I am developing a parser that would take as input "NCBI Incremental > > > > > > > > ASN.1 Sequence Updates to Genbank" files ( > > > > > > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > > > > > > > ASN2GB converter ( > > > > > > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > > > > > > > resulting sequences to a format parsable by BioJava(X) ( > > > > > > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > > > > > > > my problems start. > > > > > > > > > > > > > > > > ISSUE 1: > > > > > > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > > > > > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > > > > > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > > > > > > > > format is recognized by the > > > > > > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > > > > > > > some exceptions that I'll describe in issue #2. 
This is the code that > > > > > > > > I'm using to parse, for example, the EMBL output: > > > > > > > > > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > > > > > > > Namespace gbNspace = (Namespace) > > > > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > > > > Object[]{"gbSpace"} ); > > > > > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > > > > > > > while (gbSeqs.hasNext()) { > > > > > > > > try { > > > > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > > > > // Further processing or RichSequence object from here > > > > > > > > > > > > > > > > } catch (BioException be){ > > > > > > > > be.printStackTrace(); > > > > > > > > } > > > > > > > > } > > > > > > > > > > > > > > > > The multi-sequence EMBL file looks like this: > > > > > > > > --------------------------------------------------------------------------------- > > > > > > > > ID DQ472184 standard; DNA; INV; 546 BP. > > > > > > > > XX > > > > > > > > AC DQ472184; > > > > > > > > XX > > > > > > > > SV DQ472184.1 > > > > > > > > DT 15-MAY-2006 > > > > > > > > XX > > > > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > > > > > > DE complete cds. > > > > > > > > XX > > > > > > > > KW . > > > > > > > > XX > > > > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > > > > OC Schizotrypanum. > > > > > > > > XX > > > > > > > > RN [1] > > > > > > > > RP 1-546 > > > > > > > > RA De Melo L.D.B.; > > > > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > > > > RL Unpublished. > > > > > > > > XX > > > > > > > > RN [2] > > > > > > > > RP 1-546 > > > > > > > > RA De Melo L.D.B.; > > > > > > > > RT ; > > > > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > > > > RL 21949-900, Brazil > > > > > > > > XX > > > > > > > > FH Key Location/Qualifiers > > > > > > > > FH > > > > > > > > FT source 1..546 > > > > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > > > > FT /mol_type="genomic DNA" > > > > > > > > FT /strain="CL Brener" > > > > > > > > FT /db_xref="taxon:353153" > > > > > > > > FT gene <1..>546 > > > > > > > > FT /gene="ARC21" > > > > > > > > FT /note="TcARC21" > > > > > > > > FT mRNA <1..>546 > > > > > > > > FT /gene="ARC21" > > > > > > > > FT /product="actin-related protein 3" > > > > > > > > FT CDS 1..546 > > > > > > > > FT /gene="ARC21" > > > > > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > > > > > FT member of Arp2/3 complex" > > > > > > > > FT /codon_start=1 > > > > > > > > FT /product="actin-related protein 3" > > > > > > > > FT /protein_id="ABF13401.1" > > > > > > > > FT /db_xref="GI:93360014" > > > > > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > > > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > > > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > > > > > > 
aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > > > > > agttag 546 > > > > > > > > // > > > > > > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > > > > > > XX > > > > > > > > AC DQ472185; > > > > > > > > XX > > > > > > > > SV DQ472185.1 > > > > > > > > DT 15-MAY-2006 > > > > > > > > XX > > > > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > > > > > DE complete cds. > > > > > > > > XX > > > > > > > > KW . > > > > > > > > XX > > > > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > > > > OC Schizotrypanum. > > > > > > > > XX > > > > > > > > RN [1] > > > > > > > > RP 1-543 > > > > > > > > RA De Melo L.D.B.; > > > > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > > > > RL Unpublished. > > > > > > > > XX > > > > > > > > RN [2] > > > > > > > > RP 1-543 > > > > > > > > RA De Melo L.D.B.; > > > > > > > > RT ; > > > > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > > > > RL 21949-900, Brazil > > > > > > > > XX > > > > > > > > FH Key Location/Qualifiers > > > > > > > > FH > > > > > > > > FT source 1..543 > > > > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > > > > FT /mol_type="genomic DNA" > > > > > > > > FT /strain="CL Brener" > > > > > > > > FT /db_xref="taxon:353153" > > > > > > > > FT gene <1..>543 > > > > > > > > FT /gene="ARC20" > > > > > > > > FT /note="TcARC20" > > > > > > > > FT mRNA <1..>543 > > > > > > > > FT /gene="ARC20" > > > > > > > > FT /product="actin-related protein 4" > > > > > > > > FT CDS 1..543 > > > > > > > > FT /gene="ARC20" > > > > > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > > > > > FT member of Arp2/3 complex" > > > > > > > > FT /codon_start=1 > > > > > > > > FT /product="actin-related protein 4" > > > > > > > > FT /protein_id="ABF13402.1" > > > > > > > > FT /db_xref="GI:93360016" > > > > > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > > > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > > > > > 
attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > > > > > > tga 543 > > > > > > > > // > > > > > > > > ----------------------------------------------------------------------- > > > > > > > > I get an exception message "Could Not Read Sequence". Same thing > > > > > > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > > > > > > > with the following INSDset file (beginning of the file): > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > DQ022078 > > > > > > > > 16729 > > > > > > > > DNA > > > > > > > > linear > > > > > > > > ENV > > > > > > > > 15-MAY-2006 > > > > > > > > 15-MAY-2006 > > > > > > > > Uncultured bacterium WWRS-2005 putative > > > > > > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > > > > > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > > > > > > > class C (estA3), putative permease (a3.005), putative transmembrane > > > > > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > > > > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > > > > > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > > > > > > > protein (a3.012), putative membrane protease subunit (a3.013), > > > > > > > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > > > > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > > > > > > > hypothetical protein (a3.017) genes, complete cds > > > > > > > > DQ022078 > > > > > > > > > > > > > > > > gb|DQ022078.1| > > > > > > > > gi|71842722 > > > > > > > > > > > > > > > > > > > > > > > > ENV > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ? > > > > > > > > 1..16729 > > > > > > > > > > > > > > > > Schmeisser,C. > > > > > > > > Elend,C. > > > > > > > > Streit,W.R. 
> > > > > > > > > > > > > > > > Isolation and biochemical characterization > > > > > > > > of two novel metagenome derived esterases > > > > > > > > Appl. Environ. Microbiol. 0:0-0 > > > > > > > > (2006) > > > > > > > > > > > > > > > > > > > > > > > > ? > > > > > > > > 1..16729 > > > > > > > > > > > > > > > > Schmeisser,C. > > > > > > > > Elend,C. > > > > > > > > Streit,W.R. > > > > > > > > > > > > > > > > Submitted (29-APR-2005) to the > > > > > > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University > > > > > > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > > > > > > > Germany > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > So my question is whether ASN2GB produces output that's > > > > > > > > incompatible with the BioJava parsers, whether there is a problem with the > > > > > > > > sequences themselves, or whether the problem lies with the majority of the parsers. > > > > > > > > Could it be that I'm using the API incorrectly for the above formats? > > > > > > > > The GenBank parser works as advertised, with the exceptions described > > > > > > > > below: > > > > > > > > > > > > > > > > ISSUE #2: > > > > > > > > When I try to parse GenBank files using the following code: > > > > > > > > > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > > > > > > > Namespace gbNspace = (Namespace) > > > > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > > > > Object[]{"gbSpace"} ); > > > > > > > > RichSequenceIterator gbSeqs = > > > > > > > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > > > > > > > while (gbSeqs.hasNext()) { > > > > > > > > try { > > > > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > > > > // Further processing of RichSequence object from here > > > > > > > > > > > > > > > > } catch (BioException be){ > > > > > > > > be.printStackTrace(); > > > > > > > > } > > > > > > > > } > > > > > > > > > > > > > > > > Genbank file in question: > > > > > > > > 
> > > > > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > > > > > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > > > > > > > IMAGE:30915482), complete cds. > > > > > > > > ACCESSION BC074905 > > > > > > > > VERSION BC074905.2 GI:50959825 > > > > > > > > KEYWORDS MGC. > > > > > > > > SOURCE Homo sapiens (human) > > > > > > > > ORGANISM Homo sapiens > > > > > > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > > > > > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > > > > > > > Catarrhini; Hominidae; Homo. > > > > > > > > REFERENCE 1 (bases 1 to 838) > > > > > > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > > > > > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > > > > > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > > > > > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > > > > > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > > > > > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > > > > > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > > > > > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > > > > > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > > > > > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > > > > > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > > > > > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > > > > > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > > > > > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > > > > > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > > > > > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > > > > > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > > > > > > > Schnerch,A., Schein,J.E., 
Jones,S.J. and Marra,M.A. > > > > > > > > CONSRTM Mammalian Gene Collection Program Team > > > > > > > > TITLE Generation and initial analysis of more than 15,000 full-length > > > > > > > > human and mouse cDNA sequences > > > > > > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > > > > > > > > PUBMED 12477932 > > > > > > > > REFERENCE 2 (bases 1 to 838) > > > > > > > > CONSRTM NIH MGC Project > > > > > > > > TITLE Direct Submission > > > > > > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > > > > > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > > > > > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > > > > > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. > > > > > > > > Contact: MGC help desk > > > > > > > > Email: cgapbs-r at mail.nih.gov > > > > > > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > > > > > > > Center > > > > > > > > cDNA Library Preparation: British Columbia Cancer Research Center > > > > > > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > > > > > > > DNA Sequencing by: Genome Sequence Centre, > > > > > > > > BC Cancer Agency, Vancouver, BC, Canada > > > > > > > > info at bcgsc.bc.ca > > > > > > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > > > > > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > > > > > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > > > > > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > > > > > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > > > > > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > > > > > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > > > > > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > > > > > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. 
> > > > > > > > > > > > > > > > Clone distribution: MGC clone distribution information can be found > > > > > > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > > > > > > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > > > > > > > > > > > > > Differences found between this sequence and the human reference > > > > > > > > genome (build 36) are described in misc_difference features below. > > > > > > > > FEATURES Location/Qualifiers > > > > > > > > source 1..838 > > > > > > > > /organism="Homo sapiens" > > > > > > > > /mol_type="mRNA" > > > > > > > > /db_xref="taxon:9606" > > > > > > > > /clone="MGC:104038 IMAGE:30915482" > > > > > > > > /tissue_type="Lung, PCR rescued clones" > > > > > > > > /clone_lib="NIH_MGC_273" > > > > > > > > /lab_host="DH10B" > > > > > > > > /note="Vector: pCR4 Topo TA with reversed insert" > > > > > > > > gene 1..838 > > > > > > > > /gene="KLK14" > > > > > > > > /note="synonym: KLK-L6" > > > > > > > > /db_xref="GeneID:43847" > > > > > > > > /db_xref="HGNC:6362" > > > > > > > > /db_xref="IMGT/GENE-DB:6362" > > > > > > > > /db_xref="MIM:606135" > > > > > > > > CDS 49..804 > > > > > > > > /gene="KLK14" > > > > > > > > /codon_start=1 > > > > > > > > /product="KLK14 protein" > > > > > > > > /protein_id="AAH74905.1" > > > > > > > > /db_xref="GI:50959826" > > > > > > > > /db_xref="GeneID:43847" > > > > > > > > /db_xref="HGNC:6362" > > > > > > > > /db_xref="IMGT/GENE-DB:6362" > > > > > > > > /db_xref="MIM:606135" > > > > > > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > > > > > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > > > > > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > > > > > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > > > > > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > > > > > > > misc_difference 98 > > > > > > > > /gene="KLK14" > > > > > > > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > > > > > > > difference: 
'R' in cDNA, 'Q' in the human genome." > > > > > > > > misc_difference 133 > > > > > > > > /gene="KLK14" > > > > > > > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > > > > > > > difference: 'Y' in cDNA, 'H' in the human genome." > > > > > > > > ORIGIN > > > > > > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > > > > > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > > > > > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > > > > > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > > > > > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > > > > > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > > > > > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > > > > > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > > > > > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > > > > > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > > > > > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > > > > > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > > > > > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > > > > > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > > > > > > > // > > > > > > > > > > > > > > > > I get the following exception: > > > > > > > > > > > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > > > > > > > org.biojava.bio.BioException: Could not read sequence > > > > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > > > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > > > > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > > 
> > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > > > > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > > > > > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > > > > > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > > > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > > > > > > > > > > > ----------------------------------------------------------------------- > > > > > > > > > > > > > > > > I'm trying to see what could be the problem with this particular > > > > > > > > sequence. Looks to me like the AUTHORS portion is not getting parsed > > > > > > > > correctly. Any ideas would be greatly appreciated! > > > > > > > > > > > > > > > -- > > > > > > > Richard Holland (BioMart Team) > > > > > > > EMBL-EBI > > > > > > > Wellcome Trust Genome Campus > > > > > > > Hinxton > > > > > > > Cambridge CB10 1SD > > > > > > > UNITED KINGDOM > > > > > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Richard Holland (BioMart Team) > > > > > EMBL-EBI > > > > > Wellcome Trust Genome Campus > > > > > Hinxton > > > > > Cambridge CB10 1SD > > > > > UNITED KINGDOM > > > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > > > > > > > > > > > -- > > > Richard Holland (BioMart Team) > > > EMBL-EBI > > > Wellcome Trust Genome Campus > > > Hinxton > > > Cambridge CB10 1SD > > > UNITED KINGDOM > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > -- > Richard Holland (BioMart Team) > EMBL-EBI > Wellcome Trust Genome Campus > Hinxton > Cambridge CB10 1SD > UNITED KINGDOM > Tel: +44-(0)1223-494416 > > -- Best Regards, Seth Johnson Senior Bioinformatics Associate Ph: (202) 470-0900 Fx: (775) 251-0358 From johnson.biotech at gmail.com Tue Jun 6 10:34:38 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Tue, 6 Jun 2006 10:34:38 -0400 
Subject: [Biojava-l] Parsing INSDseq Sequences (1.3 & 1.4) Message-ID:

I think it would be best to wait for the 'official response'. I could only locate the general changes detailed here:

http://www.bio.net/bionet/mm/genbankb/2005-December/000233.html

As for a solution to the ever-changing formats, I just don't see an elegant way. :( The only thing that comes to mind is creating a separate format, "INSDseq14Format.java", and building new readers & writers on top of that.

#1: On that note, I wanted to ask about the differences between the Genbank & INSDseq parsers and the ways to retrieve certain values. The tutorial states that those two formats are essentially mirror images of each other, with the latter being XML. When parsing Genbank files, "rs.getIdentifier()" retrieves the GI number; however, when the same function is used on a RichSequence obtained by parsing INSDseq format, I get a 'null' value. Moreover, I could not even locate that number in the structure of the RichSequence object while debugging. Is this a bug, or should the GI number be obtained differently?

#2: Also, what is the best way to obtain the "mol_type" value from a RichSequence object? The tutorial states that it's "getNoteSet(Terms.getMolTypeTerm())". I guess it's either a simplified explanation or something has changed, since .getNoteSet() does not take any parameters. I used "rs.getAnnotation().asMap().get(Terms.getMolTypeTerm())" and was wondering if that's how it was intended to be retrieved.

As always, below is the INSDseq file I tried to parse:
================================
AY069118 1502 single mRNA linear INV

17-DEC-2001

15-DEC-2001

Drosophila melanogaster GH13089 full length cDNA

AY069118

AY069118.1

gb|AY069118.1| gi|17861571

FLI_CDNA Drosophila melanogaster (fruit fly) Drosophila melanogaster Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea; Drosophilidae; Drosophila 1 (bases 1 to 1502) 1..1502 Stapleton,M. Brokstein,P. Hong,L. Agbayani,A. Carlson,J. Champe,M. Chavez,C. Dorsett,V. Farfan,D. Frise,E. George,R. Gonzalez,M. Guarin,H. Li,P. Liao,G. Miranda,A. Mungall,C.J. Nunoo,J. Pacleb,J. Paragas,V. Park,S. Phouanenavong,S. Wan,K. Yu,C. Lewis,S.E. Rubin,G.M. Celniker,S. Direct Submission Submitted (10-DEC-2001) Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA Sequence submitted by: Berkeley Drosophila Genome Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This clone was sequenced as part of a high-throughput process to sequence clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000). The sequence has been subjected to integrity checks for sequence accuracy, presence of a polyA tail and contiguity within 100 kb in the genome. Thus we believe the sequence to reflect accurately this particular cDNA clone. However, there are artifacts associated with the generation of cDNA clones that may have not been detected in our initial analyses such as internal priming, priming from contaminating genomic DNA, retained introns due to reverse transcription of unspliced precursor RNAs, and reverse transcriptase errors that result in single base changes. For further information about this sequence, including its location and relationship to other sequences, please visit our Web site (http://fruitfly.berkeley.edu) or send email to cdna at fruitfly.berkeley.edu.

source 1..1502 1 1502 AY069118.1 organism Drosophila melanogaster mol_type mRNA strain y; cn bw sp db_xref taxon:7227 map 39B3-39B3 gene 1..1502 1 1502 AY069118.1 gene E2f2 note alignment with genomic scaffold AE003669 db_xref FLYBASE:FBgn0024371 CDS 189..1301 189 1301 AY069118.1 gene E2f2 note Longest ORF codon_start 1 transl_table 1 product GH13089p protein_id AAL39263.1 db_xref GI:17861572 db_xref FLYBASE:FBgn0024371 translation MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS
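
Regarding question #1 above: the record does carry the GI number in its other-seqids values ("gb|AY069118.1|" and "gi|17861571"). Until the INSDseq parser fills in the identifier, one possible stopgap is to pull the number out of those strings directly. The following is only a minimal sketch under that assumption; `SeqIdExample` and `findGi` are hypothetical names, not part of BioJava, and the seqid strings are assumed to have been read from the file by some other means.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class SeqIdExample {

    // Hypothetical helper (not a BioJava API): returns the number that
    // follows the "gi|" prefix in the first seqid that has one.
    static Optional<String> findGi(List<String> seqIds) {
        for (String id : seqIds) {
            String trimmed = id.trim();
            if (trimmed.startsWith("gi|")) {
                String rest = trimmed.substring(3);
                // Drop anything after a trailing '|' separator, if present.
                int bar = rest.indexOf('|');
                return Optional.of(bar >= 0 ? rest.substring(0, bar) : rest);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The two seqid values from the AY069118 record above.
        List<String> ids = Arrays.asList("gb|AY069118.1|", "gi|17861571");
        System.out.println(findGi(ids).orElse("no GI found")); // prints 17861571
    }
}
```

This says nothing about whether the null from "rs.getIdentifier()" is a parser bug; it only shows that the value is recoverable from the input.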

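For the GenBank flat files with stray '?' lines discussed elsewhere in this thread (quoted below), Richard's suggested workaround is to strip the question marks in a pre-processor before passing the file to BioJavaX. A minimal sketch of such a filter follows. It assumes the offending lines contain nothing but a single '?' plus optional whitespace, as in the DQ415957 example; `QuestionMarkFilter` and `stripQuestionMarkLines` are hypothetical names, not part of BioJava. It does not apply to the INSDseq XML case, where the '?' appears as an element value.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.stream.Collectors;

public class QuestionMarkFilter {

    // Hypothetical pre-processor (not a BioJava API): copies the input
    // line by line and drops every line whose only content is '?'.
    static String stripQuestionMarkLines(BufferedReader in) {
        return in.lines()
                 .filter(line -> !line.trim().equals("?"))
                 .map(line -> line + "\n")
                 .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // A shortened stand-in for a GenBank record with stray '?' lines.
        String raw = "KEYWORDS    .\n?\nFEATURES             Location/Qualifiers\n?\n//\n";
        String cleaned = stripQuestionMarkLines(new BufferedReader(new StringReader(raw)));
        System.out.print(cleaned);
    }
}
```

The cleaned text can then be wrapped in a new BufferedReader(new StringReader(cleaned)) and handed to RichSequence.IOTools.readGenbankDNA(...) in place of the original reader. This only hides the symptom; the data that should have been where the '?' lines are is still missing.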
Message-ID: <1149608578.3947.105.camel@texas.ebi.ac.uk> Hullo. Here is the page where you can manage your subscription to the list, including unsubscribing: http://lists.open-bio.org/mailman/listinfo/biojava-l cheers, Richard On Tue, 2006-06-06 at 11:16 -0400, Luba wrote: > Hey, guys, > Do you know how to unsubscribe from the list? > > Thanks. > > ----- Original Message ----- > From: "Seth Johnson" > To: "Richard Holland" > Cc: > Sent: Tuesday, June 06, 2006 10:34 AM > Subject: Re: [Biojava-l] Parsing INSDseq Sequences (1.3 & 1.4) > > > >I think it would be best to wait for the 'official response'. I could > > only locate the general changes detailed here: > > > > http://www.bio.net/bionet/mm/genbankb/2005-December/000233.html > > > > As for a solution to the ever-changing formats, I just don't see > > an elegant way. :( The only thing that comes to mind is creating a > > separate format, "INSDseq14Format.java", and building new readers & writers > > on top of that. > > > > #1: On that note, I wanted to ask about the differences between the Genbank > > & INSDseq parsers and the ways to retrieve certain values. The tutorial > > states that those two formats are essentially mirror images of each > > other, with the latter being XML. When parsing Genbank files, > > "rs.getIdentifier()" retrieves the GI number; however, when the same > > function is used on a RichSequence obtained by parsing INSDseq format, I > > get a 'null' value. Moreover, I could not even locate that number > > in the structure of the RichSequence object while debugging. Is this a > > bug, or should the GI number be obtained differently? > > > > #2: Also, what is the best way to obtain the "mol_type" value from a > > RichSequence object? The tutorial states that it's > > "getNoteSet(Terms.getMolTypeTerm())". I guess it's either a simplified > > explanation or something has changed, since .getNoteSet() does not take > > any parameters. 
I used > > "rs.getAnnotation().asMap().get(Terms.getMolTypeTerm())" and was > > wondering if that's how it was intended to be retrieved. > > > > As always, below is the INSDseq file I tried to parse: > > ================================ > > > > > "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd"> > > > > > > AY069118 > > 1502 > > single > > mRNA > > linear > > INV > > 17-DEC-2001 > > 15-DEC-2001 > > Drosophila melanogaster GH13089 full length > > cDNA > > AY069118 > > AY069118.1 > > > > gb|AY069118.1| > > gi|17861571 > > > > > > FLI_CDNA > > > > Drosophila melanogaster (fruit fly) > > Drosophila melanogaster > > Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; > > Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; > > Ephydroidea; Drosophilidae; Drosophila > > > > > > 1 (bases 1 to > > 1502) > > 1..1502 > > > > Stapleton,M. > > Brokstein,P. > > Hong,L. > > Agbayani,A. > > Carlson,J. > > Champe,M. > > Chavez,C. > > Dorsett,V. > > Farfan,D. > > Frise,E. > > George,R. > > Gonzalez,M. > > Guarin,H. > > Li,P. > > Liao,G. > > Miranda,A. > > Mungall,C.J. > > Nunoo,J. > > Pacleb,J. > > Paragas,V. > > Park,S. > > Phouanenavong,S. > > Wan,K. > > Yu,C. > > Lewis,S.E. > > Rubin,G.M. > > Celniker,S. > > > > Direct Submission > > Submitted (10-DEC-2001) Berkeley > > Drosophila Genome Project, Lawrence Berkeley National Laboratory, One > > Cyclotron Road, Berkeley, CA 94720, USA > > > > > > Sequence submitted by: Berkeley Drosophila Genome > > Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This > > clone was sequenced as part of a high-throughput process to sequence > > clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000). > > The sequence has been subjected to integrity checks for sequence > > accuracy, presence of a polyA tail and contiguity within 100 kb in the > > genome. Thus we believe the sequence to reflect accurately this > > particular cDNA clone. 
However, there are artifacts associated with > > the generation of cDNA clones that may have not been detected in our > > initial analyses such as internal priming, priming from contaminating > > genomic DNA, retained introns due to reverse transcription of > > unspliced precursor RNAs, and reverse transcriptase errors that result > > in single base changes. For further information about this sequence, > > including its location and relationship to other sequences, please > > visit our Web site (http://fruitfly.berkeley.edu) or send email to > > cdna at fruitfly.berkeley.edu. > > > > > > source > > 1..1502 > > > > > > 1 > > 1502 > > AY069118.1 > > > > > > > > > > organism > > Drosophila > > melanogaster > > > > > > mol_type > > mRNA > > > > > > strain > > y; cn bw sp > > > > > > db_xref > > taxon:7227 > > > > > > map > > 39B3-39B3 > > > > > > > > > > gene > > 1..1502 > > > > > > 1 > > 1502 > > AY069118.1 > > > > > > > > > > gene > > E2f2 > > > > > > note > > alignment with genomic scaffold > > AE003669 > > > > > > db_xref > > FLYBASE:FBgn0024371 > > > > > > > > > > CDS > > 189..1301 > > > > > > 189 > > 1301 > > AY069118.1 > > > > > > > > > > gene > > E2f2 > > > > > > note > > Longest ORF > > > > > > codon_start > > 1 > > > > > > transl_table > > 1 > > > > > > product > > GH13089p > > > > > > protein_id > > AAL39263.1 > > > > > > db_xref > > GI:17861572 > > > > > > db_xref > > FLYBASE:FBgn0024371 > > > > > > translation > > > > MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS > > > > > > > > > > > > 
AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTGACTA! TC! > > ACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA > > > > > > ================================ > > On 6/6/06, Richard Holland wrote: > >> I can't find any document detailing the differences between INSDseq XML > >> versions 1.3 and 1.4, so I've asked the guys over in the data library > >> section here to see if they have one or can produce one for me. They > >> wrote it so they should know! > >> > >> Once I have this I'll get the INSDseq parser up-to-date. 
(I could go > >> through the DTDs by hand and work it all out manually, but that would > >> take rather longer than I've got time for at the moment!). > >> > >> It's a bit of a pain trying to keep the parsers up-to-date all the time, > >> especially when people start wanting backwards-compatibility. Does > >> anyone have any bright ideas as to how to manage version changes in file > >> formats? > >> > >> cheers, > >> Richard > >> > >> On Mon, 2006-06-05 at 12:28 -0400, Seth Johnson wrote: > >> > I agree with you on that one. However, the problem might be a little > >> > deeper. Same '?' appear in the INSDseq format bounded by > >> > tags and cause the following exception. > >> > This tells me that the '?' are actually values that are being > >> > incorrectly parsed. Further examination of the .dtd reveals that > >> > INSDseqFormat.java is tailord towards the INSDSeq v. 1.3 whereas the > >> > files I obtain are in the INSDSeq v. 1.4 (which among other things > >> > contain a new tag ). Here're links to both > >> > .dtd's: > >> > > >> > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt > >> > > >> > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt > >> > > >> > I think it might be worth accommodating changes for the INSDseq > >> > format, not sure how that would affect the '?' in Genbank. > >> > > >> > Seth > >> > > >> > ====================== > >> > org.biojava.bio.BioException: Could not read sequence > >> > at > >> > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > >> > at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > >> > Caused by: org.biojava.bio.seq.io.ParseException: > >> > org.biojava.bio.seq.io.ParseException: Bad reference line found: ? > >> > at > >> > org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250) > >> > at > >> > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > >> > ... 
1 more > >> > Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line > >> > found: ? > >> > at > >> > org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901) > >> > at > >> > com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) > >> > at > >> > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241) > >> > at > >> > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685) > >> > at > >> > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368) > >> > at > >> > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834) > >> > at > >> > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) > >> > at > >> > com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148) > >> > at > >> > com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242) > >> > at javax.xml.parsers.SAXParser.parse(SAXParser.java:375) > >> > at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97) > >> > at > >> > org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246) > >> > ... 2 more > >> > Java Result: -1 > >> > ====================== > >> > > >> > ~~~~~~~~~~~~~~~~~~~~~~ > >> > > >> > > >> > ? > >> > 1..16732 > >> > > >> > Bjornerfeldt,S. > >> > Webster,M.T. > >> > Vila,C. > >> > > >> > Relaxation of Selective Constraint on Dog > >> > Mitochondrial DNA Following Domestication > >> > Unpublished > >> > > >> > > >> > ? > >> > 1..16732 > >> > > >> > Bjornerfeldt,S. > >> > Webster,M.T. > >> > Vila,C. > >> > > >> > Submitted (06-APR-2006) to the > >> > EMBL/GenBank/DDBJ databases. 
Evolutionary Biology, Evolutionary > >> > Biology, Norbyvagen 18D, Uppsala 752 36, > >> > Sweden > >> > > >> > > >> > ~~~~~~~~~~~~~~~~~~~~~~ > >> > > >> > On 6/5/06, Richard Holland wrote: > >> > > Hmmm... interesting. I _could_ put in a special case that ignores the > >> > > question marks, but that wouldn't be 'nice' really - this is more of > >> > > a > >> > > problem with the program that is producing the Genbank files than a > >> > > problem with the parser trying to read them. '?' is not a valid tag > >> > > in > >> > > the official Genbank format, and has no meaning attached to it that I > >> > > can work out, so I'm reluctant to make the parser recognise it. > >> > > > >> > > I'd suggest you contact the people who write the software you are > >> > > using > >> > > to produce the Genbank files and ask them if they could stick to the > >> > > rules! > >> > > > >> > > In the meantime you could work around the problem by stripping the > >> > > question marks in some kind of pre-processor before passing it onto > >> > > BioJavaX for parsing. > >> > > > >> > > cheers, > >> > > Richard > >> > > > >> > > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote: > >> > > > Removing '?' (or several of them in my case) avoids the following > >> > > > exception: > >> > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > >> > > > org.biojava.bio.BioException: Could not read sequence > >> > > > at > >> > > > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > >> > > > at > >> > > > exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > >> > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957 > >> > > > at > >> > > > org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > >> > > > at > >> > > > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > >> > > > ... 
1 more > >> > > > Java Result: -1 > >> > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > >> > > > I don't know where that previous tokenization problem came from > >> > > > since > >> > > > I can no longer reproduce it. This time it's more or less straight > >> > > > forward. > >> > > > Here's the original file with question marks: > >> > > > ============================ > >> > > > LOCUS DQ415957 1437 bp mRNA linear VRT > >> > > > 01-JUN-2006 > >> > > > DEFINITION Danio rerio capillary morphogenesis protein 2A (cmg2a) > >> > > > mRNA, > >> > > > complete cds. > >> > > > ACCESSION DQ415957 > >> > > > VERSION DQ415957.1 GI:89513612 > >> > > > KEYWORDS . > >> > > > SOURCE Unknown. > >> > > > ORGANISM Unknown. > >> > > > Unclassified. > >> > > > ? > >> > > > ? > >> > > > FEATURES Location/Qualifiers > >> > > > ? > >> > > > gene 1..1437 > >> > > > /gene="cmg2a" > >> > > > CDS 1..1437 > >> > > > /gene="cmg2a" > >> > > > /note="cell surface receptor; similar to > >> > > > anthrax toxin > >> > > > receptor 2 (ANTXR2, ATR2, CMG2)" > >> > > > /codon_start=1 > >> > > > /product="capillary morphogenesis protein 2A" > >> > > > /protein_id="ABD74633.1" > >> > > > /db_xref="GI:89513613" > >> > > > > >> > > > /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS > >> > > > > >> > > > GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS > >> > > > > >> > > > EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY > >> > > > > >> > > > GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS > >> > > > > >> > > > SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT > >> > > > > >> > > > VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC > >> > > > > >> > > > CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS > >> > > > > >> > > > TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL > >> > > > RRQYDRVSVMRPTSADKGRCMNFSRTQH" > >> > > > ORIGIN > >> > > > 1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt > >> > > > ctgtttatgc > >> > > > 61 ttttcatctt 
ttaaagcgga aaccccatct tgtcatggtg cctacgacct > >> > > > gtactttgtg > >> > > > 121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt > >> > > > tgtcaaaaat > >> > > > 181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt > >> > > > ttcatcaaga > >> > > > 241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg > >> > > > cctgaagacc > >> > > > 301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa > >> > > > attggcaact > >> > > > 361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt > >> > > > gactgatgga > >> > > > 421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc > >> > > > aaggaagtat > >> > > > 481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct > >> > > > agccgatgtg > >> > > > 541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct > >> > > > caaaggcatc > >> > > > 601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc > >> > > > gtccagcgtc > >> > > > 661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt > >> > > > ggggagacaa > >> > > > 721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca > >> > > > aaaaccaacc > >> > > > 781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt > >> > > > tggacagcaa > >> > > > 841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc > >> > > > tttcatcatc > >> > > > 901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt > >> > > > gctttttctc > >> > > > 961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt > >> > > > cgttattaaa > >> > > > 1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga > >> > > > cccggaaccc > >> > > > 1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc > >> > > > tggtggaatc > >> > > > 1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc > >> > > > aagactagag > >> > > > 1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat > >> > > > ggtcaaaaag > >> > > > 1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac > >> > > > accaatcaga > >> > > > 1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt > >> > > > 
ttcagttatg > >> > > > 1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca > >> > > > gcattaa > >> > > > // > >> > > > > >> > > > ============================ > >> > > > > >> > > > > >> > > > On 6/5/06, Richard Holland wrote: > >> > > > > Hi again. > >> > > > > > >> > > > > Could you remove the offending question mark from the GenBank > >> > > > > file and > >> > > > > try it again to see if that fixes it? The parser should just > >> > > > > ignore it > >> > > > > but apparently not. The error looks weird to me because the > >> > > > > tokenization > >> > > > > for a DNA GenBank file _does_ contain the letter 't'! Not sure > >> > > > > what's > >> > > > > going on here. > >> > > > ... > >> > > > > > >> > > > > cheers, > >> > > > > Richard > >> > > > > > >> > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote: > >> > > > > > Hello again Richard, > >> > > > > > > >> > > > > > No sooner had I mentioned the fix for the last parsing exception > >> > > > > > than > >> > > > > > another one came up with Genbank format: > >> > > > > > -------------------------------------- > >> > > > > > org.biojava.bio.seq.io.ParseException: DQ431065 > >> > > > > > org.biojava.bio.BioException: Could not read sequence > >> > > > > > at > >> > > > > > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > >> > > > > > at > >> > > > > > exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) > >> > > > > > at > >> > > > > > exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) > >> > > > > > at > >> > > > > > exonhit.parsers.GenBankParser.main(GenBankParser.java:326) > >> > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 > >> > > > > > at > >> > > > > > org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > >> > > > > > at > >> > > > > > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > >> > > > > > ... 
3 more > >> > > > > > org.biojava.bio.seq.io.ParseException: > >> > > > > > org.biojava.bio.symbol.IllegalSymbolException: This > >> > > > > > tokenization > >> > > > > > doesn't contain character: 't' > >> > > > > > ---------------------------------------- > >> > > > > > The Genbank file that caused it is as follows: > >> > > > > > ========================================= > >> > > > > > LOCUS DQ431065 425 bp DNA linear > >> > > > > > INV 01-JUN-2006 > >> > > > > > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA > >> > > > > > gene, partial > >> > > > > > sequence; mitochondrial. > >> > > > > > ACCESSION DQ431065 > >> > > > > > VERSION DQ431065.1 GI:90102206 > >> > > > > > KEYWORDS . > >> > > > > > SOURCE Vaccinium corymbosum > >> > > > > > ORGANISM Vaccinium corymbosum > >> > > > > > Eukaryota; Viridiplantae; Streptophyta; > >> > > > > > Embryophyta; Tracheophyta; > >> > > > > > Spermatophyta; Magnoliophyta; eudicotyledons; core > >> > > > > > eudicotyledons; > >> > > > > > asterids; Ericales; Ericaceae; Vaccinioideae; > >> > > > > > Vaccinieae; > >> > > > > > Vaccinium. > >> > > > > > ? > >> > > > > > REFERENCE 2 (bases 1 to 425) > >> > > > > > AUTHORS Naik,L.D. and Rowland,L.J. 
> >> > > > > > TITLE Expressed Sequence Tags of cDNA clones from > >> > > > > > subtracted library of > >> > > > > > Vaccinium corymbosum > >> > > > > > JOURNAL Unpublished (2005) > >> > > > > > FEATURES Location/Qualifiers > >> > > > > > source 1..425 > >> > > > > > /organism="Vaccinium corymbosum" > >> > > > > > /mol_type="genomic DNA" > >> > > > > > /cultivar="Bluecrop" > >> > > > > > /db_xref="taxon:69266" > >> > > > > > /tissue_type="Flower buds" > >> > > > > > /clone_lib="Subtracted cDNA library of > >> > > > > > Vaccinium > >> > > > > > corymbosum" > >> > > > > > /dev_stage="399 hour chill unit exposure" > >> > > > > > /note="Vector: pCR4TOPO; Site_1: Eco R I; > >> > > > > > Site_2: Eco R I" > >> > > > > > rRNA <1..>425 > >> > > > > > /product="16S ribosomal RNA" > >> > > > > > ORIGIN > >> > > > > > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga > >> > > > > > agtatggcct gcccgctgac > >> > > > > > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg > >> > > > > > tagcatagtc attagttctt > >> > > > > > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc > >> > > > > > tgtcttaatt ttgaattgtt > >> > > > > > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt > >> > > > > > tatgggacga gaagacccta > >> > > > > > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag > >> > > > > > ggctcactgg gccgtctaat > >> > > > > > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct > >> > > > > > cctttttatt attatattta > >> > > > > > 361 tttatattta tttgatccat ttattttgat tgtaagatta > >> > > > > > aattacctta gggataacag > >> > > > > > 421 cgtaa > >> > > > > > // > >> > > > > > ================================== > >> > > > > > I think it's the presence of the '?' at the beginning of the > >> > > > > > line?!?! > >> > > > > > I'm not sure wether the information that was supposed to be > >> > > > > > present > >> > > > > > instead of those question marks is absent from the original > >> > > > > > ASN.1 > >> > > > > > batch file or it's a bug in the NCBI ASN2GO software. 
It looks > >> > > > > > to me > >> > > > > > that the former is the case since the file from NCBI website > >> > > > > > contains > >> > > > > > much more information than the batch file. Just bringing this > >> > > > > > to > >> > > > > > everyone's attention. > >> > > > > > > >> > > > > > > >> > > > > > -- > >> > > > > > Best Regards, > >> > > > > > > >> > > > > > > >> > > > > > Seth Johnson > >> > > > > > Senior Bioinformatics Associate > >> > > > > > > >> > > > > > Ph: (202) 470-0900 > >> > > > > > Fx: (775) 251-0358 > >> > > > > > > >> > > > > > On 6/2/06, Richard Holland wrote: > >> > > > > > > Hi Seth. > >> > > > > > > > >> > > > > > > Your second point, about the authors string not being read > >> > > > > > > correctly in > >> > > > > > > Genbank format, has been fixed (or should have been if I got > >> > > > > > > the code > >> > > > > > > right!). Could you check the latest version of biojava-live > >> > > > > > > out of CVS > >> > > > > > > and give it another go? Basically the parser did not > >> > > > > > > recognise the > >> > > > > > > CONSRTM tag, as it is not mentioned in the sample record > >> > > > > > > provided by > >> > > > > > > NCBI, which is what I based the parser on. > >> > > > > > ... 
> >> > > > > > > > >> > > > > > > cheers, > >> > > > > > > Richard > >> > > > > > > > >> > > > > > > > >> > > > > -- > >> > > > > Richard Holland (BioMart Team) > >> > > > > EMBL-EBI > >> > > > > Wellcome Trust Genome Campus > >> > > > > Hinxton > >> > > > > Cambridge CB10 1SD > >> > > > > UNITED KINGDOM > >> > > > > Tel: +44-(0)1223-494416 > >> > > > > > >> > > > > > >> > > > > >> > > > > >> > > -- > >> > > Richard Holland (BioMart Team) > >> > > EMBL-EBI > >> > > Wellcome Trust Genome Campus > >> > > Hinxton > >> > > Cambridge CB10 1SD > >> > > UNITED KINGDOM > >> > > Tel: +44-(0)1223-494416 > >> > > > >> > > > >> > > >> > > >> -- > >> Richard Holland (BioMart Team) > >> EMBL-EBI > >> Wellcome Trust Genome Campus > >> Hinxton > >> Cambridge CB10 1SD > >> UNITED KINGDOM > >> Tel: +44-(0)1223-494416 > >> > >> > > > > > > -- > > Best Regards, > > > > > > Seth Johnson > > Senior Bioinformatics Associate > > > > Ph: (202) 470-0900 > > Fx: (775) 251-0358 > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From richard.holland at ebi.ac.uk Wed Jun 7 05:01:54 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Wed, 07 Jun 2006 10:01:54 +0100 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149520267.3947.36.camel@texas.ebi.ac.uk> <1149522313.3947.48.camel@texas.ebi.ac.uk> Message-ID: <1149670914.3947.119.camel@texas.ebi.ac.uk> OK, I've updated INSDseqFormat to 1.4, or my interpretation of it based on what the guys next door told me. 
Please let me know if you have trouble running the XML it produces through any other parsers that can read it, or if it throws a wobbly whilst reading stuff you are 100% sure is valid. cheers, Richard On Mon, 2006-06-05 at 12:28 -0400, Seth Johnson wrote: > I agree with you on that one. However, the problem might be a little > deeper. The same '?' appears in the INSDseq format bounded by > tags and causes the following exception. > This tells me that the '?' are actually values that are being > incorrectly parsed. Further examination of the .dtd reveals that > INSDseqFormat.java is tailored towards the INSDSeq v. 1.3 whereas the > files I obtain are in the INSDSeq v. 1.4 (which among other things > contain a new tag ). Here're links to both > .dtd's: > > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt > > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt > > I think it might be worth accommodating changes for the INSDseq > format, not sure how that would affect the '?' in Genbank. > > Seth > > ====================== > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > Caused by: org.biojava.bio.seq.io.ParseException: > org.biojava.bio.seq.io.ParseException: Bad reference line found: ? > at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > ... 1 more > Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901) > at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) > at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241) > at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685) > at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368) > at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834) > at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) > at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148) > at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242) > at javax.xml.parsers.SAXParser.parse(SAXParser.java:375) > at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97) > at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246) > ... 2 more > Java Result: -1 > ====================== > > ~~~~~~~~~~~~~~~~~~~~~~ > > > ? > 1..16732 > > Bjornerfeldt,S. > Webster,M.T. > Vila,C. > > Relaxation of Selective Constraint on Dog > Mitochondrial DNA Following Domestication > Unpublished > > > ? > 1..16732 > > Bjornerfeldt,S. > Webster,M.T. > Vila,C. > > Submitted (06-APR-2006) to the > EMBL/GenBank/DDBJ databases. Evolutionary Biology, Evolutionary > Biology, Norbyvagen 18D, Uppsala 752 36, > Sweden > > > ~~~~~~~~~~~~~~~~~~~~~~ > > On 6/5/06, Richard Holland wrote: > > Hmmm... interesting. 
I _could_ put in a special case that ignores the > > question marks, but that wouldn't be 'nice' really - this is more of a > > problem with the program that is producing the Genbank files than a > > problem with the parser trying to read them. '?' is not a valid tag in > > the official Genbank format, and has no meaning attached to it that I > > can work out, so I'm reluctant to make the parser recognise it. > > > > I'd suggest you contact the people who write the software you are using > > to produce the Genbank files and ask them if they could stick to the > > rules! > > > > In the meantime you could work around the problem by stripping the > > question marks in some kind of pre-processor before passing it onto > > BioJavaX for parsing. > > > > cheers, > > Richard > > > > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote: > > > Removing '?' (or several of them in my case) avoids the following exception: > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > org.biojava.bio.BioException: Could not read sequence > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957 > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > ... 1 more > > > Java Result: -1 > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > I don't know where that previous tokenization problem came from since > > > I can no longer reproduce it. This time it's more or less straight > > > forward. > > > Here's the original file with question marks: > > > ============================ > > > LOCUS DQ415957 1437 bp mRNA linear VRT 01-JUN-2006 > > > DEFINITION Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA, > > > complete cds. > > > ACCESSION DQ415957 > > > VERSION DQ415957.1 GI:89513612 > > > KEYWORDS . 
> > > SOURCE Unknown. > > > ORGANISM Unknown. > > > Unclassified. > > > ? > > > ? > > > FEATURES Location/Qualifiers > > > ? > > > gene 1..1437 > > > /gene="cmg2a" > > > CDS 1..1437 > > > /gene="cmg2a" > > > /note="cell surface receptor; similar to anthrax toxin > > > receptor 2 (ANTXR2, ATR2, CMG2)" > > > /codon_start=1 > > > /product="capillary morphogenesis protein 2A" > > > /protein_id="ABD74633.1" > > > /db_xref="GI:89513613" > > > /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS > > > GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS > > > EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY > > > GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS > > > SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT > > > VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC > > > CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS > > > TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL > > > RRQYDRVSVMRPTSADKGRCMNFSRTQH" > > > ORIGIN > > > 1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc > > > 61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg > > > 121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat > > > 181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga > > > 241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc > > > 301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact > > > 361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga > > > 421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat > > > 481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg > > > 541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc > > > 601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc > > > 661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa > > > 721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc > > > 781 
aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa > > > 841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc > > > 901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc > > > 961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa > > > 1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc > > > 1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc > > > 1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag > > > 1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag > > > 1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga > > > 1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg > > > 1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa > > > // > > > > > > ============================ > > > > > > > > > On 6/5/06, Richard Holland wrote: > > > > Hi again. > > > > > > > > Could you remove the offending question mark from the GenBank file and > > > > try it again to see if that fixes it? The parser should just ignore it > > > > but apparently not. The error looks weird to me because the tokenization > > > > for a DNA GenBank file _does_ contain the letter 't'! Not sure what's > > > > going on here. > > > ... 
> > > > > > > > cheers, > > > > Richard > > > > > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote: > > > > > Hell again Richard, > > > > > > > > > > No sooner I've said about the fix of the last parsing exception than > > > > > another one came up with Genbank format: > > > > > -------------------------------------- > > > > > org.biojava.bio.seq.io.ParseException: DQ431065 > > > > > org.biojava.bio.BioException: Could not read sequence > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > > at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) > > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) > > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:326) > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 > > > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > ... 3 more > > > > > org.biojava.bio.seq.io.ParseException: > > > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization > > > > > doesn't contain character: 't' > > > > > ---------------------------------------- > > > > > The Genbank file that caused it is as follows: > > > > > ========================================= > > > > > LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006 > > > > > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial > > > > > sequence; mitochondrial. > > > > > ACCESSION DQ431065 > > > > > VERSION DQ431065.1 GI:90102206 > > > > > KEYWORDS . > > > > > SOURCE Vaccinium corymbosum > > > > > ORGANISM Vaccinium corymbosum > > > > > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; > > > > > Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; > > > > > asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae; > > > > > Vaccinium. > > > > > ? 
> > > > > REFERENCE 2 (bases 1 to 425) > > > > > AUTHORS Naik,L.D. and Rowland,L.J. > > > > > TITLE Expressed Sequence Tags of cDNA clones from subtracted library of > > > > > Vaccinium corymbosum > > > > > JOURNAL Unpublished (2005) > > > > > FEATURES Location/Qualifiers > > > > > source 1..425 > > > > > /organism="Vaccinium corymbosum" > > > > > /mol_type="genomic DNA" > > > > > /cultivar="Bluecrop" > > > > > /db_xref="taxon:69266" > > > > > /tissue_type="Flower buds" > > > > > /clone_lib="Subtracted cDNA library of Vaccinium > > > > > corymbosum" > > > > > /dev_stage="399 hour chill unit exposure" > > > > > /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I" > > > > > rRNA <1..>425 > > > > > /product="16S ribosomal RNA" > > > > > ORIGIN > > > > > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac > > > > > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt > > > > > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt > > > > > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta > > > > > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat > > > > > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta > > > > > 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag > > > > > 421 cgtaa > > > > > // > > > > > ================================== > > > > > I think it's the presence of the '?' at the beginning of the line?!?! > > > > > I'm not sure wether the information that was supposed to be present > > > > > instead of those question marks is absent from the original ASN.1 > > > > > batch file or it's a bug in the NCBI ASN2GO software. It looks to me > > > > > that the former is the case since the file from NCBI website contains > > > > > much more information than the batch file. Just bringing this to > > > > > everyone's attention. 
> > > > > > > > > > > > > > > -- > > > > > Best Regards, > > > > > > > > > > > > > > > Seth Johnson > > > > > Senior Bioinformatics Associate > > > > > > > > > > Ph: (202) 470-0900 > > > > > Fx: (775) 251-0358 > > > > > > > > > > On 6/2/06, Richard Holland wrote: > > > > > > Hi Seth. > > > > > > > > > > > > Your second point, about the authors string not being read correctly in > > > > > > Genbank format, has been fixed (or should have been if I got the code > > > > > > right!). Could you check the latest version of biojava-live out of CVS > > > > > > and give it another go? Basically the parser did not recognise the > > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by > > > > > > NCBI, which is what I based the parser on. > > > > > ... > > > > > > > > > > > > cheers, > > > > > > Richard > > > > > > > > > > > > > > > > -- > > > > Richard Holland (BioMart Team) > > > > EMBL-EBI > > > > Wellcome Trust Genome Campus > > > > Hinxton > > > > Cambridge CB10 1SD > > > > UNITED KINGDOM > > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > > > > > > -- > > Richard Holland (BioMart Team) > > EMBL-EBI > > Wellcome Trust Genome Campus > > Hinxton > > Cambridge CB10 1SD > > UNITED KINGDOM > > Tel: +44-(0)1223-494416 > > > > > > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From mark.schreiber at novartis.com Wed Jun 7 05:09:27 2006 From: mark.schreiber at novartis.com (mark.schreiber at novartis.com) Date: Wed, 7 Jun 2006 17:09:27 +0800 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files Message-ID: Presumably the XML it produces should validate against the dtd? It should also parse anything that validates against the dtd. I think that would be the baseline for behaviour of the parser.
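For anyone wanting to try the "does it validate against the dtd?" check on their own files, here is a rough sketch using only the standard JAXP classes that ship with the JDK. It is not part of BioJava: the class name DtdValidationSketch and the toy DTD (record, accession, reference) are made up for illustration and are NOT the real INSDSeq DTD. The DTD is embedded in the document so the example runs offline; for real INSDseq output you would presumably point the DOCTYPE at the published DTD instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not a BioJava class: validates an XML string
// against the DTD named in its own DOCTYPE declaration.
class DtdValidationSketch {

    // Toy DTD (an assumption for illustration, not the INSDSeq DTD):
    // a record must have one accession followed by zero or more references.
    static final String DOCTYPE =
        "<!DOCTYPE record [\n"
        + "  <!ELEMENT record (accession, reference*)>\n"
        + "  <!ELEMENT accession (#PCDATA)>\n"
        + "  <!ELEMENT reference (#PCDATA)>\n"
        + "]>\n";

    static final String VALID_XML = DOCTYPE
        + "<record><accession>DQ415957</accession>"
        + "<reference>1..1437</reference></record>";

    // Missing the mandatory accession element, so it should fail validation.
    static final String INVALID_XML = DOCTYPE
        + "<record><reference>1..1437</reference></record>";

    // Returns the validity errors reported by the parser; an empty list
    // means the document validated against its DTD.
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // turn on DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> errors = new ArrayList<String>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                // warnings are not validity errors; ignored in this sketch
            }
            public void error(SAXParseException e) {
                errors.add(e.getMessage()); // validity errors are recoverable
            }
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // not well-formed: give up
            }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("valid doc errors:   " + validate(VALID_XML));
        System.out.println("invalid doc errors: " + validate(INVALID_XML));
    }
}
```

Collecting the errors rather than throwing on the first one makes it easier to see every place where a file and the DTD disagree, which might help in working out whether the file or the DTD is at fault.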
From richard.holland at ebi.ac.uk Wed Jun 7 07:56:04 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Wed, 07 Jun 2006 12:56:04 +0100 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: References: Message-ID: <1149681364.3947.120.camel@texas.ebi.ac.uk> That'd be nice, except the DTD has bugs in it!
I've pointed this out to them already but no fixes have been made yet. On Wed, 2006-06-07 at 17:09 +0800, mark.schreiber at novartis.com wrote: > Presumably the XML it produces should validate against the dtd? It should > also parse anything that validates against the dtd. I think that would be > the baseline for behaviour of the parser. > > > > > > > Richard Holland > Sent by: biojava-l-bounces at lists.open-bio.org > 06/07/2006 05:01 PM > > > To: Seth Johnson > cc: biojava-l at lists.open-bio.org, (bcc: Mark Schreiber/GP/Novartis) > Subject: Re: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 > daily update files > > > OK, I've updated INSDseqFormat to 1.4, or my interpretation of it based > on what the guys next door told me. Please let me know if you have > trouble running the XML it produces through any other parsers that can > read it, or if it throws a wobbly whilst reading stuff you are 100% sure > is valid. > > cheers, > Richard > > On Mon, 2006-06-05 at 12:28 -0400, Seth Johnson wrote: > > I agree with you on that one. However, the problem might be a little > > deeper. The same '?' appear in the INSDseq format bounded by > > tags and cause the following exception. > > This tells me that the '?' are actually values that are being > > incorrectly parsed. Further examination of the .dtd reveals that > > INSDseqFormat.java is tailored towards the INSDSeq v. 1.3 whereas the > > files I obtain are in the INSDSeq v. 1.4 (which among other things > > contain a new tag ). Here're links to both > > .dtd's: > > > > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt > > > > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt > > > > I think it might be worth accommodating changes for the INSDseq > > format, not sure how that would affect the '?' in Genbank.
> > > > Seth > > > > ====================== > > org.biojava.bio.BioException: Could not read sequence > > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > > Caused by: org.biojava.bio.seq.io.ParseException: > > org.biojava.bio.seq.io.ParseException: Bad reference line found: ? > > at > org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250) > > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ... 1 more > > Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line > found: ? > > at > org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901) > > at > com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) > > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241) > > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685) > > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368) > > at > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834) > > at > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) > > at > com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148) > > at > com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242) > > at javax.xml.parsers.SAXParser.parse(SAXParser.java:375) > > at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97) > > at > org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246) > > ... 
2 more > > Java Result: -1 > > ====================== > > > > ~~~~~~~~~~~~~~~~~~~~~~ > > > > > > ? > > 1..16732 > > > > Bjornerfeldt,S. > > Webster,M.T. > > Vila,C. > > > > Relaxation of Selective Constraint on Dog > > Mitochondrial DNA Following Domestication > > Unpublished > > > > > > ? > > 1..16732 > > > > Bjornerfeldt,S. > > Webster,M.T. > > Vila,C. > > > > Submitted (06-APR-2006) to the > > EMBL/GenBank/DDBJ databases. Evolutionary Biology, Evolutionary > > Biology, Norbyvagen 18D, Uppsala 752 36, > > Sweden > > > > > > ~~~~~~~~~~~~~~~~~~~~~~ > > > > On 6/5/06, Richard Holland wrote: > > > Hmmm... interesting. I _could_ put in a special case that ignores the > > > question marks, but that wouldn't be 'nice' really - this is more of a > > > problem with the program that is producing the Genbank files than a > > > problem with the parser trying to read them. '?' is not a valid tag in > > > the official Genbank format, and has no meaning attached to it that I > > > can work out, so I'm reluctant to make the parser recognise it. > > > > > > I'd suggest you contact the people who write the software you are > using > > > to produce the Genbank files and ask them if they could stick to the > > > rules! > > > > > > In the meantime you could work around the problem by stripping the > > > question marks in some kind of pre-processor before passing it onto > > > BioJavaX for parsing. > > > > > > cheers, > > > Richard > > > > > > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote: > > > > Removing '?' 
(or several of them in my case) avoids the following > exception: > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > org.biojava.bio.BioException: Could not read sequence > > > > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > at > exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957 > > > > at > org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > > > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > ... 1 more > > > > Java Result: -1 > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > I don't know where that previous tokenization problem came from > since > > > > I can no longer reproduce it. This time it's more or less straight > > > > forward. > > > > Here's the original file with question marks: > > > > ============================ > > > > LOCUS DQ415957 1437 bp mRNA linear VRT > 01-JUN-2006 > > > > DEFINITION Danio rerio capillary morphogenesis protein 2A (cmg2a) > mRNA, > > > > complete cds. > > > > ACCESSION DQ415957 > > > > VERSION DQ415957.1 GI:89513612 > > > > KEYWORDS . > > > > SOURCE Unknown. > > > > ORGANISM Unknown. > > > > Unclassified. > > > > ? > > > > ? > > > > FEATURES Location/Qualifiers > > > > ? 
> > > > gene 1..1437 > > > > /gene="cmg2a" > > > > CDS 1..1437 > > > > /gene="cmg2a" > > > > /note="cell surface receptor; similar to > anthrax toxin > > > > receptor 2 (ANTXR2, ATR2, CMG2)" > > > > /codon_start=1 > > > > /product="capillary morphogenesis protein 2A" > > > > /protein_id="ABD74633.1" > > > > /db_xref="GI:89513613" > > > > /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS > > > > GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS > > > > EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY > > > > GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS > > > > SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT > > > > VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC > > > > CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS > > > > TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL > > > > RRQYDRVSVMRPTSADKGRCMNFSRTQH" > > > > ORIGIN > > > > 1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt > ctgtttatgc > > > > 61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct > gtactttgtg > > > > 121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt > tgtcaaaaat > > > > 181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt > ttcatcaaga > > > > 241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg > cctgaagacc > > > > 301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa > attggcaact > > > > 361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt > gactgatgga > > > > 421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc > aaggaagtat > > > > 481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct > agccgatgtg > > > > 541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct > caaaggcatc > > > > 601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc > gtccagcgtc > > > > 661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt > ggggagacaa > > > > 721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca > aaaaccaacc > > > > 781 aaagtgaaga ttgactacat 
cctatgtcct gctccagtgc tgtatacagt > tggacagcaa > > > > 841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc > tttcatcatc > > > > 901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt > gctttttctc > > > > 961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt > cgttattaaa > > > > 1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga > cccggaaccc > > > > 1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc > tggtggaatc > > > > 1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc > aagactagag > > > > 1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat > ggtcaaaaag > > > > 1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac > accaatcaga > > > > 1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt > ttcagttatg > > > > 1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca > gcattaa > > > > // > > > > > > > > ============================ > > > > > > > > > > > > On 6/5/06, Richard Holland wrote: > > > > > Hi again. > > > > > > > > > > Could you remove the offending question mark from the GenBank file > and > > > > > try it again to see if that fixes it? The parser should just > ignore it > > > > > but apparently not. The error looks weird to me because the > tokenization > > > > > for a DNA GenBank file _does_ contain the letter 't'! Not sure > what's > > > > > going on here. > > > > ... 
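The manual step discussed in this thread (removing the lines that hold nothing but a '?' before parsing) could also be done by a small pre-processor. What follows is only a minimal sketch in plain Java, not part of BioJava or BioJavaX: the class name QuestionMarkFilter and its strip method are hypothetical names made up for illustration, and it assumes the stray '?' always sits alone on its line, as in the records quoted here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper, not a BioJava class: drops lines that hold only a '?'.
class QuestionMarkFilter {

    /**
     * Copies every line from 'in' to 'out', skipping lines whose only
     * non-whitespace content is a single '?'. All other lines are written
     * unchanged, each terminated by '\n'.
     */
    static void strip(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().equals("?")) {
                continue; // stray placeholder line, not a valid GenBank tag
            }
            out.write(line);
            out.write('\n');
        }
        out.flush();
    }
}
```

The filtered text, for example collected in a StringWriter, can then be wrapped in a BufferedReader and handed to whichever reader method is already being used for the unfiltered file. A '?' that appears inside an ordinary line is left untouched.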
> > > > > > > > > > cheers, > > > > > Richard > > > > > > > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote: > > > > > > Hello again Richard, > > > > > > > > > > > > No sooner had I mentioned the fix for the last parsing exception > than > > > > > > another one came up with Genbank format: > > > > > > -------------------------------------- > > > > > > org.biojava.bio.seq.io.ParseException: DQ431065 > > > > > > org.biojava.bio.BioException: Could not read sequence > > > > > > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > > > at > exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) > > > > > > at > exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) > > > > > > at > exonhit.parsers.GenBankParser.main(GenBankParser.java:326) > > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 > > > > > > at > org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > > > > > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > ... 3 more > > > > > > org.biojava.bio.seq.io.ParseException: > > > > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization > > > > > > doesn't contain character: 't' > > > > > > ---------------------------------------- > > > > > > The Genbank file that caused it is as follows: > > > > > > ========================================= > > > > > > LOCUS DQ431065 425 bp DNA linear > INV 01-JUN-2006 > > > > > > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, > partial > > > > > > sequence; mitochondrial. > > > > > > ACCESSION DQ431065 > > > > > > VERSION DQ431065.1 GI:90102206 > > > > > > KEYWORDS .
> > > > > > SOURCE Vaccinium corymbosum > > > > > > ORGANISM Vaccinium corymbosum > > > > > > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; > Tracheophyta; > > > > > > Spermatophyta; Magnoliophyta; eudicotyledons; core > eudicotyledons; > > > > > > asterids; Ericales; Ericaceae; Vaccinioideae; > Vaccinieae; > > > > > > Vaccinium. > > > > > > ? > > > > > > REFERENCE 2 (bases 1 to 425) > > > > > > AUTHORS Naik,L.D. and Rowland,L.J. > > > > > > TITLE Expressed Sequence Tags of cDNA clones from > subtracted library of > > > > > > Vaccinium corymbosum > > > > > > JOURNAL Unpublished (2005) > > > > > > FEATURES Location/Qualifiers > > > > > > source 1..425 > > > > > > /organism="Vaccinium corymbosum" > > > > > > /mol_type="genomic DNA" > > > > > > /cultivar="Bluecrop" > > > > > > /db_xref="taxon:69266" > > > > > > /tissue_type="Flower buds" > > > > > > /clone_lib="Subtracted cDNA library of > Vaccinium > > > > > > corymbosum" > > > > > > /dev_stage="399 hour chill unit exposure" > > > > > > /note="Vector: pCR4TOPO; Site_1: Eco R I; > Site_2: Eco R I" > > > > > > rRNA <1..>425 > > > > > > /product="16S ribosomal RNA" > > > > > > ORIGIN > > > > > > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct > gcccgctgac > > > > > > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc > attagttctt > > > > > > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt > ttgaattgtt > > > > > > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga > gaagacccta > > > > > > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg > gccgtctaat > > > > > > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt > attatattta > > > > > > 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta > gggataacag > > > > > > 421 cgtaa > > > > > > // > > > > > > ================================== > > > > > > I think it's the presence of the '?' at the beginning of the > line?!?! 
> > > > > > I'm not sure whether the information that was supposed to be > present > > > > > > instead of those question marks is absent from the original > ASN.1 > > > > > > batch file or it's a bug in the NCBI ASN2GO software. It looks > to me > > > > > > that the former is the case since the file from NCBI website > contains > > > > > > much more information than the batch file. Just bringing this to > > > > > > everyone's attention. > > > > > > > > > > > > > > > > > > -- > > > > > > Best Regards, > > > > > > > > > > > > > > > > > > Seth Johnson > > > > > > Senior Bioinformatics Associate > > > > > > > > > > > > Ph: (202) 470-0900 > > > > > > Fx: (775) 251-0358 > > > > > > > > > > > > On 6/2/06, Richard Holland wrote: > > > > > > > Hi Seth. > > > > > > > > > > > > > > Your second point, about the authors string not being read > correctly in > > > > > > > Genbank format, has been fixed (or should have been if I got > the code > > > > > > > right!). Could you check the latest version of biojava-live > out of CVS > > > > > > > and give it another go? Basically the parser did not recognise > the > > > > > > > CONSRTM tag, as it is not mentioned in the sample record > provided by > > > > > > > NCBI, which is what I based the parser on. > > > > > > ...
> > > > > > > > > > > > > > cheers, > > > > > > > Richard > > > > > > > > > > > > > > > > > > > -- > > > > > Richard Holland (BioMart Team) > > > > > EMBL-EBI > > > > > Wellcome Trust Genome Campus > > > > > Hinxton > > > > > Cambridge CB10 1SD > > > > > UNITED KINGDOM > > > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > > > > > > > > > > > -- > > > Richard Holland (BioMart Team) > > > EMBL-EBI > > > Wellcome Trust Genome Campus > > > Hinxton > > > Cambridge CB10 1SD > > > UNITED KINGDOM > > > Tel: +44-(0)1223-494416 > > > > > > > > > > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From johnson.biotech at gmail.com Wed Jun 7 11:36:13 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Wed, 7 Jun 2006 11:36:13 -0400 Subject: [Biojava-l] Parsing INSDseq Sequences (1.3 & 1.4) In-Reply-To: <1149606788.3947.87.camel@texas.ebi.ac.uk> References: <1149606788.3947.87.camel@texas.ebi.ac.uk> Message-ID: Hello again Richard, Thank you for updating the INSDseqFormat to 1.4 so promptly. Another reason I inquired about accessing different terms is because the code: rs.getAnnotation().getProperty(Terms.getMolTypeTerm()) When the above is executed after parsing the INSDseq file it produces the following exception: ~~~~~~~~~~~~~~~~~~~~~ Exception in thread "main" java.util.NoSuchElementException: No such property: biojavax:moltype, rank 0 at org.biojavax.SimpleRichAnnotation.getNote(SimpleRichAnnotation.java:137) at org.biojavax.SimpleRichAnnotation.getProperty(SimpleRichAnnotation.java:147) at exonhit.parsers.GenBankParser.main(GenBankParser.java:370) ~~~~~~~~~~~~~~~~~~~~~ The file that I'm parsing is as follows and does contain the 'moltype': +++++++++++++++++++++ AY069118 1502 single mRNA linear INV

17-DEC-2001

15-DEC-2001

Drosophila melanogaster GH13089 full length cDNA

AY069118

AY069118.1

gb|AY069118.1| gi|17861571

AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTGACTATCACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA ~~~~~~~~~~~~~~~~~~~~~~ On 6/8/06, Richard Holland wrote: > > Yesterday I think I said I was going to add other-seqids but I forgot to > do it, so I did it just now. Try it and see. Use the new > INSDseqFormat.Terms.getOtherSeqIdTerm() term to find them. > > cheers, > Richard > > On Wed, 2006-06-07 at 19:48 -0400, Seth Johnson wrote: > > Hi Richard, > > > > I still cannot locate the GI number for the main sequence. 
After I > > parse it with readINSDseqDNA, I then use: > > > > Note [] myAccs = ((RichAnnotation)rs.getAnnotation > > ()).getProperties(Terms.getAdditionalAccessionTerm ()); > > > > However, the 'myAccs' appears to be empty. Am I on the wrong track to > > get to other-seqids??? > > > > On 6/6/06, Richard Holland wrote: > > GenBank has a separate line for GI number, so it can be parsed > > out > > nicely. INSDseq does not, so you have to rely on the other- > > seqids tag > > and hope that one of them is the GI number. However it seems I > > have not > > included that tag in the parser, so I will include it. This > > will make > > the other-seqids values available through the notes with the > > term > > Terms.getAdditionalAccessionTerm(), but getIdentifier() will > > remain > > null. > > > > For your second question, the tutorial makes the mistake in > > several > > places of saying getNoteSet(Terms.blahblah()). This was > > shorthand for: > > > > rs.getAnnotation().getProperty(Terms.blahblah()) > > (for single values) > > > > or > > > > ((RichAnnotation)rs.getAnnotation()).getProperties > > ( Terms.blahblah()) > > (for multiple values) > > > > but never got expanded. Maybe someone can fix that one > > day... :)ded... > > > > I'm just updating INSDseq to 1.4 now. The guys next door gave > > me the > > details of the changes, and told me that 1.3 is actually no > > longer > > supported by them after Friday this week! So I'll make it 1.4 > > only. 
> > > > cheers, > > Richard > > > -- > Richard Holland (BioMart Team) > EMBL-EBI > Wellcome Trust Genome Campus > Hinxton > Cambridge CB10 1SD > UNITED KINGDOM > Tel: +44-(0)1223-494416 > > -- Best Regards, Seth Johnson Senior Bioinformatics Associate Ph: (202) 470-0900 Fx: (775) 251-0358 From richard.holland at ebi.ac.uk Mon Jun 12 04:37:23 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Mon, 12 Jun 2006 09:37:23 +0100 Subject: [Biojava-l] Parsing INSDseq Sequences (1.3 & 1.4) In-Reply-To: References: <1149606788.3947.87.camel@texas.ebi.ac.uk> <1149758062.3947.187.camel@texas.ebi.ac.uk> Message-ID: <1150101444.3952.6.camel@texas.ebi.ac.uk> Typo in code. my fault. Try again! On Thu, 2006-06-08 at 10:23 -0400, Seth Johnson wrote: > I'm still getting an empty array back from this: > > Note [] myAccs = ((RichAnnotation)rs.getAnnotation()).getProperties > (INSDseqFormat.Terms.getOtherSeqIdTerm()); > > Here's the file that I'm parsing: > ~~~~~~~~~~~~~~~~~~~~~~ > > "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd"> > > > AY069118 > 1502 > single > mRNA > linear > INV > 17-DEC-2001 > 15-DEC-2001 > Drosophila melanogaster GH13089 full length > cDNA > AY069118 > AY069118.1 > > gb|AY069118.1| > gi|17861571 > > > FLI_CDNA > > Drosophila melanogaster (fruit > fly) > Drosophila melanogaster > Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; > Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; > Ephydroidea; Drosophilidae; Drosophila > > > 1 (bases 1 to > 1502) > 1..1502 > > Stapleton,M. > Brokstein,P. > Hong,L. > Agbayani,A. > Carlson,J. > Champe,M. > Chavez,C. > Dorsett,V. > Farfan,D. > Frise,E. > George,R. > Gonzalez,M. > Guarin,H. > Li,P. > Liao,G. > Miranda,A. > Mungall,C.J. > Nunoo,J. > Pacleb,J. > Paragas,V. > Park,S. > Phouanenavong,S. > Wan,K. > Yu,C. > Lewis,S.E. > Rubin,G.M. > Celniker,S. 
> > Direct Submission > Submitted (10-DEC-2001) Berkeley > Drosophila Genome Project, Lawrence Berkeley National Laboratory, One > Cyclotron Road, Berkeley, CA 94720, USA > > > Sequence submitted by: Berkeley Drosophila Genome > Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This > clone was sequenced as part of a high-throughput process to sequence > clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000). > The sequence has been subjected to integrity checks for sequence > accuracy, presence of a polyA tail and contiguity within 100 kb in the > genome. Thus we believe the sequence to reflect accurately this > particular cDNA clone. However, there are artifacts associated with > the generation of cDNA clones that may have not been detected in our > initial analyses such as internal priming, priming from contaminating > genomic DNA, retained introns due to reverse transcription of > unspliced precursor RNAs, and reverse transcriptase errors that result > in single base changes. For further information about this sequence, > including its location and relationship to other sequences, please > visit our Web site ( http://fruitfly.berkeley.edu) or send email to > cdna at fruitfly.berkeley.edu. 
> > > source > 1..1502 > > > 1 > 1502 > AY069118.1 > > > > > organism > Drosophila > melanogaster > > > mol_type > mRNA > > > strain > y; cn bw sp > > > db_xref > taxon:7227 > > > map > 39B3-39B3 > > > > > gene > 1..1502 > > > 1 > 1502 > AY069118.1 > > > > > gene > E2f2 > > > note > alignment with genomic scaffold > AE003669 > > > db_xref > > FLYBASE:FBgn0024371 > > > > > CDS > 189..1301 > > > 189 > 1301 > AY069118.1 > > > > > gene > E2f2 > > > note > Longest ORF > > > codon_start > 1 > > > transl_table > 1 > > > product > GH13089p > > > protein_id > AAL39263.1 > > > db_xref > GI:17861572 > > > db_xref > > FLYBASE:FBgn0024371 > > > translation > > MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS > > > > > > 
AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTGACTATCACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA > > > ~~~~~~~~~~~~~~~~~~~~~~ > > On 6/8/06, Richard Holland wrote: > Yesterday I think I said I was going to add other-seqids but I > forgot to > do it, so I did it just now. Try it and see. Use the new > INSDseqFormat.Terms.getOtherSeqIdTerm() term to find them. > > cheers, > Richard > > On Wed, 2006-06-07 at 19:48 -0400, Seth Johnson wrote: > > Hi Richard, > > > > I still cannot locate the GI number for the main > sequence. 
After I > > parse it with readINSDseqDNA, I then use: > > > > Note [] myAccs = > ((RichAnnotation)rs.getAnnotation > > ()).getProperties(Terms.getAdditionalAccessionTerm ()); > > > > However, the 'myAccs' appears to be empty. Am I on the > wrong track to > > get to other-seqids??? > > > > On 6/6/06, Richard Holland < richard.holland at ebi.ac.uk> > wrote: > > GenBank has a separate line for GI number, so it can > be parsed > > out > > nicely. INSDseq does not, so you have to rely on the > other- > > seqids tag > > and hope that one of them is the GI number. However > it seems I > > have not > > included that tag in the parser, so I will include > it. This > > will make > > the other-seqids values available through the notes > with the > > term > > Terms.getAdditionalAccessionTerm(), but > getIdentifier() will > > remain > > null. > > > > For your second question, the tutorial makes the > mistake in > > several > > places of saying getNoteSet(Terms.blahblah()). This > was > > shorthand for: > > > > rs.getAnnotation().getProperty(Terms.blahblah()) > > (for single values) > > > > or > > > > ((RichAnnotation)rs.getAnnotation()).getProperties > > ( Terms.blahblah ()) > > (for multiple values) > > > > but never got expanded. Maybe someone can fix that > one > > day... :)ded... > > > > I'm just updating INSDseq to 1.4 now. The guys next > door gave > > me the > > details of the changes, and told me that 1.3 is > actually no > > longer > > supported by them after Friday this week! So I'll > make it 1.4 > > only. 
> > > > cheers, > > Richard > > > -- > Richard Holland (BioMart Team) > EMBL-EBI > Wellcome Trust Genome Campus > Hinxton > Cambridge CB10 1SD > UNITED KINGDOM > Tel: +44-(0)1223-494416 > > > > > -- > Best Regards, > > > Seth Johnson > Senior Bioinformatics Associate > > Ph: (202) 470-0900 > Fx: (775) 251-0358 -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From johnson.biotech at gmail.com Tue Jun 13 12:28:24 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Tue, 13 Jun 2006 12:28:24 -0400 Subject: [Biojava-l] Parsing INSDseq Sequences (1.3 & 1.4) In-Reply-To: References: <1149606788.3947.87.camel@texas.ebi.ac.uk> <1149758062.3947.187.camel@texas.ebi.ac.uk> <1150101444.3952.6.camel@texas.ebi.ac.uk> Message-ID: Works like a charm now!!! :) I figured it was a typo somewhere on Friday, but couldn't find the source. I didn't think tag info was case sensitive. On 6/12/06, Richard Holland wrote: > > Typo in code. my fault. Try again! > > > > On Thu, 2006-06-08 at 10:23 -0400, Seth Johnson wrote: > > I'm still getting an empty array back from this: > > > > Note [] myAccs = ((RichAnnotation)rs.getAnnotation()).getProperties > > (INSDseqFormat.Terms.getOtherSeqIdTerm()); > > > > Here's the file that I'm parsing: > > ~~~~~~~~~~~~~~~~~~~~~~ > > > > > "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd"> > > > > > > AY069118 > > 1502 > > single > > mRNA > > linear > > INV > > 17-DEC-2001 > > 15-DEC-2001 > > Drosophila melanogaster GH13089 full length > > cDNA > > AY069118 > > AY069118.1 > > > > gb|AY069118.1| > > gi|17861571 > > > > > > FLI_CDNA > > > > Drosophila melanogaster (fruit > > fly) > > Drosophila melanogaster > > Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; > > Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; > > Ephydroidea; Drosophilidae; Drosophila > > > > > > 1 (bases 1 to > > 1502) > > 1..1502 > > > > Stapleton,M. > > Brokstein,P. 
> > Hong,L. > > Agbayani,A. > > Carlson,J. > > Champe,M. > > Chavez,C. > > Dorsett,V. > > Farfan,D. > > Frise,E. > > George,R. > > Gonzalez,M. > > Guarin,H. > > Li,P. > > Liao,G. > > Miranda,A. > > Mungall,C.J. > > Nunoo,J. > > Pacleb,J. > > Paragas,V. > > Park,S. > > Phouanenavong,S. > > Wan,K. > > Yu,C. > > Lewis,S.E. > > Rubin, G.M. > > Celniker,S. > > > > Direct Submission > > Submitted (10-DEC-2001) Berkeley > > Drosophila Genome Project, Lawrence Berkeley National Laboratory, One > > Cyclotron Road, Berkeley, CA 94720, USA > > > > > > Sequence submitted by: Berkeley Drosophila Genome > > Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This > > clone was sequenced as part of a high-throughput process to sequence > > clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000). > > The sequence has been subjected to integrity checks for sequence > > accuracy, presence of a polyA tail and contiguity within 100 kb in the > > genome. Thus we believe the sequence to reflect accurately this > > particular cDNA clone. However, there are artifacts associated with > > the generation of cDNA clones that may have not been detected in our > > initial analyses such as internal priming, priming from contaminating > > genomic DNA, retained introns due to reverse transcription of > > unspliced precursor RNAs, and reverse transcriptase errors that result > > in single base changes. For further information about this sequence, > > including its location and relationship to other sequences, please > > visit our Web site ( http://fruitfly.berkeley.edu) or send email to > > cdna at fruitfly.berkeley.edu. 
> > > > > > source > > 1..1502 > > > > > > 1 > > 1502 > > AY069118.1 > > > > > > > > > > organism > > Drosophila > > melanogaster > > > > > > mol_type > > mRNA > > > > > > strain > > y; cn bw sp > > > > > > db_xref > > taxon:7227 > > > > > > map > > 39B3-39B3 > > > > > > > > > > gene > > 1..1502 > > > > > > 1 > > 1502 > > AY069118.1 > > > > > > > > > > gene > > E2f2 > > > > > > note > > alignment with genomic scaffold > > AE003669 > > > > > > db_xref > > > > FLYBASE:FBgn0024371 > > > > > > > > > > CDS > > 189..1301 > > > > > > 189 > > 1301 > > AY069118.1 > > > > > > > > > > gene > > E2f2 > > > > > > note > > Longest ORF > > > > > > codon_start > > 1 > > > > > > transl_table > > 1 > > > > > > product > > GH13089p > > > > > > protein_id > > AAL39263.1 > > > > > > db_xref > > GI:17861572 > > > > > > db_xref > > > > FLYBASE:FBgn0024371 > > > > > > translation > > > > > MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS > > > > > > > > > > > > > > 
AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTGACTATCACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~ > > > > On 6/8/06, Richard Holland wrote: > > Yesterday I think I said I was going to add other-seqids but I > > forgot to > > do it, so I did it just now. Try it and see. Use the new > > INSDseqFormat.Terms.getOtherSeqIdTerm() term to find them. > > > > cheers, > > Richard > > > > On Wed, 2006-06-07 at 19:48 -0400, Seth Johnson wrote: > > > Hi Richard, > > > > > > I still cannot locate the GI number for the main > > sequence. 
After I > > > parse it with readINSDseqDNA, I then use: > > > > > > Note [] myAccs = > > ((RichAnnotation)rs.getAnnotation > > > ()).getProperties( Terms.getAdditionalAccessionTerm ()); > > > > > > However, the 'myAccs' appears to be empty. Am I on the > > wrong track to > > > get to other-seqids??? > > > > > > On 6/6/06, Richard Holland < richard.holland at ebi.ac.uk> > > wrote: > > > GenBank has a separate line for GI number, so it can > > be parsed > > > out > > > nicely. INSDseq does not, so you have to rely on the > > other- > > > seqids tag > > > and hope that one of them is the GI number. However > > it seems I > > > have not > > > included that tag in the parser, so I will include > > it. This > > > will make > > > the other-seqids values available through the notes > > with the > > > term > > > Terms.getAdditionalAccessionTerm(), but > > getIdentifier() will > > > remain > > > null. > > > > > > For your second question, the tutorial makes the > > mistake in > > > several > > > places of saying getNoteSet( Terms.blahblah()). This > > was > > > shorthand for: > > > > > > rs.getAnnotation().getProperty(Terms.blahblah()) > > > (for single values) > > > > > > or > > > > > > ((RichAnnotation)rs.getAnnotation()).getProperties > > > ( Terms.blahblah ()) > > > (for multiple values) > > > > > > but never got expanded. Maybe someone can fix that > > one > > > day... :)ded... > > > > > > I'm just updating INSDseq to 1.4 now. The guys next > > door gave > > > me the > > > details of the changes, and told me that 1.3 is > > actually no > > > longer > > > supported by them after Friday this week! So I'll > > make it 1.4 > > > only. 
> > > > > > cheers, > > > Richard > > > > > -- > > Richard Holland (BioMart Team) > > EMBL-EBI > > Wellcome Trust Genome Campus > > Hinxton > > Cambridge CB10 1SD > > UNITED KINGDOM > > Tel: +44-(0)1223-494416 > > > > > > > > > > -- > > Best Regards, > > > > > > Seth Johnson > > Senior Bioinformatics Associate > > > > Ph: (202) 470-0900 > > Fx: (775) 251-0358 > -- > Richard Holland (BioMart Team) > EMBL-EBI > Wellcome Trust Genome Campus > Hinxton > Cambridge CB10 1SD > UNITED KINGDOM > Tel: +44-(0)1223-494416 > > -- Best Regards, Seth Johnson Senior Bioinformatics Associate Ph: (202) 470-0900 Fx: (775) 251-0358 -- Best Regards, Seth Johnson Senior Bioinformatics Associate Ph: (202) 470-0900 Fx: (775) 251-0358 From duarte at molgen.mpg.de Fri Jun 16 11:31:42 2006 From: duarte at molgen.mpg.de (Jose Duarte) Date: Fri, 16 Jun 2006 17:31:42 +0200 Subject: [Biojava-l] Blast xml parsing In-Reply-To: <60513.141.42.56.114.1149021449.squirrel@webmail.charite.de> References: <60513.141.42.56.114.1149021449.squirrel@webmail.charite.de> Message-ID: <4492CEDE.6010107@molgen.mpg.de> I am a newbie to biojava so sorry if questions are too simple! I am trying to use Biojava's blast xml parsing following the cookbook from biojava's web. So far I managed to parse correctly the xml output from blast getting the SeqSimilaritySearchResult object and then the SeqSimilaritySearchHit and SeqSimilaritySearchSubHit. From those I could get bit scores, a SimpleAlignment object, sequences as SymbolList objects and all kinds of things. My question is how to get the percentage identity as well as all of those. That must be obvious but I've been looking around and couldn't find how. Any pointers appreciated Thanks Jose ---- Jose M. Duarte Max Planck Institute for Molecular Genetics Ihnestr. 
63-73 14195 Berlin Germany

From mark.schreiber at novartis.com Sun Jun 18 22:46:58 2006
From: mark.schreiber at novartis.com (mark.schreiber at novartis.com)
Date: Mon, 19 Jun 2006 10:46:58 +0800
Subject: [Biojava-l] Blast xml parsing
Message-ID:

Hi -

I'm not sure where the percent identity gets sent, but you can find out by using the example code (http://biojava.org/wiki/BioJava:CookBook:Blast:Echo).

It is also a nice code base for making a custom BLAST parser that is not so object-heavy.

- Mark

Mark Schreiber
Research Investigator (Bioinformatics)
Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com
phone +65 6722 2973
fax +65 6722 2910

Jose Duarte
Sent by: biojava-l-bounces at lists.open-bio.org
06/16/2006 11:31 PM

To: biojava-l at biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] Blast xml parsing

I am a newbie to biojava so sorry if questions are too simple!

I am trying to use Biojava's blast xml parsing following the cookbook from biojava's web. So far I managed to parse correctly the xml output from blast getting the SeqSimilaritySearchResult object and then the SeqSimilaritySearchHit and SeqSimilaritySearchSubHit. From those I could get bit scores, a SimpleAlignment object, sequences as SymbolList objects and all kinds of things.

My question is how to get the percentage identity as well as all of those. That must be obvious but I've been looking around and couldn't find how.

Any pointers appreciated

Thanks

Jose

----
Jose M. Duarte
Max Planck Institute for Molecular Genetics
Ihnestr.
63-73 14195 Berlin Germany

_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From duarte at molgen.mpg.de Mon Jun 19 07:08:07 2006
From: duarte at molgen.mpg.de (Jose Duarte)
Date: Mon, 19 Jun 2006 13:08:07 +0200
Subject: [Biojava-l] Blast xml parsing
In-Reply-To:
References:
Message-ID: <44968597.6090803@molgen.mpg.de>

mark.schreiber at novartis.com wrote:

>Hi -=
>
>I'm not sure where the percent identity gets sent but you can find out by
>using the example code
>(http://biojava.org/wiki/BioJava:CookBook:Blast:Echo).
>
>It is also a nice code base for making a custom blast parser that is not
>so object heavy.
>
>
Thanks, that sounds good. However, I tried to run the BlastEcho.java code and got the following error:

Exception in thread "main" org.xml.sax.SAXException: Could not recognise the format of this file as one supported by the framework.
    at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:182)
    at BlastEcho.echo(BlastEcho.java:29)
    at BlastEcho.main(BlastEcho.java:75)

I am pretty sure that my XML file is well-formatted BLAST output. In fact, I have already been parsing it without problems using code based on BioJava's BlastXMLParser. My BLAST version is blastp 2.2.10.

This also happens even though BlastEcho calls the setModeLazy() method of the parser object. As I understand it, the parser shouldn't be checking versions in this mode.

Does anybody know what might be wrong here, or how I could get around this problem?
Thanks in advance

Jose

From duarte at molgen.mpg.de Mon Jun 19 08:35:14 2006
From: duarte at molgen.mpg.de (Jose Duarte)
Date: Mon, 19 Jun 2006 14:35:14 +0200
Subject: [Biojava-l] Blast xml parsing
In-Reply-To: <44968597.6090803@molgen.mpg.de>
References: <44968597.6090803@molgen.mpg.de>
Message-ID: <44969A02.8050909@molgen.mpg.de>

Jose Duarte wrote:

>mark.schreiber at novartis.com wrote:
>
>
>
>>Hi -=
>>
>>I'm not sure where the percent identity gets sent but you can find out by
>>using the example code
>>(http://biojava.org/wiki/BioJava:CookBook:Blast:Echo).
>>
>>It is also a nice code base for making a custom blast parser that is not
>>so object heavy.
>>
>>
>Thanks, that sounds good. However I have tried to run the BlastEcho.java
>code and got following error:
>
>Exception in thread "main" org.xml.sax.SAXException: Could not recognise
>the format of this file as one supported by the framework.
>    at
>org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:182)
>    at BlastEcho.echo(BlastEcho.java:29)
>    at BlastEcho.main(BlastEcho.java:75)
>
>I am pretty sure that my xml file is well formated blast output.
>Actually I've been parsing it already with code using biojava's
>BlastXMLParser without problems. My blast version is blastp 2.2.10.
>
>Also this is happening despite BlastEcho calling the setModeLazy()
>method of the parser object. As I understand it shouldn't be checking
>for versions using this mode.
>
>
I think I can answer my own question now. I just found that after replacing line 17 in BlastEcho.java:

    BlastLikeSAXParser parser = new BlastLikeSAXParser();

with:

    BlastXMLParserFacade parser = new BlastXMLParserFacade();

it all works fine.

I don't know what the difference is between the BlastLikeSAXParser and BlastXMLParserFacade classes, but the replacement seems to work just as well. I can change the code in the wiki if somebody can confirm this is correct.
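For anyone who hits the same SAXException, here is a minimal sketch of checking up front whether a BLAST result is XML or plain text before picking a parser. It is plain Java and makes no BioJava calls; the names BlastFormatSniffer and looksLikeXml are made up for illustration, and the test (first non-whitespace character is '<') is only a heuristic, not how BioJava itself detects formats.

```java
public class BlastFormatSniffer {

    /**
     * Heuristic check on the beginning of a BLAST result:
     * returns true if the first non-whitespace character is '<'.
     * Hypothetical helper, not part of BioJava.
     */
    public static boolean looksLikeXml(String head) {
        for (int i = 0; i < head.length(); i++) {
            char c = head.charAt(i);
            if (!Character.isWhitespace(c)) {
                return c == '<';
            }
        }
        return false; // empty or whitespace-only input
    }

    public static void main(String[] args) {
        String xmlHead = "<?xml version=\"1.0\"?>\n<BlastOutput>";
        String textHead = "BLASTP 2.2.10";

        // An application would pick the XML parser for the first case
        // and the plain-text parser for the second.
        System.out.println("xml input  -> " + looksLikeXml(xmlHead));
        System.out.println("text input -> " + looksLikeXml(textHead));
    }
}
```

In practice the string passed in would be the first few hundred characters of the result file.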
Cheers

Jose

From anderson.moura at telemar-rj.com.br Mon Jun 19 09:29:31 2006
From: anderson.moura at telemar-rj.com.br (Anderson Moura da Silva)
Date: Mon, 19 Jun 2006 10:29:31 -0300
Subject: [Biojava-l] Search on net
Message-ID: <3C39C09ED334F243838953854BE43FB602EAE67C@MAILBX02.telemar.corp.net>

Hi everybody,

Is there a way to get a sequence online using BioJava by entering the name of, or a reference for, the sequence?

Thanks a lot
Anderson Moura - Brasil

This message, including its attachments, may contain privileged and/or confidential information and may not be forwarded without the sender's authorization. If you are not the addressee or a person authorized to receive it, please note that its use, disclosure, copying or archiving is prohibited. Therefore, if you received this message in error, please let us know by replying to this e-mail immediately and then delete it.

From richard.holland at ebi.ac.uk Mon Jun 19 10:03:43 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Mon, 19 Jun 2006 15:03:43 +0100
Subject: [Biojava-l] Search on net
In-Reply-To: <3C39C09ED334F243838953854BE43FB602EAE67C@MAILBX02.telemar.corp.net>
References: <3C39C09ED334F243838953854BE43FB602EAE67C@MAILBX02.telemar.corp.net>
Message-ID: <1150725823.3948.33.camel@texas.ebi.ac.uk>

Take a look at org.biojavax.bio.db.ncbi.GenbankRichSequenceDB.

If this doesn't do what you want, you can always grab records using the Entrez e-utils from NCBI (here: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html ) and then parse the resulting GenBank records using BJX (BioJavaX).

cheers,
Richard

On Mon, 2006-06-19 at 10:29 -0300, Anderson Moura da Silva wrote:
> Hi everybody,
>
> Is there a way to get a sequence online using biojava entering the name or a reference for the sequence?
>
>
> Thanks a lot
> Anderson Moura - Brasil
>
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From debesis at gmail.com Mon Jun 19 09:33:39 2006
From: debesis at gmail.com (bėgantis debesis)
Date: Mon, 19 Jun 2006 16:33:39 +0300
Subject: [Biojava-l] Blast xml parsing
In-Reply-To: <44969A02.8050909@molgen.mpg.de>
References: <44968597.6090803@molgen.mpg.de> <44969A02.8050909@molgen.mpg.de>
Message-ID:

Hi,

I think BlastLikeSAXParser is a parser for ordinary (non-XML) BLAST output files. It cannot recognize XML. BlastXMLParserFacade is for XML parsing.

Another question: does BioJava have any support for parsing PSI-BLAST results?

Thanks,
Valdemaras Repšys

On 6/19/06, Jose Duarte wrote:
>
> Jose Duarte wrote:
>
> >mark.schreiber at novartis.com wrote:
> >
> >
> >
> >>Hi -=
> >>
> >>I'm not sure where the percent identity gets sent but you can find out
> by
> >>using the example code
> >>(http://biojava.org/wiki/BioJava:CookBook:Blast:Echo).
> >>
> >>It is also a nice code base for making a custom blast parser that is not
> >>so object heavy.
> >>
> >>
> >Thanks, that sounds good.
However I have tried to run the BlastEcho.java > >code and got following error: > > > >Exception in thread "main" org.xml.sax.SAXException: Could not recognise > >the format of this file as one supported by the framework. > > at > >org.biojava.bio.program.sax.BlastLikeSAXParser.parse( > BlastLikeSAXParser.java:182) > > at BlastEcho.echo(BlastEcho.java:29) > > at BlastEcho.main(BlastEcho.java:75) > > > >I am pretty sure that my xml file is well formated blast output. > >Actually I've been parsing it already with code using biojava's > >BlastXMLParser without problems. My blast version is blastp 2.2.10. > > > >Also this is happening despite BlastEcho calling the setModeLazy() > >method of the parser object. As I understand it shouldn't be checking > >for versions using this mode. > > > > > > I think I can answer my question now. I just found out that for some > reason replacing line 17 in BlastEcho.java: > > BlastLikeSAXParser parser = new BlastLikeSAXParser(); > > for: > > BlastXMLParserFacade parser = new BlastXMLParserFacade(); > > it all works fine. > > I have no idea what's the difference between the BlastLikeSAXParser and > the BlastXMLParserFacade classes but it looks as it works as good. I can > change the code in the wiki if somebody can confirm this is correct. 
>
> Cheers
>
> Jose
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From jolyon.holdstock at ogt.co.uk Mon Jun 19 10:24:21 2006
From: jolyon.holdstock at ogt.co.uk (Jolyon Holdstock)
Date: Mon, 19 Jun 2006 15:24:21 +0100
Subject: [Biojava-l] Search on net[Scanned]
Message-ID: <588D0DD225D05746B5D8CAE1BE971F3FDA74F0@EUCLID.internal.ogtip.com>

This might be the class you can use:

org.biojava.bio.seq.db.NCBISequenceDB

-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Anderson Moura da Silva
Sent: 19 June 2006 14:30
To: biojava-l at lists.open-bio.org
Subject: [Biojava-l] Search on net[Scanned]

Hi everybody,

Is there a way to get a sequence online using biojava entering the name or a reference for the sequence?

Thanks a lot
Anderson Moura - Brasil

_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

This email has been scanned by Oxford Gene Technology Security Systems.

From mark.schreiber at novartis.com Mon Jun 19 22:51:57 2006
From: mark.schreiber at novartis.com (mark.schreiber at novartis.com)
Date: Tue, 20 Jun 2006 10:51:57 +0800
Subject: [Biojava-l] Blast xml parsing
Message-ID:

Hi Jose -

The original recipe parses the standard text output of BLAST. The modification you made is correct and allows it to parse the XML output.
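On the original question in this thread (percent identity): if no ready-made accessor turns up, the value can be computed from the two aligned strings of a sub-hit, since BLAST reports it as identical columns divided by alignment length. The following is a minimal sketch in plain Java with no BioJava calls; PercentIdentity and percentIdentity are hypothetical names, and it assumes both strings have the same length, gaps are written as '-', and case is ignored.

```java
public class PercentIdentity {

    /**
     * Percent identity of two gapped, equal-length alignment rows:
     * 100 * (columns with the same residue) / (alignment length).
     * Hypothetical helper, not part of BioJava.
     */
    public static double percentIdentity(String query, String hit) {
        if (query.length() != hit.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        if (query.isEmpty()) {
            return 0.0;
        }
        int identical = 0;
        for (int i = 0; i < query.length(); i++) {
            char q = Character.toUpperCase(query.charAt(i));
            char h = Character.toUpperCase(hit.charAt(i));
            // A gap never counts as an identity, even against another gap.
            if (q == h && q != '-') {
                identical++;
            }
        }
        return 100.0 * identical / query.length();
    }

    public static void main(String[] args) {
        // 5 of the 7 columns hold the same residue.
        System.out.println(percentIdentity("MYKR-KT", "MYKRAKS"));
    }
}
```

The aligned rows would come from whatever the parser exposes for a sub-hit (for example the alignment object mentioned earlier in the thread), converted to strings.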
- Mark Jose Duarte Sent by: biojava-l-bounces at lists.open-bio.org 06/19/2006 08:35 PM To: Jose Duarte cc: biojava-l at biojava.org, mark.schreiber at novartis.com Subject: Re: [Biojava-l] Blast xml parsing Jose Duarte wrote: >mark.schreiber at novartis.com wrote: > > > >>Hi -= >> >>I'm not sure where the percent identity gets sent but you can find out by >>using the example code >>(http://biojava.org/wiki/BioJava:CookBook:Blast:Echo). >> >>It is also a nice code base for making a custom blast parser that is not >>so object heavy. >> >> >Thanks, that sounds good. However I have tried to run the BlastEcho.java >code and got following error: > >Exception in thread "main" org.xml.sax.SAXException: Could not recognise >the format of this file as one supported by the framework. > at >org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:182) > at BlastEcho.echo(BlastEcho.java:29) > at BlastEcho.main(BlastEcho.java:75) > >I am pretty sure that my xml file is well formated blast output. >Actually I've been parsing it already with code using biojava's >BlastXMLParser without problems. My blast version is blastp 2.2.10. > >Also this is happening despite BlastEcho calling the setModeLazy() >method of the parser object. As I understand it shouldn't be checking >for versions using this mode. > > I think I can answer my question now. I just found out that for some reason replacing line 17 in BlastEcho.java: BlastLikeSAXParser parser = new BlastLikeSAXParser(); for: BlastXMLParserFacade parser = new BlastXMLParserFacade(); it all works fine. I have no idea what's the difference between the BlastLikeSAXParser and the BlastXMLParserFacade classes but it looks as it works as good. I can change the code in the wiki if somebody can confirm this is correct. 
Cheers

Jose

_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From mark.schreiber at novartis.com Mon Jun 19 22:55:45 2006
From: mark.schreiber at novartis.com (mark.schreiber at novartis.com)
Date: Tue, 20 Jun 2006 10:55:45 +0800
Subject: [Biojava-l] Search on net[Scanned]
Message-ID:

If you are using biojava 1.4, then use the class Jolyon recommends. If you are using biojava-live from CVS, I would strongly recommend using the newer class org.biojavax.bio.db.ncbi.GenbankRichSequenceDB. It produces RichSequences with more structured information.

org.biojava.bio.seq.db.NCBISequenceDB will be deprecated as of biojava 1.5.

- Mark

"Jolyon Holdstock"
Sent by: biojava-l-bounces at lists.open-bio.org
06/19/2006 10:24 PM

To: "Anderson Moura da Silva" ,
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-l] Search on net[Scanned]

This might be the class you can use:

org.biojava.bio.seq.db.NCBISequenceDB

-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Anderson Moura da Silva
Sent: 19 June 2006 14:30
To: biojava-l at lists.open-bio.org
Subject: [Biojava-l] Search on net[Scanned]

Hi everybody,

Is there a way to get a sequence online using biojava entering the name or a reference for the sequence?

Thanks a lot
Anderson Moura - Brasil
_______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l This email has been scanned by Oxford Gene Technology Security Systems. _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l From walsh at andrew.cmu.edu Tue Jun 20 10:29:36 2006 From: walsh at andrew.cmu.edu (Andrew Walsh) Date: Tue, 20 Jun 2006 10:29:36 -0400 Subject: [Biojava-l] The Java sandbox and BioJava Message-ID: <44980650.9040007@andrew.cmu.edu> I am working on an application that I want to deliver via Java Web Start. This application uses BioJava 1.4 to do some basic processing of protein sequence files. Applications run via Java Web Start have the option of staying in the Java sandbox or requesting to have full access. I would like to provide both options to the user, but I am having trouble with the in-sandbox version. When the application tries to open a sequence file, it fails with the following error: java.lang.reflect.InvocationTargetException Caused by: org.biojava.bio.BioError: Couldn't locate AlphabetManager.xml. This probably means that your biojava.jar file is corrupt or incorrectly built. at org.biojava.bio.symbol.AlphabetManager.(AlphabetManager.java:1012) at org.biojava.bio.seq.ProteinTools.(ProteinTools.java:75) at org.biojava.bio.seq.io.MSFAlignmentFormat.read(MSFAlignmentFormat.java:187) at org.biojava.bio.seq.io.SeqIOTools.fileToAlign(SeqIOTools.java:1138) at org.biojava.bio.seq.io.SeqIOTools.fileToBiojava(SeqIOTools.java:940) at org.biojava.bio.seq.io.SeqIOTools.fileToBiojava(SeqIOTools.java:908) at org.msaviewer.tools.SequenceFileReader.readAlignmentFile(Unknown Source) at org.msaviewer.MSAViewer.openSafeSession(MSAViewer.java:295) at org.msaviewer.MSAViewer.main(MSAViewer.java:1334) ... 
11 more

It would appear that the way in which the AlphabetManager reads AlphabetManager.xml from the BioJava jar file is not compatible with the Java Web Start sandbox restrictions. I skimmed the relevant sections of the AlphabetManager code and didn't see an obvious solution. Does anyone have any experience with getting BioJava to work inside the sandbox that might be able to suggest a fix to this problem?

Thanks,
Andy Walsh
Postdoctoral Fellow
Language Technologies Institute
Carnegie Mellon University

From david at autohandle.com Tue Jun 20 20:22:04 2006
From: david at autohandle.com (David Scott)
Date: Tue, 20 Jun 2006 17:22:04 -0700
Subject: [Biojava-l] add isTaxonHidden to NCBITaxon
Message-ID: <4498912C.4040109@autohandle.com>

in the genbank ORGANISM line where the taxonomy hierarchy is shown, not all levels are shown by genbank. whether a level is shown or not is controlled by the isTaxonHidden flag in the genbank taxonomy file: "nodes.dmp".

biosql does not currently provide for an isTaxonHidden field in the sg_taxon table. the table can be modified and the field added locally.

this local modification would be easier to make if isTaxonHidden were added to NCBITaxon and SimpleNCBITaxon - modeling the methods after a similar field: ComparableTerm.getObsolete:

NCBITaxon:
    public boolean isTaxonHidden();
    public void setTaxonHidden(final boolean isHidden) throws ChangeVetoException;

SimpleNCBITaxon:
    private boolean isTaxonHidden=false;
    // for user
    public boolean isTaxonHidden() { ... }
    public void setTaxonHidden(final boolean isHidden) throws ChangeVetoException { ... }
    // for hibernate
    private String getTaxonHiddenChar() { ... }
    private void setTaxonHiddenChar(final String isHiddenChar) throws ChangeVetoException { ... }

SimpleNCBITaxonomyLoader:
    public NCBITaxon readNode(...
    ...
    final String isTaxonHidden = parts[10].trim(); // either "0" or "1"
    ....
    try {
        ...
        t.setTaxonHidden(Integer.parseInt(isTaxonHidden)==1);
        ...
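to make the parts[10] index concrete, here is a small sketch that pulls the flag out of a single nodes.dmp row. it is plain Java with no BioJava classes; NodesDmpLine and isGenbankHidden are hypothetical names, the sample row in main is only illustrative, and the sketch assumes the tab-pipe-tab field separator of the NCBI taxonomy dump with the GenBank hidden flag as the eleventh field.

```java
public class NodesDmpLine {

    // Zero-based index of the "GenBank hidden flag" column (assumed layout).
    private static final int HIDDEN_FLAG_INDEX = 10;

    /**
     * Returns true if the hidden flag of one nodes.dmp row is "1".
     * Hypothetical helper, not part of BioJava.
     */
    public static boolean isGenbankHidden(String row) {
        String line = row;
        // Rows end with a trailing "\t|"; drop it before splitting.
        if (line.endsWith("\t|")) {
            line = line.substring(0, line.length() - 2);
        }
        // Fields are separated by tab, pipe, tab; -1 keeps empty trailing fields.
        String[] parts = line.split("\t\\|\t", -1);
        if (parts.length <= HIDDEN_FLAG_INDEX) {
            throw new IllegalArgumentException("too few fields: " + parts.length);
        }
        return Integer.parseInt(parts[HIDDEN_FLAG_INDEX].trim()) == 1;
    }

    public static void main(String[] args) {
        // Illustrative row; the eleventh field (index 10) is "1".
        String row = "9606\t|\t9605\t|\tspecies\t|\tHS\t|\t5\t|\t1\t|\t1\t|\t1\t|\t2\t|\t1\t|\t1\t|\t0\t|\t\t|";
        System.out.println("hidden: " + isGenbankHidden(row));
    }
}
```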
the code would continue to operate as it has - until the database has the additional field added locally and the hibernate mapping file modified to map the new field into the Taxon object. From mark.schreiber at novartis.com Tue Jun 20 22:29:04 2006 From: mark.schreiber at novartis.com (mark.schreiber at novartis.com) Date: Wed, 21 Jun 2006 10:29:04 +0800 Subject: [Biojava-l] The Java sandbox and BioJava Message-ID: I think I remember something similar happening with Applets a long time ago. I don't recall the solution. Christophe Gille uses biojava and webstart for the STRAP project. He may be able to offer advice (www.charite.de/bioinf/strap) - Mark Andrew Walsh Sent by: biojava-l-bounces at lists.open-bio.org 06/20/2006 10:29 PM To: biojava-l at lists.open-bio.org cc: (bcc: Mark Schreiber/GP/Novartis) Subject: [Biojava-l] The Java sandbox and BioJava I am working on an application that I want to deliver via Java Web Start. This application uses BioJava 1.4 to do some basic processing of protein sequence files. Applications run via Java Web Start have the option of staying in the Java sandbox or requesting to have full access. I would like to provide both options to the user, but I am having trouble with the in-sandbox version. When the application tries to open a sequence file, it fails with the following error: java.lang.reflect.InvocationTargetException Caused by: org.biojava.bio.BioError: Couldn't locate AlphabetManager.xml. This probably means that your biojava.jar file is corrupt or incorrectly built. 
at org.biojava.bio.symbol.AlphabetManager.(AlphabetManager.java:1012) at org.biojava.bio.seq.ProteinTools.(ProteinTools.java:75) at org.biojava.bio.seq.io.MSFAlignmentFormat.read(MSFAlignmentFormat.java:187) at org.biojava.bio.seq.io.SeqIOTools.fileToAlign(SeqIOTools.java:1138) at org.biojava.bio.seq.io.SeqIOTools.fileToBiojava(SeqIOTools.java:940) at org.biojava.bio.seq.io.SeqIOTools.fileToBiojava(SeqIOTools.java:908) at org.msaviewer.tools.SequenceFileReader.readAlignmentFile(Unknown Source) at org.msaviewer.MSAViewer.openSafeSession(MSAViewer.java:295) at org.msaviewer.MSAViewer.main(MSAViewer.java:1334) ... 11 more It would appear that the way in which the AlphabetManager reads AlphabetManager.xml from the BioJava jar file is not compatible with the Java Web Start sandbox restrictions. I skimmed the relevant sections of the AlphabetManager code and didn't see an obvious solution. Does anyone have any experience with getting BioJava to work inside the sandbox that might be able to suggest a fix to this problem? Thanks, Andy Walsh Postdoctoral Fellow Language Technologies Institute Carnegie Mellon University _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l From richard.holland at ebi.ac.uk Wed Jun 21 04:17:42 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Wed, 21 Jun 2006 09:17:42 +0100 Subject: [Biojava-l] The Java sandbox and BioJava In-Reply-To: <44980650.9040007@andrew.cmu.edu> References: <44980650.9040007@andrew.cmu.edu> Message-ID: <1150877863.3948.49.camel@texas.ebi.ac.uk> Not sure exactly, but this might help: http://java.sun.com/j2se/1.5.0/docs/guide/javaws/developersguide/faq.html#211 cheers, Richard On Tue, 2006-06-20 at 10:29 -0400, Andrew Walsh wrote: > I am working on an application that I want to deliver via Java Web > Start. This application uses BioJava 1.4 to do some basic processing of > protein sequence files. 
Applications run via Java Web Start have the > option of staying in the Java sandbox or requesting to have full > access. I would like to provide both options to the user, but I am > having trouble with the in-sandbox version. When the application tries > to open a sequence file, it fails with the following error: > > java.lang.reflect.InvocationTargetException > > Caused by: org.biojava.bio.BioError: Couldn't locate > AlphabetManager.xml. This probably means that your biojava.jar file is > corrupt or incorrectly built. > at > org.biojava.bio.symbol.AlphabetManager.(AlphabetManager.java:1012) > at org.biojava.bio.seq.ProteinTools.(ProteinTools.java:75) > at > org.biojava.bio.seq.io.MSFAlignmentFormat.read(MSFAlignmentFormat.java:187) > at org.biojava.bio.seq.io.SeqIOTools.fileToAlign(SeqIOTools.java:1138) > at org.biojava.bio.seq.io.SeqIOTools.fileToBiojava(SeqIOTools.java:940) > at org.biojava.bio.seq.io.SeqIOTools.fileToBiojava(SeqIOTools.java:908) > at org.msaviewer.tools.SequenceFileReader.readAlignmentFile(Unknown > Source) > at org.msaviewer.MSAViewer.openSafeSession(MSAViewer.java:295) > at org.msaviewer.MSAViewer.main(MSAViewer.java:1334) > ... 11 more > > It would appear that the way in which the AlphabetManager reads > AlphabetManager.xml from the BioJava jar file is not compatible with the > Java Web Start sandbox restrictions. I skimmed the relevant sections of > the AlphabetManager code and didn't see an obvious solution. Does > anyone have any experience with getting BioJava to work inside the > sandbox that might be able to suggest a fix to this problem? 
>
> Thanks,
> Andy Walsh
> Postdoctoral Fellow
> Language Technologies Institute
> Carnegie Mellon University
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From david at autohandle.com Wed Jun 21 11:47:29 2006
From: david at autohandle.com (David Scott)
Date: Wed, 21 Jun 2006 08:47:29 -0700
Subject: [Biojava-l] proposal: clear LRUCache when connectToBioSQL is called
Message-ID: <44996A11.6040809@autohandle.com>

if the RichObjectFactory is used standalone or used with another session - when a new session is started the cache may have invalid objects - the LRUCache can be cleared when the session is set.

    public static void connectToBioSQL(Object session) {
        ...
        clearLRUCache();

From david at autohandle.com Wed Jun 21 11:50:35 2006
From: david at autohandle.com (David Scott)
Date: Wed, 21 Jun 2006 08:50:35 -0700
Subject: [Biojava-l] proposal: application subclassing of biosql objects
Message-ID: <44996ACB.2030706@autohandle.com>

install a static Map in RichObjectFactory that the application can use to map application subclasses of the biosql classes contained in the RichObjectFactory code - applications would set up the map via:

    public final static void setApplicationClass(final Class theBioJavaClass, final Class theApplicationClass)

RichObjectFactory would retrieve the mapped class via:

    private final static Class getApplicationClass(final Class theBioJavaClass);

RichObjectFactory would substitute the application class for the biojava class in getObject:

    public static synchronized Object getObject(final Class clazz, Object[] params) {
        List paramsList = Arrays.asList(params);
        final Class applicationClass = getApplicationClass(clazz);
        ....
        ....
builder.buildObject(applicationClass, paramsList); BioSqlRichObjectBuilder would recognize the subclass in buildObject by changing the if statements from identity: public Object buildObject(Clazz clazz, List paramsList) { if (clazz == SimpleNamespace.class) { ... to assignable: public Object buildObject(Clazz clazz, List paramsList) { if (SimpleNamespace.class.isAssignableFrom(clazz)) { ... the map size will be short - a linear search might be faster than a hash. From walsh at andrew.cmu.edu Wed Jun 21 13:38:42 2006 From: walsh at andrew.cmu.edu (Andrew Walsh) Date: Wed, 21 Jun 2006 13:38:42 -0400 Subject: [Biojava-l] The Java sandbox and BioJava In-Reply-To: References: Message-ID: <44998422.6040307@andrew.cmu.edu> Thanks to Richard and Mark for the replies and suggestions. As it turns out, the BioJava code accesses AlphabetManager.xml in a perfectly safe, "in the sandbox" way, which is why I couldn't see a problem there. The problem was actually that I had a copy of biojava.jar in the "ext" folder of my local Java run time environment. This location apparently supersedes any libraries provided remotely by the application in the Web Start framework, so Java was loading the classes from there. Then when the ClassLoader tried to get the resource, it went looking in the jar file in the local filesystem, which is not allowed from inside the sandbox. After removing the unnecessary biojava.jar from the "ext" folder, everything worked just fine. Thanks again for the help, and thanks for the developers for writing the code correctly in the first place! -Andy mark.schreiber at novartis.com wrote: > I think I remember something similar happening with Applets a long time > ago. I don't recall the solution. Christophe Gille uses biojava and > webstart for the STRAP project. 
He may be able to offer advice > (www.charite.de/bioinf/strap) > > - Mark > > > > > > Andrew Walsh > Sent by: biojava-l-bounces at lists.open-bio.org > 06/20/2006 10:29 PM > > > To: biojava-l at lists.open-bio.org > cc: (bcc: Mark Schreiber/GP/Novartis) > Subject: [Biojava-l] The Java sandbox and BioJava > > > I am working on an application that I want to deliver via Java Web > Start. This application uses BioJava 1.4 to do some basic processing of > protein sequence files. Applications run via Java Web Start have the > option of staying in the Java sandbox or requesting to have full > access. I would like to provide both options to the user, but I am > having trouble with the in-sandbox version. When the application tries > to open a sequence file, it fails with the following error: > > java.lang.reflect.InvocationTargetException > > Caused by: org.biojava.bio.BioError: Couldn't locate > AlphabetManager.xml. This probably means that your biojava.jar file is > corrupt or incorrectly built. > at > org.biojava.bio.symbol.AlphabetManager.(AlphabetManager.java:1012) > at org.biojava.bio.seq.ProteinTools.(ProteinTools.java:75) > at > org.biojava.bio.seq.io.MSFAlignmentFormat.read(MSFAlignmentFormat.java:187) > at org.biojava.bio.seq.io.SeqIOTools.fileToAlign(SeqIOTools.java:1138) > at > org.biojava.bio.seq.io.SeqIOTools.fileToBiojava(SeqIOTools.java:940) > at > org.biojava.bio.seq.io.SeqIOTools.fileToBiojava(SeqIOTools.java:908) > at org.msaviewer.tools.SequenceFileReader.readAlignmentFile(Unknown > Source) > at org.msaviewer.MSAViewer.openSafeSession(MSAViewer.java:295) > at org.msaviewer.MSAViewer.main(MSAViewer.java:1334) > ... 11 more > > It would appear that the way in which the AlphabetManager reads > AlphabetManager.xml from the BioJava jar file is not compatible with the > Java Web Start sandbox restrictions. I skimmed the relevant sections of > the AlphabetManager code and didn't see an obvious solution. 
Does > anyone have any experience with getting BioJava to work inside the > sandbox that might be able to suggest a fix to this problem? > > Thanks, > Andy Walsh > Postdoctoral Fellow > Language Technologies Institute > Carnegie Mellon University > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > From mark.schreiber at novartis.com Wed Jun 21 20:43:51 2006 From: mark.schreiber at novartis.com (mark.schreiber at novartis.com) Date: Thu, 22 Jun 2006 08:43:51 +0800 Subject: [Biojava-l] proposal: clear LRUCache when connectToBioSQL is called Message-ID: I'm OK with this. David Scott Sent by: biojava-l-bounces at lists.open-bio.org 06/21/2006 11:47 PM To: biojava-l at lists.open-bio.org cc: (bcc: Mark Schreiber/GP/Novartis) Subject: [Biojava-l] proposal: clear LRUCache when connectToBioSQL is called if the RichObjectFactory is used standalone or used with another session - when a new session is started the cache may have invalid objects - the LRUCache can be cleared when the session is set. public static void connectToBioSQL(Object session) { ... clearLRUCache(); _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l From mark.schreiber at novartis.com Wed Jun 21 20:44:18 2006 From: mark.schreiber at novartis.com (mark.schreiber at novartis.com) Date: Thu, 22 Jun 2006 08:44:18 +0800 Subject: [Biojava-l] proposal: application subclassing of biosql objects Message-ID: OK with this too. 
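As a rough illustration of the subclass-mapping proposal in this thread, the sketch below shows a static registry together with an isAssignableFrom check, using plain Java classes only. It is not BioJava code: ApplicationClassRegistry, BaseNamespace, MyNamespace and describe are hypothetical stand-ins, and only the method names setApplicationClass and getApplicationClass are taken from the proposal.

```java
import java.util.HashMap;
import java.util.Map;

final class ApplicationClassRegistry {
    // Hypothetical stand-ins for a library class and an application subclass.
    static class BaseNamespace { }
    static class MyNamespace extends BaseNamespace { }

    // Maps a library class to the application subclass that should replace it.
    private static final Map<Class<?>, Class<?>> APPLICATION_CLASSES =
            new HashMap<Class<?>, Class<?>>();

    private ApplicationClassRegistry() { }

    static synchronized void setApplicationClass(Class<?> libraryClass, Class<?> applicationClass) {
        if (!libraryClass.isAssignableFrom(applicationClass)) {
            throw new IllegalArgumentException(
                    applicationClass.getName() + " is not a subclass of " + libraryClass.getName());
        }
        APPLICATION_CLASSES.put(libraryClass, applicationClass);
    }

    // Returns the registered subclass, or the library class itself if none is registered.
    static synchronized Class<?> getApplicationClass(Class<?> libraryClass) {
        Class<?> mapped = APPLICATION_CLASSES.get(libraryClass);
        return mapped != null ? mapped : libraryClass;
    }

    // An identity test (clazz == BaseNamespace.class) would reject MyNamespace;
    // the assignability test accepts the class and all of its subclasses.
    static String describe(Class<?> clazz) {
        if (BaseNamespace.class.isAssignableFrom(clazz)) {
            return "namespace";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        setApplicationClass(BaseNamespace.class, MyNamespace.class);
        Class<?> resolved = getApplicationClass(BaseNamespace.class);
        System.out.println(resolved.getSimpleName() + " -> " + describe(resolved));
    }
}
```

A HashMap is used here for brevity; with only a handful of entries, the linear search suggested in the proposal would behave the same way.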
David Scott Sent by: biojava-l-bounces at lists.open-bio.org 06/21/2006 11:50 PM To: biojava-l at lists.open-bio.org cc: (bcc: Mark Schreiber/GP/Novartis) Subject: [Biojava-l] proposal: application subclassing of biosql objects install a static Map in RichObjectFactory that the application can use to map application subclasses of the biosql classes contained in the RichObjectFactory code - applications would set up the map via: public final static void setApplicationClass(final Class theBioJavaClass, final Class theApplicationClass) RichObjectFactory would retrieve the Map via: private final static class getApplicationClass(final Class theBioJavaClass); RichObjectFactory would substitute the application class for the biojava class in getObject: public static synchronized Object getObject(final Class clazz, Object[] params) { List paramsList = Arrays.asList(params); final Class applicationClass = getApplicationClass(clazz); .... .... builder.buildObject(applicationClass, paramsList); BioSqlRichObjectBuilder would recognize the subclass in buildObject by changing the if statements from identity: public Object buildObject(Clazz clazz, List paramsList) { if (clazz == SimpleNamespace.class) { ... to assignable: public Object buildObject(Clazz clazz, List paramsList) { if (SimpleNamespace.class.isAssignableFrom(clazz)) { ... the map size will be short - a linear search might be faster than a hash. _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l From edbeaty at charter.net Mon Jun 26 10:16:39 2006 From: edbeaty at charter.net (Dexter Riley) Date: Mon, 26 Jun 2006 07:16:39 -0700 (PDT) Subject: [Biojava-l] Getting a Slice of an Alignment Message-ID: <5047818.post@talk.nabble.com> Hello. I have a FlexibleAlignment of 20 sequences, and want to get a slice of it: >Seq1 nactatcgg...atcagcgtatctgac >Seq2 nactatcgg...atcagcgtatctgac ... 
>Seq19 nactatcgg...atcagcgtatctgac >Seq20 nactatcgg...atcagcgtatctgac So a slice of the Location(1,5) of this alignment should look like: >Seq1 nacta >Seq2 nacta ... >Seq19 nacta >Seq20 nacta How to do this? Alignment.subAlignment(null, Location(1,5)) returns an alignment containing all the full-length sequences (presumably because all the sequences have symbols between positions 1 and 5). Any suggestions would be greatly appreciated. Thanks, Ed -- View this message in context: http://www.nabble.com/Getting-a-Slice-of-an-Alignment-t1849222.html#a5047818 Sent from the BioJava forum at Nabble.com. From richard.holland at ebi.ac.uk Mon Jun 26 11:29:05 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Mon, 26 Jun 2006 16:29:05 +0100 Subject: [Biojava-l] Getting a Slice of an Alignment In-Reply-To: <5047818.post@talk.nabble.com> References: <5047818.post@talk.nabble.com> Message-ID: <1151335745.3938.40.camel@texas.ebi.ac.uk> I think what you're looking at here are the labels of the alignment. What you need to be looking at is a combination of the labels and the symbol lists mapped to each label by the alignment. The getLabels() method of a sub alignment will return you all the original sequences for that alignment, full-length. The symbolListForLabel(label) method of a sub-alignment will return only the symbols of the sequence that fall within the alignment. cheers, Richard On Mon, 2006-06-26 at 07:16 -0700, Dexter Riley wrote: > Hello. I have a FlexibleAlignment of 20 sequences, and want to get a slice > of it: > >Seq1 > nactatcgg...atcagcgtatctgac > >Seq2 > nactatcgg...atcagcgtatctgac > ... > >Seq19 > nactatcgg...atcagcgtatctgac > >Seq20 > nactatcgg...atcagcgtatctgac > > So a slice of the Location(1,5) of this alignment should look like: > >Seq1 > nacta > >Seq2 > nacta > ... > >Seq19 > nacta > >Seq20 > nacta > > How to do this? 
Alignment.subAlignment(null, Location(1,5)) returns an > alignment containing all the full-length sequences (presumably because all > the sequences have symbols between positions 1 and 5). > > Any suggestions would be greatly appreciated. > Thanks, > Ed > -- > View this message in context: http://www.nabble.com/Getting-a-Slice-of-an-Alignment-t1849222.html#a5047818 > Sent from the BioJava forum at Nabble.com. > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From david.bourgais at bioxpr.be Mon Jun 26 11:18:02 2006 From: david.bourgais at bioxpr.be (David Bourgais) Date: Mon, 26 Jun 2006 17:18:02 +0200 Subject: [Biojava-l] Using Blast2HTML Message-ID: <1151335082.10441.1.camel@bioxpr-05.ct.fundp.ac.be> Hello I would like to implement the Blast2HTML class. 
But this method failed during compilation:

    public HTMLRenderer configureBlastN( PrintWriter poOut ) {

        SimpleAlignmentStyler oStyler = new SimpleAlignmentStyler
            ( SimpleAlignmentStyler.SHOW_ALL );
        String oRed = "FFA2A2";
        oStyler.addStyle( "-", oRed );
        oStyler.addStyle( "N", oRed );
        oStyler.addStyle( "A", oRed );
        oStyler.addStyle( "T", oRed );
        oStyler.addStyle( "C", oRed );
        oStyler.addStyle( "G", oRed );

        AlignmentMarker oAlignmentMarker = new AlignmentMarker
            ( new ColourCommand() {
                public boolean isColoured
                    ( String poFirst, String poSecond ) {

                    if ( poFirst.equals( poSecond ) ) {
                        return false;
                    } else {
                        return true;
                    }
                }
            } // end ColourCommand
            , oStyler
            );

        Properties oProps = new Properties();
        oProps.put( "db", "nucl" );

        DefaultURLGeneratorFactory durlgf = new DefaultURLGeneratorFactory();

        HTMLRenderer oRenderer = new HTMLRenderer(poOut,
                                                  this.oStyleDefinition,
                                                  50,
                                                  durlgf,
                                                  oAlignmentMarker,
                                                  oProps);

        return oRenderer;
    }

My compiler told me:

    The constructor HTMLRenderer(PrintWriter, String, int, DefaultURLGeneratorFactory, AlignmentMarker, Properties) is undefined

How can I solve this problem?

Thank you very much for your answer.

Regards.

David Bourgais

From edbeaty at charter.net Mon Jun 26 12:04:33 2006
From: edbeaty at charter.net (Dexter Riley)
Date: Mon, 26 Jun 2006 09:04:33 -0700 (PDT)
Subject: [Biojava-l] Getting a Slice of an Alignment
In-Reply-To: <1151335745.3938.40.camel@texas.ebi.ac.uk>
References: <5047818.post@talk.nabble.com> <1151335745.3938.40.camel@texas.ebi.ac.uk>
Message-ID: <5049831.post@talk.nabble.com>

I should have clarified; the >Seq1 lines were just to indicate that the series of sequences were part of an alignment.
I did a quick implementation of what I was looking for, but was hoping that something like this already existed, in a form that could handle gaps and alignment shifts properly:

    public static Alignment getSlice(Alignment alignment, Location location)
            throws BioException {
        List subAlignment = new ArrayList();
        List<String> labels = (List<String>) alignment.getLabels();
        for (String label : labels) {
            subAlignment.add(new SimpleAlignmentElement(
                    label,
                    alignment.symbolListForLabel(label).subList(location.getMin(), location.getMax()),
                    location));
        }
        return new FlexibleAlignment(subAlignment);
    }

This implementation will probably not work as expected if the sequences don't all begin at location 1. I'm honestly surprised that there's no Alignment utility that will do something like this already; surely someone else has had a need for a view to a part of an alignment before?

-Ed
--
View this message in context: http://www.nabble.com/Getting-a-Slice-of-an-Alignment-t1849222.html#a5049831
Sent from the BioJava forum at Nabble.com.

From richard.holland at ebi.ac.uk Tue Jun 27 05:02:40 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Jun 2006 10:02:40 +0100
Subject: [Biojava-l] Using Blast2HTML
In-Reply-To: <1151335082.10441.1.camel@bioxpr-05.ct.fundp.ac.be>
References: <1151335082.10441.1.camel@bioxpr-05.ct.fundp.ac.be>
Message-ID: <1151398960.3938.43.camel@texas.ebi.ac.uk>

Are you sure the BioJava classes/jar files are on your classpath whilst compiling? The constructor it mentions definitely exists, so this is the only thing I can think of that might be wrong.

cheers,
Richard

PS. Also make sure you're using the latest (1.4) release, or the latest code from CVS.

On Mon, 2006-06-26 at 17:18 +0200, David Bourgais wrote:
> Hello
>
> I would like to implement the Blast2HTML class.
> But this method failed suring compilation : > > public HTMLRenderer configureBlastN( PrintWriter poOut ) { > > SimpleAlignmentStyler oStyler = new SimpleAlignmentStyler > ( SimpleAlignmentStyler.SHOW_ALL ); > String oRed = "FFA2A2"; > oStyler.addStyle( "-", oRed ); > oStyler.addStyle( "N", oRed ); > oStyler.addStyle( "A", oRed ); > oStyler.addStyle( "T", oRed ); > oStyler.addStyle( "C", oRed ); > oStyler.addStyle( "G", oRed ); > > AlignmentMarker oAlignmentMarker = new AlignmentMarker > ( new ColourCommand() { > public boolean isColoured > ( String poFirst, String poSecond ) { > > if ( poFirst.equals( poSecond ) ) { > return false; > } else { > return true; > } > } > } // end ColourCommand > , oStyler > ); > > Properties oProps = new Properties(); > oProps.put( "db", "nucl" ); > > DefaultURLGeneratorFactory durlgf = new DefaultURLGeneratorFactory(); > > HTMLRenderer oRenderer = new HTMLRenderer(poOut, > this.oStyleDefinition, > 50, > durlgf, > oAlignmentMarker, > oProps); > > return oRenderer; > } > > My compiler told me : > The constructor HTMLRenderer(PrintWriter, String, int, > DefaultURLGeneratorFactory, AlignmentMarker, Properties) is undefined > > How can I solve this problem. > > Thank you very much for answer. > > Regards. 
>
> David Bourgais

--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From richard.holland at ebi.ac.uk Tue Jun 27 05:17:38 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Jun 2006 10:17:38 +0100
Subject: [Biojava-l] Getting a Slice of an Alignment
In-Reply-To: <5049831.post@talk.nabble.com>
References: <5047818.post@talk.nabble.com> <1151335745.3938.40.camel@texas.ebi.ac.uk> <5049831.post@talk.nabble.com>
Message-ID: <1151399858.3938.57.camel@texas.ebi.ac.uk>

Your getSlice method, assuming it must return a FlexibleAlignment instance, could be better written as:

    public static FlexibleAlignment getSlice(Alignment alignment, Location location) {
        Alignment subAlignment = alignment.subAlignment(null, location);
        List subAlignmentElements = new ArrayList();
        List<String> labels = (List<String>) alignment.getLabels();
        for (String label : labels)
            subAlignmentElements.add(new SimpleAlignmentElement(
                    label,
                    subAlignment.symbolListForLabel(label),
                    location));
        return new FlexibleAlignment(subAlignmentElements);
    }

or, if you don't care what class of alignment you get back, then even just this would work:

    public static Alignment getSlice(Alignment alignment, Location location) {
        return alignment.subAlignment(null, location);
    }

I'm afraid I still don't understand what it is that the above code can't do. Could you give example code showing what it is you are doing with the sub alignment once it has been created, the output you'd expect or want from the example code, and the output it actually gives?

cheers,
Richard

On Mon, 2006-06-26 at 09:04 -0700, Dexter Riley wrote:
> I should have clarified; the >Seq1 lines were just to indicate that the
> series of sequences were part of an alignment.
I did a quick implementation > of what I was looking for, but was hoping that something like this already > existed, in a form that could handle gaps and alignment shifts properly : > > public static Alignment getSlice(Alignment alignment, Location location) > throws BioException{ > List subAlignment = new ArrayList(); > List labels = (List) alignment.getLabels(); > for (String label: labels){ > subAlignment.add(new SimpleAlignmentElement( > label, > alignment.symbolListForLabel(label).subList(location.getMin(), > location.getMax()), > location) > ); > } > return new FlexibleAlignment(subAlignment); > } > > This implementation will probably not work as expected if the sequences > don't all begin at location 1. I'm honestly surprised that there's no > Alignment utility that will do something like this already; surely someone > else has had a need for a view to a part of an alignment before? > > -Ed > -- > View this message in context: http://www.nabble.com/Getting-a-Slice-of-an-Alignment-t1849222.html#a5049831 > Sent from the BioJava forum at Nabble.com. > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From david.bourgais at bioxpr.be Tue Jun 27 07:05:50 2006 From: david.bourgais at bioxpr.be (David Bourgais) Date: Tue, 27 Jun 2006 13:05:50 +0200 Subject: [Biojava-l] NoClassDefFound error with ColourCommand Message-ID: <1151406350.16394.11.camel@bioxpr-05.ct.fundp.ac.be> Hello I compiled a CVS version of BioJava (1.5) and I tried to execute my program. 
The problem occurs while executing this part of the code:

    AlignmentMarker oAlignmentMarker = new AlignmentMarker
        ( new ColourCommand() {
            public boolean isColoured
                ( String poFirst, String poSecond ) {

                if ( poFirst.equals( poSecond ) ) {
                    return false;
                } else {
                    return true;
                }
            }
        } // end ColourCommand
        , oStyler
        );

Everything is okay during compilation, but at run time my program crashes with this error:

    java.lang.NoClassDefFoundError: org/biojava/bio/program/blast2html/ColourCommand
        at com.bioxpr.blaster.Process.doPost(Process.java:35)

But the ColourCommand class is present in my jar. Is there a problem with this interface?

Thank you very much for your answer.

Regards.

David

From richard.holland at ebi.ac.uk Tue Jun 27 07:42:43 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Jun 2006 12:42:43 +0100
Subject: [Biojava-l] NoClassDefFound error with ColourCommand
In-Reply-To: <1151406350.16394.11.camel@bioxpr-05.ct.fundp.ac.be>
References: <1151406350.16394.11.camel@bioxpr-05.ct.fundp.ac.be>
Message-ID: <1151408563.3938.67.camel@texas.ebi.ac.uk>

Sounds like you may be executing the code against a different version of BioJava than the one you compiled against.

On Tue, 2006-06-27 at 13:05 +0200, David Bourgais wrote:
> Hello
>
> I compiled a CVS version of BioJava (1.5) and I tried to execute my
> program. The problem is during the execution of this part of code :
>
> AlignmentMarker oAlignmentMarker = new AlignmentMarker
>     ( new ColourCommand() {
>         public boolean isColoured
>             ( String poFirst, String poSecond ) {
>
>             if ( poFirst.equals( poSecond ) ) {
>                 return false;
>             } else {
>                 return true;
>             }
>         }
>     } // end ColourCommand
>     , oStyler
>     );
>
> Everything is okay during compilation but in the execution, my program
> crashes with this error :
>
> java.lang.NoClassDefFoundError:
> org/biojava/bio/program/blast2html/ColourCommand at
> com.bioxpr.blaster.Process.doPost(Process.java:35)
>
> But, in my Jar, ColourCommand class is present.
> Is there a problem with this interface ?
>
> Thank you very much for answer.
>
> Regards.
>
> David

--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From david.bourgais at bioxpr.be Tue Jun 27 07:36:01 2006
From: david.bourgais at bioxpr.be (David Bourgais)
Date: Tue, 27 Jun 2006 13:36:01 +0200
Subject: [Biojava-l] NoClassDefFound error with ColourCommand
In-Reply-To: <1151406350.16394.11.camel@bioxpr-05.ct.fundp.ac.be>
References: <1151406350.16394.11.camel@bioxpr-05.ct.fundp.ac.be>
Message-ID: <1151408161.16394.15.camel@bioxpr-05.ct.fundp.ac.be>

Ok, I would prefer to change my question.

I would like to convert BLAST XML output into an HTML file. On the BioJava website, I noticed this link:
http://www.biojava.org/wiki/BioJava:Tutorial:Blast2HTML_Example_Application
But the two links in the introduction are broken. So, is there any tutorial on converting BLAST XML into an HTML file?

Thank you very much for your answer.

Regards.

David

From richard.holland at ebi.ac.uk Tue Jun 27 07:57:10 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Jun 2006 12:57:10 +0100
Subject: [Biojava-l] NoClassDefFound error with ColourCommand
In-Reply-To: <1151408161.16394.15.camel@bioxpr-05.ct.fundp.ac.be>
References: <1151406350.16394.11.camel@bioxpr-05.ct.fundp.ac.be> <1151408161.16394.15.camel@bioxpr-05.ct.fundp.ac.be>
Message-ID: <1151409430.3938.73.camel@texas.ebi.ac.uk>

The link you give is the tutorial. The broken links would normally point to two example XML files; nothing in those files tells you how to parse them. You will notice that the tutorial specifies an example application in the demos directory of the BioJava distribution.
The source code of this example application is provided with BioJava (or if you can't find it, check out the latest copy of the source from CVS). Reading the source code should be enough to work out how it works - although if you'd like to write a better tutorial on the subject and contribute it back, that would be very much appreciated!

cheers,
Richard.

PS. NOTE: to the person who converted the Blast2HTML tutorial, can you fix the two broken links to the example blastp and blastn input files please? Thanks!

On Tue, 2006-06-27 at 13:36 +0200, David Bourgais wrote:
> Ok, I prefer to change my question.
> I would like to convert a Blast output in XML into a HTML file.
> In the BioJava website, I noticed this link :
> http://www.biojava.org/wiki/BioJava:Tutorial:Blast2HTML_Example_Application
> But, the two links in the introduction are not correct.
> So, is there any tutorial to convert a Blast XML into a HTML file ?
>
> Thank you very much for answer.
>
> Regards.
>
> David

--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From anderson.moura at telemar-rj.com.br Tue Jun 27 07:59:03 2006
From: anderson.moura at telemar-rj.com.br (Anderson Moura da Silva)
Date: Tue, 27 Jun 2006 08:59:03 -0300
Subject: [Biojava-l] Alignment Viewer
Message-ID: <3C39C09ED334F243838953854BE43FB602F9DE9B@MAILBX02.telemar.corp.net>

Hi everyone,

Does BioJava implement an alignment panel like SequencePanel? Is it possible to colour it the way ClustalW does? I used the SequencePanel, but without colours; is it possible to add colours to that one?

Another question: can a SequenceDB only store sequences of one type?
I'm asking because I tried to save sequences using ProteinTools, RNATools and DNATools in the same SequenceDB, and I was not able to get these sequences back using the ID of the sequence. Is it wrong to put more than one type of sequence in the same SequenceDB?

Thanks a lot,
Anderson Moura - Brasil

This message, including its attachments, may contain privileged and/or confidential information and may not be forwarded without the sender's authorization. If you are not the intended recipient or a person authorized to receive it, please be advised that its use, disclosure, copying or archiving is prohibited. If you have received this message in error, please notify us by replying immediately to this e-mail and then delete it.

From david.bourgais at bioxpr.be Tue Jun 27 08:20:50 2006
From: david.bourgais at bioxpr.be (David Bourgais)
Date: Tue, 27 Jun 2006 14:20:50 +0200
Subject: [Biojava-l] NoClassDefFound error with ColourCommand
In-Reply-To: <1151409430.3938.73.camel@texas.ebi.ac.uk>
References: <1151406350.16394.11.camel@bioxpr-05.ct.fundp.ac.be> <1151408161.16394.15.camel@bioxpr-05.ct.fundp.ac.be> <1151409430.3938.73.camel@texas.ebi.ac.uk>
Message-ID: <1151410850.16394.27.camel@bioxpr-05.ct.fundp.ac.be>

Good afternoon,

I am sorry, but I did not find any source code in the demos or apps directory of my CVS version. Is there a specific path where I can find it?

Thank you again.

Regards.

From david.bourgais at bioxpr.be Tue Jun 27 08:54:30 2006
From: david.bourgais at bioxpr.be (David Bourgais)
Date: Tue, 27 Jun 2006 14:54:30 +0200
Subject: [Biojava-l] NoClassDefFound error with ColourCommand
In-Reply-To: <1151410850.16394.27.camel@bioxpr-05.ct.fundp.ac.be>
References: <1151406350.16394.11.camel@bioxpr-05.ct.fundp.ac.be> <1151408161.16394.15.camel@bioxpr-05.ct.fundp.ac.be> <1151409430.3938.73.camel@texas.ebi.ac.uk> <1151410850.16394.27.camel@bioxpr-05.ct.fundp.ac.be>
Message-ID: <1151412870.16394.29.camel@bioxpr-05.ct.fundp.ac.be>

Ok, I solved the problem.
Thank you for all.

Regards.

David

From edbeaty at charter.net Tue Jun 27 10:20:59 2006
From: edbeaty at charter.net (Dexter Riley)
Date: Tue, 27 Jun 2006 07:20:59 -0700 (PDT)
Subject: [Biojava-l] Getting a Slice of an Alignment
In-Reply-To: <1151399858.3938.57.camel@texas.ebi.ac.uk>
References: <5047818.post@talk.nabble.com> <1151335745.3938.40.camel@texas.ebi.ac.uk> <5049831.post@talk.nabble.com> <1151399858.3938.57.camel@texas.ebi.ac.uk>
Message-ID: <5066891.post@talk.nabble.com>

Thanks for looking at the method! I'll give your improved version a try.

subAlignment does return a slice of the Alignment; a horizontal slice. I need a vertical slice at a given location. In other words,

    subAlignment:
        if sequence in alignment has symbols at location, return entire sequence
    get(Vertical)Slice:
        for sequence in alignment, return subsequence at location

I use slices for primer design, where I have a candidate primer location and want to see the list of different target sequences in the alignment at that position (so I can consider possible mismatches, Tm, etc.)
It would also be handy for the GUI, to say, "give me a view of bases 2000-2567 for every sequence in this really long alignment".

Thanks,
Ed
--
View this message in context: http://www.nabble.com/Getting-a-Slice-of-an-Alignment-tf1849222.html#a5066891
Sent from the BioJava forum at Nabble.com.

From richard.holland at ebi.ac.uk Tue Jun 27 11:26:37 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Jun 2006 16:26:37 +0100
Subject: [Biojava-l] Getting a Slice of an Alignment
In-Reply-To: <5066891.post@talk.nabble.com>
References: <5047818.post@talk.nabble.com> <1151335745.3938.40.camel@texas.ebi.ac.uk> <5049831.post@talk.nabble.com> <1151399858.3938.57.camel@texas.ebi.ac.uk> <5066891.post@talk.nabble.com>
Message-ID: <1151421997.3938.91.camel@texas.ebi.ac.uk>

Ah...

I just read the source code for the symbolListForLabel() method on sub alignments, and found what may well be a bug.
BioJava list people, your help please! In my understanding, symbolListForLabel() should return the symbols from the given label that fall within the alignment. This is the case in all except sub alignments. Sub alignments, for whatever reason, are returning the symbols from the given label that fall within the parent alignment upon which the sub alignment is based, NOT just those that fall within the sub alignment itself. Is this a bug? I think it is. The solution would be for me to alter AbstractULAlignment.SubULAlignment.symbolListForLabel() to restrict the returned symbols to only include those in the area covered by the sub alignment. It would return EMPTY_SEQUENCE if the label didn't cover the area of the sub alignment, and it would return a truncated symbol list if it only partially covered it. Would this be acceptable? If so, once this change was made, it would fix Ed's problems below as subAlignment() would start returning vertical slices as I think it should probably have done so from the start, rather than the horizontal slices it is returning at present. cheers, Richard On Tue, 2006-06-27 at 07:20 -0700, Dexter Riley wrote: > Thanks for looking at the method! I'll give your improved version a try. > > subAlignment does return a slice of the Alignment; a horizontal slice. I > need a vertical slice at a given location. In other words, > subAlignment: > if sequence in alignment has symbols at location, return entire sequence > get(Vertical)Slice: > for sequence in alignment, return subsequence at location > > I use slices for primer design, where I have a candidate primer location and > want to see the list of different target sequences in the alignment at that > position (so I can consider possible mismatches, Tm, etc.) > It would also be handy for the GUI, to say, "give me a view of bases > 2000-2567 for every sequence in this really long alignment". 
> > Thanks, > Ed -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From edbeaty at charter.net Tue Jun 27 15:57:45 2006 From: edbeaty at charter.net (Dexter Riley) Date: Tue, 27 Jun 2006 12:57:45 -0700 (PDT) Subject: [Biojava-l] Getting a Slice of an Alignment In-Reply-To: <1151421997.3938.91.camel@texas.ebi.ac.uk> References: <5047818.post@talk.nabble.com> <1151335745.3938.40.camel@texas.ebi.ac.uk> <5049831.post@talk.nabble.com> <1151399858.3938.57.camel@texas.ebi.ac.uk> <5066891.post@talk.nabble.com> <1151421997.3938.91.camel@texas.ebi.ac.uk> Message-ID: <5072893.post@talk.nabble.com> Richard Holland-2 wrote: > > Ah... > > I just read the source code for the symbolListForLabel() method on sub > alignments, and found what may well be a bug. > > BioJava list people, your help please! In my understanding, > symbolListForLabel() should return the symbols from the given label that > fall within the alignment. This is the case in all except sub > alignments. Sub alignments, for whatever reason, are returning the > symbols from the given label that fall within the parent alignment upon > which the sub alignment is based, NOT just those that fall within the > sub alignment itself. > > Is this a bug? I think it is. > > The solution would be for me to alter > AbstractULAlignment.SubULAlignment.symbolListForLabel() to restrict the > returned symbols to only include those in the area covered by the sub > alignment. It would return EMPTY_SEQUENCE if the label didn't cover the > area of the sub alignment, and it would return a truncated symbol list > if it only partially covered it. > > Would this be acceptable? > > If so, once this change was made, it would fix Ed's problems below as > subAlignment() would start returning vertical slices as I think it > should probably have done so from the start, rather than the horizontal > slices it is returning at present. 
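The restriction described in the quoted proposal (an empty result when a sequence lies outside the window, a truncated result when it overlaps only partly) can be sketched on plain strings. This is only an illustration under stated assumptions, not the BioJava implementation: WindowSlice, restrictToWindow and slice are hypothetical names, sequences are plain strings rather than SymbolLists, and coordinates are 1-based and inclusive in the style of a BioJava Location.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WindowSlice {
    private WindowSlice() { }

    /**
     * Returns the part of a sequence that falls inside [windowMin, windowMax].
     * seqStart is the alignment column occupied by the first symbol.
     * The result is empty when the sequence does not overlap the window.
     */
    static String restrictToWindow(String symbols, int seqStart, int windowMin, int windowMax) {
        int seqEnd = seqStart + symbols.length() - 1;
        int from = Math.max(seqStart, windowMin);
        int to = Math.min(seqEnd, windowMax);
        if (from > to) {
            return ""; // no overlap with the window
        }
        return symbols.substring(from - seqStart, to - seqStart + 1);
    }

    /**
     * Vertical slice: every label keeps only its symbols inside the window.
     * For simplicity, all sequences are assumed to start at column 1.
     */
    static Map<String, String> slice(Map<String, String> alignment, int windowMin, int windowMax) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> entry : alignment.entrySet()) {
            result.put(entry.getKey(), restrictToWindow(entry.getValue(), 1, windowMin, windowMax));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> alignment = new LinkedHashMap<String, String>();
        alignment.put("Seq1", "nactatcgg");
        alignment.put("Seq2", "nactatcgg");
        System.out.println(slice(alignment, 1, 5));
    }
}
```

With the two example sequences above and the window 1-5, each label maps to nacta, which matches the slice Ed asked for at the start of the thread.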
> > cheers, > Richard > I think that would provide just the functionality I was looking for! Thanks very much for all your help. All the best, Ed -- View this message in context: http://www.nabble.com/Getting-a-Slice-of-an-Alignment-tf1849222.html#a5072893 Sent from the BioJava forum at Nabble.com. From mark.schreiber at novartis.com Tue Jun 27 21:43:21 2006 From: mark.schreiber at novartis.com (mark.schreiber at novartis.com) Date: Wed, 28 Jun 2006 09:43:21 +0800 Subject: [Biojava-l] Alignment Viewer Message-ID: Hi - I don't think there is an alignment viewer officially although I know that people have written them. Maybe someone would like to contribute theirs to the list?? Can you give more detailed information about your SequenceDB bug? Biojava version, example code etc? Thanks, - Mark Mark Schreiber Research Investigator (Bioinformatics) Novartis Institute for Tropical Diseases (NITD) 10 Biopolis Road #05-01 Chromos Singapore 138670 www.nitd.novartis.com phone +65 6722 2973 fax +65 6722 2910 "Anderson Moura da Silva" Sent by: biojava-l-bounces at lists.open-bio.org 06/27/2006 07:59 PM To: cc: (bcc: Mark Schreiber/GP/Novartis) Subject: [Biojava-l] Alignment Viewer Hi everyone, Does BioJava implements a Alignment Panel like SequencePanel? Is it possible to put colors in it like ClustalW? I Used the SequencePanel but not in colors, is it possible to put colors on this one? Another question: The SequenceDB can only store sequences of one type? I'm asking it because I tryed to save sequences using ProteinTools, RNATools and DNATools on the same SequenceDB and I was not able to get this sequences back using the ID of the sequence. Is it wrong to put more than one type of sequence on the same SequenceDB? Thank you a lot Anderson Moura - Brasil Esta mensagem, incluindo seus anexos, pode conter informa??es privilegiadas e/ou de car?ter confidencial, n?o podendo ser retransmitida sem autoriza??o do remetente. Se voc? n?o ? 
o destinat?rio ou pessoa autorizada a receb?-la, informamos que o seu uso, divulga??o, c?pia ou arquivamento s?o proibidos. Portanto, se voc? recebeu esta mensagem por engano, por favor, nos informe respondendo imediatamente a este e-mail e em seguida apague-a. _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l From dreamcomtrue at gmail.com Wed Jun 28 03:22:54 2006 From: dreamcomtrue at gmail.com (Jessicaa) Date: Wed, 28 Jun 2006 00:22:54 -0700 (PDT) Subject: [Biojava-l] Using SeqIOTools.biojavaToFile Message-ID: <5079611.post@talk.nabble.com> Hello, I compiled a java file and I tried to execute my program to change embl flat file to ncbi flat file. But output file(Toncbi.gbk) is o byte. //ex>java embl_togbk embl_sample import java.io.*; import java.util.*; import java.lang.*; import org.biojava.bio.*; import org.biojava.bio.seq.*; import org.biojava.bio.seq.io.*; import org.biojava.bio.symbol.*; public class embl_togbk { public static void main(String[] args) { String fileName = args[0]; BufferedReader br; SequenceIterator seq_iter; Sequence seq; try { br = new BufferedReader(new FileReader(fileName)); seq_iter = SeqIOTools.readEmbl(br); while(seq_iter.hasNext()) { seq =seq_iter.nextSequence(); System.out.println(seq.getName()); Iterator feat_iter = seq.features(); while(feat_iter.hasNext()) { Feature f= (Feature) (feat_iter.next()); Annotation annot = f.getAnnotation(); System.out.println("*+feature: "+ f.getType()); System.out.println(); } } // System.out.println(); SeqIOTools.biojavaToFile("GenBank","dna", new FileOutputStream("Toncbi.gbk"), seq_iter); br.close(); } catch(Exception e) { e.printStackTrace(); } } } //----------------------------------------------- //file:embl_sample ID A1MVRNA2 standard; DNA; 2593 BP. 
XX AC X01572; XX DT 03-AUG-1987 (an correction) DT 30-JAN-1986 (author review) DT 17-JUL-1985 (first entry) XX DE Alfalfa mosaic virus (A1M4) RNA 2 XX KW unidentified reading frame. XX OS Alfalfa mosaic virus OC Viridae; ss-RNA nonenveloped viruses; Alfamovirus. XX RN [1] (bases 1-2593; enum. 1 to 2593) RA Cornelissen B.J.C., Brederode F.T., Veeneman G.H., van Boom J.H., RA Bol J.F.; RT "Complete nucleotide sequence of alfalfa mosaic virus RNA 2"; RL Nucl. Acids Res. 11:3019-3025(1983). XX CC Data kindly reviewed (30-JAN-1986) by J.F. Bol XX FH Key From To Description FH FT TRANSCR 1 80 A1MV RNA 2 FT CDS 11 80 unidentified reading frame XX SQ Sequence 80 BP; 13 A; 14 C; 21 G; 32 T; TTTTTTTTTT ATGCCCCCCCC GGGGGGGGGG TTTTTTTTTT TTTTTTTTTT GGGGGGGGGG AAAAAAAAAA CCCCCCCCTAA // ------------------------------------------------------- Is there a problem with this code? Thank you very much for answer. Regards. Jessica -- View this message in context: http://www.nabble.com/Using-SeqIOTools.biojavaToFile-tf1859886.html#a5079611 Sent from the BioJava forum at Nabble.com. From richard.holland at ebi.ac.uk Wed Jun 28 04:56:20 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Wed, 28 Jun 2006 09:56:20 +0100 Subject: [Biojava-l] Using SeqIOTools.biojavaToFile In-Reply-To: <5079611.post@talk.nabble.com> References: <5079611.post@talk.nabble.com> Message-ID: <1151484980.3942.6.camel@texas.ebi.ac.uk> Your code is reading all the way through the sequence iterator (the variable seq_iter) in a while loop. Then, you are using the same sequence iterator to write out the database using SeqIOTools.biojavaToFile. You get no output because the iterator has been iterated over and all used up in the while loop, so has no more sequences to supply. 
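
The same effect can be reproduced with plain java.util types. The following is only a sketch, not BioJava code: the class name IteratorExhaustionDemo and the helper drain() are made up for illustration, and plain strings stand in for Sequence objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IteratorExhaustionDemo {

    // Hypothetical helper: collects whatever the iterator still has to offer,
    // standing in for a writer that consumes a sequence iterator.
    public static List<String> drain(Iterator<String> it) {
        List<String> out = new ArrayList<String>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("seq1", "seq2", "seq3");

        // First pass: reading every element uses the iterator up.
        Iterator<String> it = source.iterator();
        while (it.hasNext()) {
            System.out.println("read " + it.next());
        }

        // Second pass over the SAME iterator: nothing is left to consume.
        System.out.println("left after loop: " + drain(it).size());

        // A fresh iterator over the same source supplies everything again.
        System.out.println("from fresh iterator: " + drain(source.iterator()).size());
    }
}
```

The second pass finds nothing and the fresh iterator finds all three items, which is why the posted program writes an empty file when it reuses seq_iter after the loop.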
If you want to write out as well as parse the sequences, you either have to write them out one at a time in the while loop, or open a new sequence iterator over the file at the end of the loop to pass to the SeqIOTools.biojavaToFile call. cheers, Richard On Wed, 2006-06-28 at 00:22 -0700, Jessicaa wrote: > Hello, > I compiled a java file and I tried to execute my program to change embl flat > file to ncbi flat file. > But output file(Toncbi.gbk) is o byte. > > > //ex>java embl_togbk embl_sample > > > import java.io.*; > import java.util.*; > import java.lang.*; > import org.biojava.bio.*; > import org.biojava.bio.seq.*; > import org.biojava.bio.seq.io.*; > import org.biojava.bio.symbol.*; > > > public class embl_togbk { > public static void main(String[] args) { > String fileName = args[0]; > BufferedReader br; > SequenceIterator seq_iter; > Sequence seq; > try { > br = new BufferedReader(new FileReader(fileName)); > seq_iter = SeqIOTools.readEmbl(br); > while(seq_iter.hasNext()) { > seq =seq_iter.nextSequence(); > System.out.println(seq.getName()); > Iterator feat_iter = seq.features(); > while(feat_iter.hasNext()) { > Feature f= (Feature) (feat_iter.next()); > Annotation annot = f.getAnnotation(); > System.out.println("*+feature: "+ f.getType()); > System.out.println(); > } > } > // System.out.println(); > SeqIOTools.biojavaToFile("GenBank","dna", new > FileOutputStream("Toncbi.gbk"), seq_iter); > > br.close(); > } catch(Exception e) { > e.printStackTrace(); > } > } > } > > > > //----------------------------------------------- > //file:embl_sample > ID A1MVRNA2 standard; DNA; 2593 BP. > XX > AC X01572; > XX > DT 03-AUG-1987 (an correction) > DT 30-JAN-1986 (author review) > DT 17-JUL-1985 (first entry) > XX > DE Alfalfa mosaic virus (A1M4) RNA 2 > XX > KW unidentified reading frame. > XX > OS Alfalfa mosaic virus > OC Viridae; ss-RNA nonenveloped viruses; Alfamovirus. > XX > RN [1] (bases 1-2593; enum. 
1 to 2593) > RA Cornelissen B.J.C., Brederode F.T., Veeneman G.H., van Boom J.H., > RA Bol J.F.; > RT "Complete nucleotide sequence of alfalfa mosaic virus RNA 2"; > RL Nucl. Acids Res. 11:3019-3025(1983). > XX > CC Data kindly reviewed (30-JAN-1986) by J.F. Bol > XX > FH Key From To Description > FH > FT TRANSCR 1 80 A1MV RNA 2 > FT CDS 11 80 unidentified reading frame > XX > SQ Sequence 80 BP; 13 A; 14 C; 21 G; 32 T; > TTTTTTTTTT ATGCCCCCCCC GGGGGGGGGG TTTTTTTTTT TTTTTTTTTT GGGGGGGGGG > AAAAAAAAAA CCCCCCCCTAA > // > > ------------------------------------------------------- > Is there a problem with this code? > Thank you very much for answer. > > Regards. > > Jessica -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From richard.holland at ebi.ac.uk Wed Jun 28 04:57:56 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Wed, 28 Jun 2006 09:57:56 +0100 Subject: [Biojava-l] Getting a Slice of an Alignment In-Reply-To: <5072893.post@talk.nabble.com> References: <5047818.post@talk.nabble.com> <1151335745.3938.40.camel@texas.ebi.ac.uk> <5049831.post@talk.nabble.com> <1151399858.3938.57.camel@texas.ebi.ac.uk> <5066891.post@talk.nabble.com> <1151421997.3938.91.camel@texas.ebi.ac.uk> <5072893.post@talk.nabble.com> Message-ID: <1151485076.3942.8.camel@texas.ebi.ac.uk> Dear list... if I haven't heard any arguments to the contrary by 9am Monday 3rd July (UK time), I'll make the changes described below. cheers, Richard On Tue, 2006-06-27 at 12:57 -0700, Dexter Riley wrote: > > Richard Holland-2 wrote: > > > > Ah... > > > > I just read the source code for the symbolListForLabel() method on sub > > alignments, and found what may well be a bug. > > > > BioJava list people, your help please! In my understanding, > > symbolListForLabel() should return the symbols from the given label that > > fall within the alignment. This is the case in all except sub > > alignments. 
Sub alignments, for whatever reason, are returning the > > symbols from the given label that fall within the parent alignment upon > > which the sub alignment is based, NOT just those that fall within the > > sub alignment itself. > > > > Is this a bug? I think it is. > > > > The solution would be for me to alter > > AbstractULAlignment.SubULAlignment.symbolListForLabel() to restrict the > > returned symbols to only include those in the area covered by the sub > > alignment. It would return EMPTY_SEQUENCE if the label didn't cover the > > area of the sub alignment, and it would return a truncated symbol list > > if it only partially covered it. > > > > Would this be acceptable? > > > > If so, once this change was made, it would fix Ed's problems below as > > subAlignment() would start returning vertical slices as I think it > > should probably have done so from the start, rather than the horizontal > > slices it is returning at present. > > > > cheers, > > Richard > > > > I think that would provide just the functionality I was looking for! Thanks > very much for all your help. > All the best, > Ed -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From mark.schreiber at novartis.com Thu Jun 29 03:36:05 2006 From: mark.schreiber at novartis.com (mark.schreiber at novartis.com) Date: Thu, 29 Jun 2006 15:36:05 +0800 Subject: [Biojava-l] RES: Alignment Viewer Message-ID: The following example works for me. Let me know if it causes problems for you. 
- Mark

/*
 * SeqDBExample.java
 *
 * Created on June 29, 2006, 3:23 PM
 *
 */

package db;

import org.biojava.bio.seq.DNATools;
import org.biojava.bio.seq.ProteinTools;
import org.biojava.bio.seq.RNATools;
import org.biojava.bio.seq.Sequence;
import org.biojava.bio.seq.db.HashSequenceDB;
import org.biojava.bio.seq.db.SequenceDB;
import org.biojava.bio.seq.io.SeqIOTools;

/**
 *
 * @author
 */
public class SeqDBExample {
    private SequenceDB db;

    /** Creates a new instance of SeqDBExample */
    public SeqDBExample() {
        db = new HashSequenceDB();
    }

    public void test() throws Exception{
        //create 3 sequences
        Sequence rna = RNATools.createRNASequence("auggc", "rna seq");
        Sequence dna = DNATools.createDNASequence("atggc", "dna seq");
        Sequence prot = ProteinTools.createProteinSequence("HVFST", "prot seq");

        //add them to the DB
        db.addSequence(rna);
        db.addSequence(dna);
        db.addSequence(prot);

        //get them back and print them to the screen
        SeqIOTools.writeFasta(System.out, db.getSequence("rna seq"));
        SeqIOTools.writeFasta(System.out, db.getSequence("dna seq"));
        SeqIOTools.writeFasta(System.out, db.getSequence("prot seq"));
    }

    public static void main(String[] args) throws Exception{
        SeqDBExample example = new SeqDBExample();
        example.test();
    }
}

"Anderson Moura da Silva"
06/28/2006 09:52 PM
To:
cc:
Subject: RES: [Biojava-l] Alignment Viewer

Well, I've created a HashSequenceDB and then I need to load from a
customized XML file, many sequences of DNA, RNA or Protein Alphabet.
I don't have the code here right now, but I did something like this:

//Check whether it's DNA, RNA or Protein in a loop
if (alphabet=="DNA"){
    sym = createDNASequence(sequence_string, name_sequence);
    sequence = createSequence(sym , null, name_sequence, null);
} else if (alphabet=="RNA"){
    sym = createRNASequence(sequence_string, name_sequence);
    sequence = createSequence(sym , null, name_sequence, null);
} else if (alphabet=="PROTEIN_TERM"){
    sym = createProteinSequence(sequence_string, name_sequence);
    sequence = createSequence(sym , null, name_sequence, null);
}
hashSequenceDB.addSequence(name_sequence,sequence);
//end of the loop

Then, in another class, I try to get the sequences back like this, passing
the variable name_sequence as a parameter:

hashSequenceDB.getSequence(name_sequence);
//but it always gives me back a null pointer type error.

If you can help I'd be glad!

Thanks

Anderson Moura - Brazil

-----Original Message-----
From: mark.schreiber at novartis.com [mailto:mark.schreiber at novartis.com]
Sent: Tuesday, 27 June 2006 22:43
To: Anderson Moura da Silva
Cc: biojava-l at lists.open-bio.org; biojava-l-bounces at lists.open-bio.org
Subject: Re: [Biojava-l] Alignment Viewer

Hi -

I don't think there is an official alignment viewer, although I know that
people have written them. Maybe someone would like to contribute theirs to
the list?

Can you give more detailed information about your SequenceDB bug (BioJava
version, example code, etc.)?

Thanks,

- Mark

Mark Schreiber
Research Investigator (Bioinformatics)
Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com
phone +65 6722 2973
fax +65 6722 2910

"Anderson Moura da Silva"
Sent by: biojava-l-bounces at lists.open-bio.org
06/27/2006 07:59 PM
To:
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] Alignment Viewer

Hi everyone,

Does BioJava implement an alignment panel like SequencePanel? Is it possible
to color it the way ClustalW does? I used the SequencePanel, but without
colors; is it possible to add colors to that one?

Another question: can a SequenceDB only store sequences of one type? I'm
asking because I tried to save sequences created with ProteinTools, RNATools
and DNATools in the same SequenceDB, and I was not able to get these
sequences back using the ID of the sequence. Is it wrong to put more than one
type of sequence in the same SequenceDB?

Thanks a lot,

Anderson Moura - Brazil

From richard.holland at ebi.ac.uk Thu Jun 1 15:26:12 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Thu, 01 Jun 2006 16:26:12 +0100
Subject: [Biojava-l] Error loading ontology terms
In-Reply-To:
References:
Message-ID: <1149175573.3948.78.camel@texas.ebi.ac.uk>

Hi there.

I looked through your stack trace, and the line numbers don't match up
with the current code. I have a strong feeling you may have an
out-of-date version of biojava. Could you double-check that you have the
latest biojava-1.4 version, or are using the biojava-live version built
from CVS?

If you can confirm that you are using the latest 1.4 or biojava-live
then it'd be easier to solve this.

Alternatively, you could have an out-of-date version of the BioSQL
schema.
The reason I suspect that your BioSQL or BioJava are out of date is because in the last stack trace you mention, this exception arises: java.sql.SQLException: Unknown column 'name' in 'field list' This shows that BioJava has expected to find a column called 'name' in some table in BioSQL, but that column is not there. This would only happen if your BioSQL version did not match the version of BioSQL that your version of BioJava was expecting. cheers, Richard On Thu, 2006-06-01 at 21:32 +0800, Yi-Feng Chang wrote: > Leif, this looks more like a biojava or biojava-x related problem, so > I'm resending it to the Biojava list. -hilmar > ======================================================================== > == > Dear All, > I've checked biosql archives, and found a similar thread > (http://lists.open-bio.org/pipermail/biojava-l/2005-November/ > 005151.html) > however, it did not give specific solution. So I post here again, and > hope there are someone could help me. > I'm using JDK1.5.0_05, Biojava 1.4, Biosql 1.41, and Mysql 5.0 with > My_connectJ 3.1 > I was following the demo source that provide by biojava-in-anger except > for the database connection > the exceptions were listed in following: > In first connection there would be a connection error > *** Importing a core ontology -- hope this is okay > *** Importing terms > Exception in thread "main" org.biojava.bio.BioException: Error > connecting to BioSQL database: Connection is closed. > at > org.biojava.bio.seq.db.biosql.BioSQLSequenceDB.initDb > (BioSQLSequenceDB.java:276) > at > org.biojava.bio.seq.db.biosql.BioSQLSequenceDB. > (BioSQLSequenceDB.java:194) > at genevote.BioSQLTest.loadSeq(BioSQLTest.java:31) > at genevote.BioSQLTest.main(BioSQLTest.java:70) > Caused by: java.sql.SQLException: Connection is closed. 
> at > org.apache.commons.dbcp.PoolingDataSource > $PoolGuardConnectionWrapper.checkOpen(PoolingDataSource.java:219) > at > org.apache.commons.dbcp.PoolingDataSource > $PoolGuardConnectionWrapper.createStatement(PoolingDataSource.java:248) > at > org.biojava.bio.seq.db.biosql.MySQLDBHelper.getInsertID > (MySQLDBHelper.java:68) > at > org.biojava.bio.seq.db.biosql.BioSQLSequenceDB.initDb > (BioSQLSequenceDB.java:268) > ... 3 more > Then I tried again, it works, and I put all sequences in genbank format > into biosql db without error. > But, while I tried to extract sequences, exception comes again. > org.biojava.bio.BioException: Error loading ontology terms > at > org.biojava.bio.seq.db.biosql.OntologySQL.loadOntology > (OntologySQL.java:444) > at > org.biojava.bio.seq.db.biosql.OntologySQL.getOntology > (OntologySQL.java:116) > at org.biojava.bio.seq.db.biosql.OntologySQL.(OntologySQL.java: > 413) > at > org.biojava.bio.seq.db.biosql.OntologySQL.getOntologySQL > (OntologySQL.java:72) > at > org.biojava.bio.seq.db.biosql.BioSQLSequenceDB.initDb > (BioSQLSequenceDB.java:240) > at > org.biojava.bio.seq.db.biosql.BioSQLSequenceDB. 
> (BioSQLSequenceDB.java:194) > at genevote.test.loadSeq(test.java:25) > at genevote.test.main(test.java:76) > Caused by: java.sql.SQLException: Unknown column 'name' in 'field list' > at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2851) > at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1534) > at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1625) > at com.mysql.jdbc.Connection.execSQL(Connection.java:2297) > at com.mysql.jdbc.Connection.execSQL(Connection.java:2226) > at > com.mysql.jdbc.PreparedStatement.executeInternal > (PreparedStatement.java:1812) > at > com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java: > 1657) > at > org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery > (DelegatingPreparedStatement.java:205) > at > org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery > (DelegatingPreparedStatement.java:205) > at org.biojava.bio.seq.db.biosql.OntologySQL.loadTerms > (OntologySQL.java:339) > at > org.biojava.bio.seq.db.biosql.OntologySQL.loadOntology > (OntologySQL.java:441) > ... 7 more > > yi-feng chang > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From johnson.biotech at gmail.com Thu Jun 1 22:03:43 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Thu, 1 Jun 2006 18:03:43 -0400 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files Message-ID: Hi All, I'm a newbie to the whole BioJava(X) API and was hoping to get some clarification on several issues that I'm having. 
I am developing a parser that would take as input "NCBI Incremental ASN.1 Sequence Updates to Genbank" files ( ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the ASN2GB converter ( ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert resulting sequences to a format parsable by BioJava(X) ( http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where my problems start. ISSUE 1: I've tried to parse all of the formats that ASN2GB outputs ( GenBank (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank format is recognized by the "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with some exceptions that I'll describe in issue #2. This is the code that I'm using to parse, for example, the EMBL output: BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); Namespace gbNspace = (Namespace) RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"gbSpace"} ); RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); while (gbSeqs.hasNext()) { try { RichSequence rs = gbSeqs.nextRichSequence(); // Further processing or RichSequence object from here } catch (BioException be){ be.printStackTrace(); } } The multi-sequence EMBL file looks like this: --------------------------------------------------------------------------------- ID DQ472184 standard; DNA; INV; 546 BP. XX AC DQ472184; XX SV DQ472184.1 DT 15-MAY-2006 XX DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, DE complete cds. XX KW . XX OS Trypanosoma cruzi strain CL Brener OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; OC Schizotrypanum. XX RN [1] RP 1-546 RA De Melo L.D.B.; RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; RL Unpublished. XX RN [2] RP 1-546 RA De Melo L.D.B.; RT ; RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ RL 21949-900, Brazil XX FH Key Location/Qualifiers FH FT source 1..546 FT /organism="Trypanosoma cruzi strain CL Brener" FT /mol_type="genomic DNA" FT /strain="CL Brener" FT /db_xref="taxon:353153" FT gene <1..>546 FT /gene="ARC21" FT /note="TcARC21" FT mRNA <1..>546 FT /gene="ARC21" FT /product="actin-related protein 3" FT CDS 1..546 FT /gene="ARC21" FT /note="actin-binding protein; ARPC3 21 kDa; putative FT member of Arp2/3 complex" FT /codon_start=1 FT /product="actin-related protein 3" FT /protein_id="ABF13401.1" FT /db_xref="GI:93360014" FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL FT FPEKDGTGNKFWMAFAKRPFLASS" atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 agttag 546 // ID DQ472185 standard; DNA; INV; 543 BP. XX AC DQ472185; XX SV DQ472185.1 DT 15-MAY-2006 XX DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, DE complete cds. XX KW . XX OS Trypanosoma cruzi strain CL Brener OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; OC Schizotrypanum. 
XX RN [1] RP 1-543 RA De Melo L.D.B.; RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; RL Unpublished. XX RN [2] RP 1-543 RA De Melo L.D.B.; RT ; RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ RL 21949-900, Brazil XX FH Key Location/Qualifiers FH FT source 1..543 FT /organism="Trypanosoma cruzi strain CL Brener" FT /mol_type="genomic DNA" FT /strain="CL Brener" FT /db_xref="taxon:353153" FT gene <1..>543 FT /gene="ARC20" FT /note="TcARC20" FT mRNA <1..>543 FT /gene="ARC20" FT /product="actin-related protein 4" FT CDS 1..543 FT /gene="ARC20" FT /note="actin-binding protein; ARPC4 20 kDa; putative FT member of Arp2/3 complex" FT /codon_start=1 FT /product="actin-related protein 4" FT /protein_id="ABF13402.1" FT /db_xref="GI:93360016" FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA FT MKLNVNQRARRAAMEFFLALNFT" atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 tga 543 // ----------------------------------------------------------------------- I get an exception message "Could Not Read Sequence". 
Same thing happens if I use the readINSDSetDNA reader instead of readEMBLDNA one with the following INSDset file (beginning of the file): DQ022078 16729 DNA linear ENV

15-MAY-2006 DQ022078 gb|DQ022078.1| gi|71842722 15-MAY-2006 DQ022078 gb|DQ022078.1| gi|71842722
ENV ? 1..16729 Schmeisser,C. Elend,C. Streit,W.R. Isolation and biochemical characterization of two novel metagenome derived esterases Appl. Environ. Microbiol. 0:0-0 (2006) ? 1..16729 Schmeisser,C. Elend,C. Streit,W.R. Submitted (29-APR-2005) to the EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, Germany So my question is wether the ASN2GB produces output that's incompatible with BioJava parsers or is there a problem with the sequence themselves or the problems with the majority of parsers??? Could it be that I'm using the API wrongly for the above formats, although GenBank parser works as advertised with some exceptions below: ISSUE #2: When I try to parse GenBank files using the following code: BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); Namespace gbNspace = (Namespace) RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"gbSpace"} ); RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); while (gbSeqs.hasNext()) { try { RichSequence rs = gbSeqs.nextRichSequence(); // Further processing or RichSequence object from here } catch (BioException be){ be.printStackTrace(); } } Genbank file in question: LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 IMAGE:30915482), complete cds. ACCESSION BC074905 VERSION BC074905.2 GI:50959825 KEYWORDS MGC. SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo. 
REFERENCE 1 (bases 1 to 838) AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. CONSRTM Mammalian Gene Collection Program Team TITLE Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) PUBMED 12477932 REFERENCE 2 (bases 1 to 838) CONSRTM NIH MGC Project TITLE Direct Submission JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian Gene Collection (MGC), Bethesda, MD 20892-2590, USA REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. Contact: MGC help desk Email: cgapbs-r at mail.nih.gov Tissue Procurement: Genome Sequence Centre, British Columbia Cancer Center cDNA Library Preparation: British Columbia Cancer Research Center cDNA Library Arrayed by: The I.M.A.G.E. 
Consortium (LLNL) DNA Sequencing by: Genome Sequence Centre, BC Cancer Agency, Vancouver, BC, Canada info at bcgsc.bc.ca Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. Clone distribution: MGC clone distribution information can be found through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov Series: IRBU Plate: 4 Row: C Column: 3. Differences found between this sequence and the human reference genome (build 36) are described in misc_difference features below. FEATURES Location/Qualifiers source 1..838 /organism="Homo sapiens" /mol_type="mRNA" /db_xref="taxon:9606" /clone="MGC:104038 IMAGE:30915482" /tissue_type="Lung, PCR rescued clones" /clone_lib="NIH_MGC_273" /lab_host="DH10B" /note="Vector: pCR4 Topo TA with reversed insert" gene 1..838 /gene="KLK14" /note="synonym: KLK-L6" /db_xref="GeneID:43847" /db_xref="HGNC:6362" /db_xref="IMGT/GENE-DB:6362" /db_xref="MIM:606135" CDS 49..804 /gene="KLK14" /codon_start=1 /product="KLK14 protein" /protein_id="AAH74905.1" /db_xref="GI:50959826" /db_xref="GeneID:43847" /db_xref="HGNC:6362" /db_xref="IMGT/GENE-DB:6362" /db_xref="MIM:606135" /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" misc_difference 98 /gene="KLK14" /note="'G' in cDNA is 'A' in the human genome; amino acid 
difference: 'R' in cDNA, 'Q' in the human genome." misc_difference 133 /gene="KLK14" /note="'T' in cDNA is 'C' in the human genome; amino acid difference: 'Y' in cDNA, 'H' in the human genome." ORIGIN 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc // I get the following exception: java.lang.IllegalArgumentException: Authors string cannot be null org.biojava.bio.BioException: Could not read sequence at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) Caused by: java.lang.IllegalArgumentException: Authors string cannot be null at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) 
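A note on the trace above: REFERENCE 2 in this record carries a CONSRTM line but no AUTHORS line, so a reader that only collects AUTHORS could plausibly end up handing a null string to the author parser. The sketch below is plain Java and is not BioJava code. The class and method names (ReferenceAuthors, mergeAuthors) and the " (consortium)" suffix are assumptions made purely for illustration. It shows one null-safe way to combine the two tags of a single REFERENCE block.

```java
import java.util.ArrayList;
import java.util.List;

public class ReferenceAuthors {

    /**
     * Combines the AUTHORS and CONSRTM values of one REFERENCE block.
     * Either argument may be null or blank when the tag is absent.
     * Returns null only when both are absent, so the caller can decide
     * whether to skip author parsing for that reference.
     */
    public static String mergeAuthors(String authors, String consortium) {
        List<String> parts = new ArrayList<>();
        if (authors != null && !authors.trim().isEmpty()) {
            parts.add(authors.trim());
        }
        if (consortium != null && !consortium.trim().isEmpty()) {
            // Hypothetical convention: mark the consortium so it can be
            // told apart from personal names later on.
            parts.add(consortium.trim() + " (consortium)");
        }
        return parts.isEmpty() ? null : String.join(", ", parts);
    }

    public static void main(String[] args) {
        // REFERENCE 2 of BC074905: CONSRTM only, no AUTHORS.
        System.out.println(mergeAuthors(null, "NIH MGC Project"));
        // REFERENCE 1 (author list shortened here): both tags present.
        System.out.println(mergeAuthors("Jones,S.J. and Marra,M.A.",
                "Mammalian Gene Collection Program Team"));
    }
}
```

A consortium name that itself contains commas would still be ambiguous in the merged string, so this is only a starting point.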
-----------------------------------------------------------------------

I'm trying to see what could be the problem with this particular sequence. Looks to me like the AUTHORS portion is not getting parsed correctly. Any ideas would be greatly appreciated!

--
Best Regards,

Seth Johnson
Senior Bioinformatics Associate
Ph: (202) 470-0900
Fx: (775) 251-0358
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From andreas.draeger at uni-tuebingen.de Fri Jun 2 05:57:22 2006
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Fri, 02 Jun 2006 07:57:22 +0200
Subject: [Biojava-l] Error loading ontology terms
In-Reply-To: <1149175573.3948.78.camel@texas.ebi.ac.uk>
References: <1149175573.3948.78.camel@texas.ebi.ac.uk>
Message-ID: <447FD342.4090806@uni-tuebingen.de>

Hello,

You can solve this problem just by renaming the column "synonym" in table "term_synonym" to "name". The reason for changing the name of this column is that in some database systems the term "synonym" is a reserved word. So the older version that you are currently using might cause problems with some database systems. Once you have renamed this column, BioJava will work fine.

Andreas Dräger

> java.sql.SQLException: Unknown column 'name' in 'field list'
>
>

--
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From richard.holland at ebi.ac.uk Fri Jun 2 09:01:39 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Fri, 02 Jun 2006 10:01:39 +0100
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To:
References:
Message-ID: <1149238900.3948.87.camel@texas.ebi.ac.uk>

Hi Seth.
Your second point, about the authors string not being read correctly in Genbank format, has been fixed (or should have been if I got the code right!). Could you check the latest version of biojava-live out of CVS and give it another go? Basically the parser did not recognise the CONSRTM tag, as it is not mentioned in the sample record provided by NCBI, which is what I based the parser on.

I've set it up now so that it reads the CONSRTM tag, but the value is merged with the authors tag with (consortium) appended. There will still be problems if the consortium value has commas in it - not sure how to fix this yet.

Your first point is harder to solve because you did not provide a complete stack trace for the exceptions you are getting. The complete stack trace would enable me to identify exactly where things are going wrong and give me a better chance of fixing them. Could you send the stack trace, and I'll see what I can do.

cheers,
Richard

On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote:
> Hi All,
>
> I'm a newbie to the whole BioJava(X) API and was hoping to get some
> clarification on several issues that I'm having.
> I am developing a parser that would take as input "NCBI Incremental
> ASN.1 Sequence Updates to Genbank" files (
> ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the
> ASN2GB converter (
> ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
> resulting sequences to a format parsable by BioJava(X) (
> http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
> my problems start.
>
> ISSUE 1:
> I've tried to parse all of the formats that ASN2GB outputs ( GenBank
> (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
> tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank
> format is recognized by the
> "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
> some exceptions that I'll describe in issue #2.
This is the code that > I'm using to parse, for example, the EMBL output: > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > Namespace gbNspace = (Namespace) > RichObjectFactory.getObject(SimpleNamespace.class, new > Object[]{"gbSpace"} ); > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > while (gbSeqs.hasNext()) { > try { > RichSequence rs = gbSeqs.nextRichSequence(); > // Further processing or RichSequence object from here > > } catch (BioException be){ > be.printStackTrace(); > } > } > > The multi-sequence EMBL file looks like this: > --------------------------------------------------------------------------------- > ID DQ472184 standard; DNA; INV; 546 BP. > XX > AC DQ472184; > XX > SV DQ472184.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > DE complete cds. > XX > KW . > XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum. > XX > RN [1] > RP 1-546 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-546 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..546 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>546 > FT /gene="ARC21" > FT /note="TcARC21" > FT mRNA <1..>546 > FT /gene="ARC21" > FT /product="actin-related protein 3" > FT CDS 1..546 > FT /gene="ARC21" > FT /note="actin-binding protein; ARPC3 21 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 3" > FT /protein_id="ABF13401.1" > FT /db_xref="GI:93360014" > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > FT FPEKDGTGNKFWMAFAKRPFLASS" > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > agttag 546 > // > ID DQ472185 standard; DNA; INV; 543 BP. > XX > AC DQ472185; > XX > SV DQ472185.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > DE complete cds. > XX > KW . > XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum. 
> XX > RN [1] > RP 1-543 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-543 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..543 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>543 > FT /gene="ARC20" > FT /note="TcARC20" > FT mRNA <1..>543 > FT /gene="ARC20" > FT /product="actin-related protein 4" > FT CDS 1..543 > FT /gene="ARC20" > FT /note="actin-binding protein; ARPC4 20 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 4" > FT /protein_id="ABF13402.1" > FT /db_xref="GI:93360016" > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > FT MKLNVNQRARRAAMEFFLALNFT" > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > tga 543 > // > ----------------------------------------------------------------------- > I get an exception message "Could Not 
Read Sequence". Same thing > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > with the following INSDset file (beginning of the file): > > > > > DQ022078 > 16729 > DNA > linear > ENV > 15-MAY-2006 > 15-MAY-2006 > Uncultured bacterium WWRS-2005 putative > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > class C (estA3), putative permease (a3.005), putative transmembrane > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > protein (a3.012), putative membrane protease subunit (a3.013), > putative haloalkane dehalogenase (a3.014), putative transcriptional > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > hypothetical protein (a3.017) genes, complete cds > DQ022078 > > gb|DQ022078.1| > gi|71842722 > > > ENV > > > > ? > 1..16729 > > Schmeisser,C. > Elend,C. > Streit,W.R. > > Isolation and biochemical characterization > of two novel metagenome derived esterases > Appl. Environ. Microbiol. 0:0-0 > (2006) > > > ? > 1..16729 > > Schmeisser,C. > Elend,C. > Streit,W.R. > > Submitted (29-APR-2005) to the > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > Germany > > > > So my question is wether the ASN2GB produces output that's > incompatible with BioJava parsers or is there a problem with the > sequence themselves or the problems with the majority of parsers??? 
> Could it be that I'm using the API wrongly for the above formats, > although GenBank parser works as advertised with some exceptions > below: > > ISSUE #2: > When I try to parse GenBank files using the following code: > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > Namespace gbNspace = (Namespace) > RichObjectFactory.getObject(SimpleNamespace.class, new > Object[]{"gbSpace"} ); > RichSequenceIterator gbSeqs = > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > while (gbSeqs.hasNext()) { > try { > RichSequence rs = gbSeqs.nextRichSequence(); > // Further processing or RichSequence object from here > > } catch (BioException be){ > be.printStackTrace(); > } > } > > Genbank file in question: > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > IMAGE:30915482), complete cds. > ACCESSION BC074905 > VERSION BC074905.2 GI:50959825 > KEYWORDS MGC. > SOURCE Homo sapiens (human) > ORGANISM Homo sapiens > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > Catarrhini; Hominidae; Homo. 
> REFERENCE 1 (bases 1 to 838) > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > CONSRTM Mammalian Gene Collection Program Team > TITLE Generation and initial analysis of more than 15,000 full-length > human and mouse cDNA sequences > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > PUBMED 12477932 > REFERENCE 2 (bases 1 to 838) > CONSRTM NIH MGC Project > TITLE Direct Submission > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. > Contact: MGC help desk > Email: cgapbs-r at mail.nih.gov > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > Center > cDNA Library Preparation: British Columbia Cancer Research Center > cDNA Library Arrayed by: The I.M.A.G.E. 
Consortium (LLNL) > DNA Sequencing by: Genome Sequence Centre, > BC Cancer Agency, Vancouver, BC, Canada > info at bcgsc.bc.ca > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > Clone distribution: MGC clone distribution information can be found > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > Series: IRBU Plate: 4 Row: C Column: 3. > > Differences found between this sequence and the human reference > genome (build 36) are described in misc_difference features below. 
> FEATURES Location/Qualifiers > source 1..838 > /organism="Homo sapiens" > /mol_type="mRNA" > /db_xref="taxon:9606" > /clone="MGC:104038 IMAGE:30915482" > /tissue_type="Lung, PCR rescued clones" > /clone_lib="NIH_MGC_273" > /lab_host="DH10B" > /note="Vector: pCR4 Topo TA with reversed insert" > gene 1..838 > /gene="KLK14" > /note="synonym: KLK-L6" > /db_xref="GeneID:43847" > /db_xref="HGNC:6362" > /db_xref="IMGT/GENE-DB:6362" > /db_xref="MIM:606135" > CDS 49..804 > /gene="KLK14" > /codon_start=1 > /product="KLK14 protein" > /protein_id="AAH74905.1" > /db_xref="GI:50959826" > /db_xref="GeneID:43847" > /db_xref="HGNC:6362" > /db_xref="IMGT/GENE-DB:6362" > /db_xref="MIM:606135" > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > misc_difference 98 > /gene="KLK14" > /note="'G' in cDNA is 'A' in the human genome; amino acid > difference: 'R' in cDNA, 'Q' in the human genome." > misc_difference 133 > /gene="KLK14" > /note="'T' in cDNA is 'C' in the human genome; amino acid > difference: 'Y' in cDNA, 'H' in the human genome." 
> ORIGIN > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > // > > I get the following exception: > > java.lang.IllegalArgumentException: Authors string cannot be null > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ----------------------------------------------------------------------- > > I'm trying to see what could be the problem with this particular > sequence. 
Looks to me like the AUTHORS portion is not getting parsed
> correctly. Any ideas would be greatly appreciated!
>

--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From johnson.biotech at gmail.com Fri Jun 2 17:04:59 2006
From: johnson.biotech at gmail.com (Seth Johnson)
Date: Fri, 2 Jun 2006 13:04:59 -0400
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To: <1149238900.3948.87.camel@texas.ebi.ac.uk>
References: <1149238900.3948.87.camel@texas.ebi.ac.uk>
Message-ID:

Hi Richard,

I made sure I have the latest source code from CVS compiled (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy to report that the GenBank issue is solved!

As for EMBL parsing, I apologize for not providing the stack dump for ISSUE #1. Here's the dump of the exception:
--------------------------------------------------------
org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
    at exonhit.parsers.GenBankParser.main(GenBankParser.java:359)
Caused by: java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Integer.java:415)
    at java.lang.Integer.parseInt(Integer.java:497)
    at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
    ... 1 more
Java Result: -1
-------------------------------------------------------
Here, again, is the code that I'm using to parse:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BufferedReader gbBR = null;
try {
    gbBR = new BufferedReader(new FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb"));
} catch (FileNotFoundException fnfe) {
    fnfe.printStackTrace();
    System.exit(-1);
}
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"gbSpace"} );
RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        NCBITaxon myTaxon = rs.getTaxon();
    } catch (BioException be) {
        be.printStackTrace();
        System.exit(-1);
    }
}
~~~~~~~~~~~~~~~~~~~~~~~~~
And here's the EMBL file that I'm trying to parse:
+++++++++++++++++++++++++
ID DQ472184 standard; DNA; INV; 546 BP.
XX
AC DQ472184;
XX
SV DQ472184.1
DT 15-MAY-2006
XX
DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
DE complete cds.
XX
KW .
XX
OS Trypanosoma cruzi strain CL Brener
OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC Schizotrypanum.
XX
RN [1]
RP 1-546
RA De Melo L.D.B.;
RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL Unpublished.
XX
RN [2]
RP 1-546
RA De Melo L.D.B.;
RT ;
RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ RL 21949-900, Brazil XX FH Key Location/Qualifiers FH FT source 1..546 FT /organism="Trypanosoma cruzi strain CL Brener" FT /mol_type="genomic DNA" FT /strain="CL Brener" FT /db_xref="taxon:353153" FT gene <1..>546 FT /gene="ARC21" FT /note="TcARC21" FT mRNA <1..>546 FT /gene="ARC21" FT /product="actin-related protein 3" FT CDS 1..546 FT /gene="ARC21" FT /note="actin-binding protein; ARPC3 21 kDa; putative FT member of Arp2/3 complex" FT /codon_start=1 FT /product="actin-related protein 3" FT /protein_id="ABF13401.1" FT /db_xref="GI:93360014" FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL FT FPEKDGTGNKFWMAFAKRPFLASS" atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 agttag 546 // ID DQ472185 standard; DNA; INV; 543 BP. XX AC DQ472185; XX SV DQ472185.1 DT 15-MAY-2006 XX DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, DE complete cds. XX KW . XX OS Trypanosoma cruzi strain CL Brener OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; OC Schizotrypanum. 
XX RN [1] RP 1-543 RA De Melo L.D.B.; RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; RL Unpublished. XX RN [2] RP 1-543 RA De Melo L.D.B.; RT ; RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ RL 21949-900, Brazil XX FH Key Location/Qualifiers FH FT source 1..543 FT /organism="Trypanosoma cruzi strain CL Brener" FT /mol_type="genomic DNA" FT /strain="CL Brener" FT /db_xref="taxon:353153" FT gene <1..>543 FT /gene="ARC20" FT /note="TcARC20" FT mRNA <1..>543 FT /gene="ARC20" FT /product="actin-related protein 4" FT CDS 1..543 FT /gene="ARC20" FT /note="actin-binding protein; ARPC4 20 kDa; putative FT member of Arp2/3 complex" FT /codon_start=1 FT /product="actin-related protein 4" FT /protein_id="ABF13402.1" FT /db_xref="GI:93360016" FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA FT MKLNVNQRARRAAMEFFLALNFT" atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 tga 543 // +++++++++++++++++++++++++++++++++ It looks to me like there's some kind of problem with parsing the sequence version number. I even tried the sequence from test directory (AY069118.em) with same outcome. 
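To illustrate the suspicion above: a "NumberFormatException: null" raised inside Integer.parseInt is consistent with the parser passing a null string, for example when a pattern match on a header line yields no version field. The sketch below is plain Java and is not the EMBLFormat code. The class name, the regular expression, and the -1 fallback are assumptions for illustration only. It reads the version from an SV line such as the one in the file above without ever calling parseInt on a missing value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmblVersionLine {

    // Hypothetical pattern for "SV <accession>.<version>"; the real
    // parser may tokenize the line quite differently.
    private static final Pattern SV_LINE =
            Pattern.compile("^SV\\s+(\\S+?)\\.(\\d{1,9})\\s*$");

    /**
     * Returns the numeric version from an SV line, or -1 when the line
     * is null or has no numeric version, instead of letting
     * Integer.parseInt fail on a missing value.
     */
    public static int parseVersion(String line) {
        if (line == null) {
            return -1;
        }
        Matcher m = SV_LINE.matcher(line);
        return m.matches() ? Integer.parseInt(m.group(2)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(parseVersion("SV DQ472184.1")); // version present
        System.out.println(parseVersion("SV DQ472184"));   // no version field
    }
}
```

Returning a sentinel keeps the rest of the record readable. Whether a missing version should instead be reported as an error is a design choice for the parser's maintainers.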
Regards, Seth On 6/2/06, Richard Holland wrote: > Hi Seth. > > Your second point, about the authors string not being read correctly in > Genbank format, has been fixed (or should have been if I got the code > right!). Could you check the latest version of biojava-live out of CVS > and give it another go? Basically the parser did not recognise the > CONSRTM tag, as it is not mentioned in the sample record provided by > NCBI, which is what I based the parser on. > > I've set it up now so that it reads the CONSRTM tag, but the value is > merged with the authors tag with (consortium) appended. There will still > be problems if the consortium value has commas in it - not sure how to > fix this yet. > > Your first point is harder to solve because you did not provide a > complete stack trace for the exceptions you are getting. The complete > stack trace would enable me to identify exactly where things are going > wrong and give me a better chance of fixing them. Could you send the > stack trace, and I'll see what I can do. > > cheers, > Richard > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote: > > Hi All, > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > > clarification on several issues that I'm having. > > I am developing a parser that would take as input "NCBI Incremental > > ASN.1 Sequence Updates to Genbank" files ( > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > ASN2GB converter ( > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > resulting sequences to a format parsable by BioJava(X) ( > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > my problems start. > > > > ISSUE 1: > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > tiny seq (XML) ) using either BioJava or BioJavaX API. 
Only GenBank > > format is recognized by the > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > some exceptions that I'll describe in issue #2. This is the code that > > I'm using to parse, for example, the EMBL output: > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > Namespace gbNspace = (Namespace) > > RichObjectFactory.getObject(SimpleNamespace.class, new > > Object[]{"gbSpace"} ); > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > while (gbSeqs.hasNext()) { > > try { > > RichSequence rs = gbSeqs.nextRichSequence(); > > // Further processing or RichSequence object from here > > > > } catch (BioException be){ > > be.printStackTrace(); > > } > > } > > > > The multi-sequence EMBL file looks like this: > > --------------------------------------------------------------------------------- > > ID DQ472184 standard; DNA; INV; 546 BP. > > XX > > AC DQ472184; > > XX > > SV DQ472184.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > DE complete cds. > > XX > > KW . > > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-546 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. > > XX > > RN [2] > > RP 1-546 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..546 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>546 > > FT /gene="ARC21" > > FT /note="TcARC21" > > FT mRNA <1..>546 > > FT /gene="ARC21" > > FT /product="actin-related protein 3" > > FT CDS 1..546 > > FT /gene="ARC21" > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 3" > > FT /protein_id="ABF13401.1" > > FT /db_xref="GI:93360014" > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > agttag 546 > > // > > ID DQ472185 standard; DNA; INV; 543 BP. > > XX > > AC DQ472185; > > XX > > SV DQ472185.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > DE complete cds. > > XX > > KW . 
> > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-543 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. > > XX > > RN [2] > > RP 1-543 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..543 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>543 > > FT /gene="ARC20" > > FT /note="TcARC20" > > FT mRNA <1..>543 > > FT /gene="ARC20" > > FT /product="actin-related protein 4" > > FT CDS 1..543 > > FT /gene="ARC20" > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 4" > > FT /protein_id="ABF13402.1" > > FT /db_xref="GI:93360016" > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > FT MKLNVNQRARRAAMEFFLALNFT" > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > attcaattta taattacttt 
cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > tga 543 > > // > > ----------------------------------------------------------------------- > > I get an exception message "Could Not Read Sequence". Same thing > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > with the following INSDset file (beginning of the file): > > > > > > > > > > DQ022078 > > 16729 > > DNA > > linear > > ENV > > 15-MAY-2006 > > 15-MAY-2006 > > Uncultured bacterium WWRS-2005 putative > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > class C (estA3), putative permease (a3.005), putative transmembrane > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > protein (a3.012), putative membrane protease subunit (a3.013), > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > hypothetical protein (a3.017) genes, complete cds > > DQ022078 > > > > gb|DQ022078.1| > > gi|71842722 > > > > > > ENV > > > > > > > > ? > > 1..16729 > > > > Schmeisser,C. > > Elend,C. > > Streit,W.R. > > > > Isolation and biochemical characterization > > of two novel metagenome derived esterases > > Appl. Environ. Microbiol. 0:0-0 > > (2006) > > > > > > ? > > 1..16729 > > > > Schmeisser,C. > > Elend,C. > > Streit,W.R. > > > > Submitted (29-APR-2005) to the > > EMBL/GenBank/DDBJ databases. 
Molekulare Enzymtechnologie, University > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > Germany > > > > > > > > So my question is whether ASN2GB produces output that is > > incompatible with the BioJava parsers, whether there is a problem with the > > sequences themselves, or whether the problem lies with the majority of the parsers. > > Could it be that I'm using the API wrongly for the above formats? > > The GenBank parser works as advertised, with the exceptions described > > below: > > > > ISSUE #2: > > When I try to parse GenBank files using the following code: > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > Namespace gbNspace = (Namespace) > > RichObjectFactory.getObject(SimpleNamespace.class, new > > Object[]{"gbSpace"} ); > > RichSequenceIterator gbSeqs = > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > while (gbSeqs.hasNext()) { > > try { > > RichSequence rs = gbSeqs.nextRichSequence(); > > // Further processing of the RichSequence object from here > > > > } catch (BioException be){ > > be.printStackTrace(); > > } > > } > > > > Genbank file in question: > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > IMAGE:30915482), complete cds. > > ACCESSION BC074905 > > VERSION BC074905.2 GI:50959825 > > KEYWORDS MGC. > > SOURCE Homo sapiens (human) > > ORGANISM Homo sapiens > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > Catarrhini; Hominidae; Homo.
> > REFERENCE 1 (bases 1 to 838) > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > CONSRTM Mammalian Gene Collection Program Team > > TITLE Generation and initial analysis of more than 15,000 full-length > > human and mouse cDNA sequences > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > > PUBMED 12477932 > > REFERENCE 2 (bases 1 to 838) > > CONSRTM NIH MGC Project > > TITLE Direct Submission > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. 
> > Contact: MGC help desk > > Email: cgapbs-r at mail.nih.gov > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > Center > > cDNA Library Preparation: British Columbia Cancer Research Center > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > DNA Sequencing by: Genome Sequence Centre, > > BC Cancer Agency, Vancouver, BC, Canada > > info at bcgsc.bc.ca > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > Clone distribution: MGC clone distribution information can be found > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > Differences found between this sequence and the human reference > > genome (build 36) are described in misc_difference features below. 
> > FEATURES Location/Qualifiers > > source 1..838 > > /organism="Homo sapiens" > > /mol_type="mRNA" > > /db_xref="taxon:9606" > > /clone="MGC:104038 IMAGE:30915482" > > /tissue_type="Lung, PCR rescued clones" > > /clone_lib="NIH_MGC_273" > > /lab_host="DH10B" > > /note="Vector: pCR4 Topo TA with reversed insert" > > gene 1..838 > > /gene="KLK14" > > /note="synonym: KLK-L6" > > /db_xref="GeneID:43847" > > /db_xref="HGNC:6362" > > /db_xref="IMGT/GENE-DB:6362" > > /db_xref="MIM:606135" > > CDS 49..804 > > /gene="KLK14" > > /codon_start=1 > > /product="KLK14 protein" > > /protein_id="AAH74905.1" > > /db_xref="GI:50959826" > > /db_xref="GeneID:43847" > > /db_xref="HGNC:6362" > > /db_xref="IMGT/GENE-DB:6362" > > /db_xref="MIM:606135" > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > misc_difference 98 > > /gene="KLK14" > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > difference: 'R' in cDNA, 'Q' in the human genome." > > misc_difference 133 > > /gene="KLK14" > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > ORIGIN > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > // > > > > I get the following exception: > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > org.biojava.bio.BioException: Could not read sequence > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > ----------------------------------------------------------------------- > > > > I'm trying to see what could be the problem with 
this particular > > sequence. Looks to me like the AUTHORS portion is not getting parsed > > correctly. Any ideas would be greatly appreciated! > > > -- > Richard Holland (BioMart Team) > EMBL-EBI > Wellcome Trust Genome Campus > Hinxton > Cambridge CB10 1SD > UNITED KINGDOM > Tel: +44-(0)1223-494416 > > -- Best Regards, Seth Johnson Senior Bioinformatics Associate Ph: (202) 470-0900 Fx: (775) 251-0358 From johnson.biotech at gmail.com Fri Jun 2 18:46:26 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Fri, 2 Jun 2006 14:46:26 -0400 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: References: Message-ID: Hi Mark, Thank you for your suggestions. I followed them and seem to have found a bug that caused an exception in the readINSDseqDNA parser. http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=94481355 The problem in the above sequence in INSDseq format was caused by the presence of qualifier name tags without the corresponding qualifier value tags: environmental_sample I have not checked whether it is handled correctly by other parsers when it is converted from the original NCBI ASN.1 format. Could the code be adjusted so that if there is no value tag it assumes the value to be 'null'? Regards, Seth On 6/1/06, mark.schreiber at novartis.com wrote: > Hi Seth - > > The BioJavaX parsers are still quite new and have not been heavily tested > so your experiences can help us quite a lot. The parsers were initially > designed to be quite strict and follow the GenBank etc. specifications. > However, there are often minor variations to those specs which cause > things to break. > > To help us find the bugs, can you make sure you are using the very latest > version of biojava from CVS? For example, I was under the impression that > the author = null problem had been solved. In each case an example file > and the full stack trace is very useful as well.
In some cases you have > provided these so we have a starting point. > > Also, if you have ideas on ways to fix the problems your suggestions would > be greatly appreciated. We only have a very small team of active > developers many of whom are unfortunately very busy just now. > > Hopefully we can get to this soon. > > - Mark > > > > > > "Seth Johnson" > Sent by: biojava-l-bounces at lists.open-bio.org > 06/02/2006 06:03 AM > > > To: biojava-l at lists.open-bio.org > cc: (bcc: Mark Schreiber/GP/Novartis) > Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 > daily update files > > > Hi All, > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > clarification on several issues that I'm having. > I am developing a parser that would take as input "NCBI Incremental > ASN.1 Sequence Updates to Genbank" files ( > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > ASN2GB converter ( > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > resulting sequences to a format parsable by BioJava(X) ( > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > my problems start. > > ISSUE 1: > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > format is recognized by the > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > some exceptions that I'll describe in issue #2. 
This is the code that > I'm using to parse, for example, the EMBL output: > > BufferedReader inBuf = new BufferedReader(new > FileReader("embl_output.emb")); > Namespace gbNspace = (Namespace) > RichObjectFactory.getObject(SimpleNamespace.class, new > Object[]{"gbSpace"} ); > RichSequenceIterator gbSeqs = > RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > while (gbSeqs.hasNext()) { > try { > RichSequence rs = gbSeqs.nextRichSequence(); > // Further processing or RichSequence object from here > > } catch (BioException be){ > be.printStackTrace(); > } > } > > The multi-sequence EMBL file looks like this: > --------------------------------------------------------------------------------- > ID DQ472184 standard; DNA; INV; 546 BP. > XX > AC DQ472184; > XX > SV DQ472184.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) > gene, > DE complete cds. > XX > KW . > XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum. > XX > RN [1] > RP 1-546 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-546 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do > Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, > RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..546 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>546 > FT /gene="ARC21" > FT /note="TcARC21" > FT mRNA <1..>546 > FT /gene="ARC21" > FT /product="actin-related protein 3" > FT CDS 1..546 > FT /gene="ARC21" > FT /note="actin-binding protein; ARPC3 21 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 3" > FT /protein_id="ABF13401.1" > FT /db_xref="GI:93360014" > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > FT FPEKDGTGNKFWMAFAKRPFLASS" > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt > 120 > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc > 180 > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg > 240 > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat > 300 > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg > 360 > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca > 420 > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag > 480 > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct > 540 > agttag 546 > // > ID DQ472185 standard; DNA; INV; 543 BP. > XX > AC DQ472185; > XX > SV DQ472185.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) > gene, > DE complete cds. > XX > KW . 
> XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum. > XX > RN [1] > RP 1-543 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-543 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do > Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, > RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..543 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>543 > FT /gene="ARC20" > FT /note="TcARC20" > FT mRNA <1..>543 > FT /gene="ARC20" > FT /product="actin-related protein 4" > FT CDS 1..543 > FT /gene="ARC20" > FT /note="actin-binding protein; ARPC4 20 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 4" > FT /protein_id="ABF13402.1" > FT /db_xref="GI:93360016" > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > FT MKLNVNQRARRAAMEFFLALNFT" > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt > 120 > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata > 180 > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc > 240 > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt > 300 > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga > 360 > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt > 420 > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg > 480 > aatgtgaatc aacgtgcacg tcgagcagcg 
atggaattct ttcttgcatt gaatttcaca > 540 > tga 543 > // > ----------------------------------------------------------------------- > I get an exception message "Could Not Read Sequence". Same thing > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > with the following INSDset file (beginning of the file): > > > > > DQ022078 > 16729 > DNA > linear > ENV > 15-MAY-2006 > 15-MAY-2006 > Uncultured bacterium WWRS-2005 putative > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > class C (estA3), putative permease (a3.005), putative transmembrane > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > protein (a3.012), putative membrane protease subunit (a3.013), > putative haloalkane dehalogenase (a3.014), putative transcriptional > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > hypothetical protein (a3.017) genes, complete cds > DQ022078 > > gb|DQ022078.1| > gi|71842722 > > > ENV > > > > ? > 1..16729 > > Schmeisser,C. > Elend,C. > Streit,W.R. > > Isolation and biochemical characterization > of two novel metagenome derived esterases > Appl. Environ. Microbiol. 0:0-0 > (2006) > > > ? > 1..16729 > > Schmeisser,C. > Elend,C. > Streit,W.R. > > Submitted (29-APR-2005) to the > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > Germany > > > > So my question is wether the ASN2GB produces output that's > incompatible with BioJava parsers or is there a problem with the > sequence themselves or the problems with the majority of parsers??? 
> Could it be that I'm using the API wrongly for the above formats, > although GenBank parser works as advertised with some exceptions > below: > > ISSUE #2: > When I try to parse GenBank files using the following code: > > BufferedReader inBuf = new BufferedReader(new > FileReader("genbank_output.gb")); > Namespace gbNspace = (Namespace) > RichObjectFactory.getObject(SimpleNamespace.class, new > Object[]{"gbSpace"} ); > RichSequenceIterator gbSeqs = > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > while (gbSeqs.hasNext()) { > try { > RichSequence rs = gbSeqs.nextRichSequence(); > // Further processing or RichSequence object from here > > } catch (BioException be){ > be.printStackTrace(); > } > } > > Genbank file in question: > > LOCUS BC074905 838 bp mRNA linear PRI > 15-APR-2006 > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > IMAGE:30915482), complete cds. > ACCESSION BC074905 > VERSION BC074905.2 GI:50959825 > KEYWORDS MGC. > SOURCE Homo sapiens (human) > ORGANISM Homo sapiens > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; > Euteleostomi; > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > Catarrhini; Hominidae; Homo. 
> REFERENCE 1 (bases 1 to 838) > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., > Schuler,G.D., > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., > Bhat,N.K., > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., > Hsieh,F., > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., > Peters,G.J., > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., > Myers,R.M., > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > CONSRTM Mammalian Gene Collection Program Team > TITLE Generation and initial analysis of more than 15,000 > full-length > human and mouse cDNA sequences > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > PUBMED 12477932 > REFERENCE 2 (bases 1 to 838) > CONSRTM NIH MGC Project > TITLE Direct Submission > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, > Mammalian > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. > Contact: MGC help desk > Email: cgapbs-r at mail.nih.gov > Tissue Procurement: Genome Sequence Centre, British Columbia > Cancer > Center > cDNA Library Preparation: British Columbia Cancer Research > Center > cDNA Library Arrayed by: The I.M.A.G.E. 
Consortium (LLNL) > DNA Sequencing by: Genome Sequence Centre, > BC Cancer Agency, Vancouver, BC, Canada > info at bcgsc.bc.ca > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, > Ruth > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy > Liao, > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco > Marra. > > Clone distribution: MGC clone distribution information can be > found > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > Series: IRBU Plate: 4 Row: C Column: 3. > > Differences found between this sequence and the human > reference > genome (build 36) are described in misc_difference features > below. 
> FEATURES Location/Qualifiers > source 1..838 > /organism="Homo sapiens" > /mol_type="mRNA" > /db_xref="taxon:9606" > /clone="MGC:104038 IMAGE:30915482" > /tissue_type="Lung, PCR rescued clones" > /clone_lib="NIH_MGC_273" > /lab_host="DH10B" > /note="Vector: pCR4 Topo TA with reversed insert" > gene 1..838 > /gene="KLK14" > /note="synonym: KLK-L6" > /db_xref="GeneID:43847" > /db_xref="HGNC:6362" > /db_xref="IMGT/GENE-DB:6362" > /db_xref="MIM:606135" > CDS 49..804 > /gene="KLK14" > /codon_start=1 > /product="KLK14 protein" > /protein_id="AAH74905.1" > /db_xref="GI:50959826" > /db_xref="GeneID:43847" > /db_xref="HGNC:6362" > /db_xref="IMGT/GENE-DB:6362" > /db_xref="MIM:606135" > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > misc_difference 98 > /gene="KLK14" > /note="'G' in cDNA is 'A' in the human genome; amino > acid > difference: 'R' in cDNA, 'Q' in the human genome." > misc_difference 133 > /gene="KLK14" > /note="'T' in cDNA is 'C' in the human genome; amino > acid > difference: 'Y' in cDNA, 'H' in the human genome." 
> ORIGIN > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat > gttcctcctg > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga > tgagaacaag > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc > cctgctggcg > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg > ggtcatcact > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa > cctgaggagg > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc > caactacaac > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc > acggatcggg > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac > ctcctgccga > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc > tctgcaatgc > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag > aaccatcacg > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca > gggtgactct > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg > aatggagcgc > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag > aagctggatt > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > // > > I get the following exception: > > java.lang.IllegalArgumentException: Authors string cannot be null > org.biojava.bio.BioException: Could not read sequence > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at > exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > at > exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > Caused by: java.lang.IllegalArgumentException: Authors string cannot be > null > at > org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > at > org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > at > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ----------------------------------------------------------------------- > > I'm trying to see what could be the problem with this particular > 
sequence. Looks to me like the AUTHORS portion is not getting parsed > correctly. Any ideas would be greatly appreciated! > > -- > Best Regards, > > > Seth Johnson > Senior Bioinformatics Associate > > Ph: (202) 470-0900 > Fx: (775) 251-0358 > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > -- Best Regards, Seth Johnson Senior Bioinformatics Associate Ph: (202) 470-0900 Fx: (775) 251-0358 From mark.schreiber at novartis.com Mon Jun 5 02:57:35 2006 From: mark.schreiber at novartis.com (mark.schreiber at novartis.com) Date: Mon, 5 Jun 2006 10:57:35 +0800 Subject: [Biojava-l] en.wikipedia.org/wiki/BioJava Message-ID: Hi all - This page looks pretty sad and sparse (http://en.wikipedia.org/wiki/BioJava). Does anyone feel like updating the information in it? - Mark From richard.holland at ebi.ac.uk Mon Jun 5 08:44:26 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Mon, 05 Jun 2006 09:44:26 +0100 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: References: <1149238900.3948.87.camel@texas.ebi.ac.uk> Message-ID: <1149497066.3947.12.camel@texas.ebi.ac.uk> This one should be fixed in CVS now. Typo on my part - I put in code to make it work with both 87+ and pre-87 versions of EMBL, then got the regexes the wrong way round! Could you send the full stack trace for the INSDseq format problem you're having? (The one where you say you've tracked it down to the qualifier value being missing). I can't see anything wrong there, so I need the stack trace in order to know which exact sequence of events is throwing the exception. cheers, Richard On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote: > Hi Richard, > > I made sure I have the latest source code from CVS compiled > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy > to report that the GenBank issue is solved!
> As for EMBL parsing, I apologize for not providing the stack dump > for ISSUE #1. Here's the dump of the exception: > -------------------------------------------------------- > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:359) > Caused by: java.lang.NumberFormatException: null > at java.lang.Integer.parseInt(Integer.java:415) > at java.lang.Integer.parseInt(Integer.java:497) > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > ... 1 more > Java Result: -1 > ------------------------------------------------------- > Here, again, is the code that I'm using to parse: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > BufferedReader gbBR = null; > try { > gbBR = new BufferedReader(new > FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb")); > } catch (FileNotFoundException fnfe) { > fnfe.printStackTrace(); > System.exit(-1); > } > Namespace gbNspace = (Namespace) > RichObjectFactory.getObject(SimpleNamespace.class, new > Object[]{"gbSpace"} ); > RichSequenceIterator gbSeqs = > RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace); > while (gbSeqs.hasNext()) { > try { > RichSequence rs = gbSeqs.nextRichSequence(); > NCBITaxon myTaxon = rs.getTaxon(); > }catch (BioException be){ > be.printStackTrace(); > System.exit(-1); > } > } > ~~~~~~~~~~~~~~~~~~~~~~~~~ > And here's the EMBL file that I'm trying to parse: > +++++++++++++++++++++++++ > ID DQ472184 standard; DNA; INV; 546 BP. > XX > AC DQ472184; > XX > SV DQ472184.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > DE complete cds. > XX > KW . > XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum.
> XX > RN [1] > RP 1-546 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-546 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..546 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>546 > FT /gene="ARC21" > FT /note="TcARC21" > FT mRNA <1..>546 > FT /gene="ARC21" > FT /product="actin-related protein 3" > FT CDS 1..546 > FT /gene="ARC21" > FT /note="actin-binding protein; ARPC3 21 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 3" > FT /protein_id="ABF13401.1" > FT /db_xref="GI:93360014" > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > FT FPEKDGTGNKFWMAFAKRPFLASS" > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > agttag 546 > // > ID DQ472185 standard; DNA; INV; 543 BP. 
> XX > AC DQ472185; > XX > SV DQ472185.1 > DT 15-MAY-2006 > XX > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > DE complete cds. > XX > KW . > XX > OS Trypanosoma cruzi strain CL Brener > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > OC Schizotrypanum. > XX > RN [1] > RP 1-543 > RA De Melo L.D.B.; > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > RL Unpublished. > XX > RN [2] > RP 1-543 > RA De Melo L.D.B.; > RT ; > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > RL 21949-900, Brazil > XX > FH Key Location/Qualifiers > FH > FT source 1..543 > FT /organism="Trypanosoma cruzi strain CL Brener" > FT /mol_type="genomic DNA" > FT /strain="CL Brener" > FT /db_xref="taxon:353153" > FT gene <1..>543 > FT /gene="ARC20" > FT /note="TcARC20" > FT mRNA <1..>543 > FT /gene="ARC20" > FT /product="actin-related protein 4" > FT CDS 1..543 > FT /gene="ARC20" > FT /note="actin-binding protein; ARPC4 20 kDa; putative > FT member of Arp2/3 complex" > FT /codon_start=1 > FT /product="actin-related protein 4" > FT /protein_id="ABF13402.1" > FT /db_xref="GI:93360016" > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > FT MKLNVNQRARRAAMEFFLALNFT" > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > tatgatataa gttttttgat ttctcacgag 
gaagtagaaa caatgcatag gaataggatt 420 > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > tga 543 > // > +++++++++++++++++++++++++++++++++ > > It looks to me like there's some kind of problem with parsing the > sequence version number. I even tried the sequence from test directory > (AY069118.em) with same outcome. > > Regards, > > Seth > > On 6/2/06, Richard Holland wrote: > > Hi Seth. > > > > Your second point, about the authors string not being read correctly in > > Genbank format, has been fixed (or should have been if I got the code > > right!). Could you check the latest version of biojava-live out of CVS > > and give it another go? Basically the parser did not recognise the > > CONSRTM tag, as it is not mentioned in the sample record provided by > > NCBI, which is what I based the parser on. > > > > I've set it up now so that it reads the CONSRTM tag, but the value is > > merged with the authors tag with (consortium) appended. There will still > > be problems if the consortium value has commas in it - not sure how to > > fix this yet. > > > > Your first point is harder to solve because you did not provide a > > complete stack trace for the exceptions you are getting. The complete > > stack trace would enable me to identify exactly where things are going > > wrong and give me a better chance of fixing them. Could you send the > > stack trace, and I'll see what I can do. > > > > cheers, > > Richard > > > > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote: > > > Hi All, > > > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > > > clarification on several issues that I'm having. 
> > > I am developing a parser that would take as input "NCBI Incremental > > > ASN.1 Sequence Updates to Genbank" files ( > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > > ASN2GB converter ( > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > > resulting sequences to a format parsable by BioJava(X) ( > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > > my problems start. > > > > > > ISSUE 1: > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > > > format is recognized by the > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > > some exceptions that I'll describe in issue #2. This is the code that > > > I'm using to parse, for example, the EMBL output: > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > > Namespace gbNspace = (Namespace) > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > Object[]{"gbSpace"} ); > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > > while (gbSeqs.hasNext()) { > > > try { > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > // Further processing or RichSequence object from here > > > > > > } catch (BioException be){ > > > be.printStackTrace(); > > > } > > > } > > > > > > The multi-sequence EMBL file looks like this: > > > --------------------------------------------------------------------------------- > > > ID DQ472184 standard; DNA; INV; 546 BP. > > > XX > > > AC DQ472184; > > > XX > > > SV DQ472184.1 > > > DT 15-MAY-2006 > > > XX > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > DE complete cds. > > > XX > > > KW . 
> > > XX > > > OS Trypanosoma cruzi strain CL Brener > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > OC Schizotrypanum. > > > XX > > > RN [1] > > > RP 1-546 > > > RA De Melo L.D.B.; > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > RL Unpublished. > > > XX > > > RN [2] > > > RP 1-546 > > > RA De Melo L.D.B.; > > > RT ; > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > RL 21949-900, Brazil > > > XX > > > FH Key Location/Qualifiers > > > FH > > > FT source 1..546 > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > FT /mol_type="genomic DNA" > > > FT /strain="CL Brener" > > > FT /db_xref="taxon:353153" > > > FT gene <1..>546 > > > FT /gene="ARC21" > > > FT /note="TcARC21" > > > FT mRNA <1..>546 > > > FT /gene="ARC21" > > > FT /product="actin-related protein 3" > > > FT CDS 1..546 > > > FT /gene="ARC21" > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > FT member of Arp2/3 complex" > > > FT /codon_start=1 > > > FT /product="actin-related protein 3" > > > FT /protein_id="ABF13401.1" > > > FT /db_xref="GI:93360014" > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > 
> > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > agttag 546 > > > // > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > XX > > > AC DQ472185; > > > XX > > > SV DQ472185.1 > > > DT 15-MAY-2006 > > > XX > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > DE complete cds. > > > XX > > > KW . > > > XX > > > OS Trypanosoma cruzi strain CL Brener > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > OC Schizotrypanum. > > > XX > > > RN [1] > > > RP 1-543 > > > RA De Melo L.D.B.; > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > RL Unpublished. > > > XX > > > RN [2] > > > RP 1-543 > > > RA De Melo L.D.B.; > > > RT ; > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > RL 21949-900, Brazil > > > XX > > > FH Key Location/Qualifiers > > > FH > > > FT source 1..543 > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > FT /mol_type="genomic DNA" > > > FT /strain="CL Brener" > > > FT /db_xref="taxon:353153" > > > FT gene <1..>543 > > > FT /gene="ARC20" > > > FT /note="TcARC20" > > > FT mRNA <1..>543 > > > FT /gene="ARC20" > > > FT /product="actin-related protein 4" > > > FT CDS 1..543 > > > FT /gene="ARC20" > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > FT member of Arp2/3 complex" > > > FT /codon_start=1 > > > FT /product="actin-related protein 4" > > > FT /protein_id="ABF13402.1" > > > FT /db_xref="GI:93360016" > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > FT 
GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > tga 543 > > > // > > > ----------------------------------------------------------------------- > > > I get an exception message "Could Not Read Sequence". Same thing > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > > with the following INSDset file (beginning of the file): > > > > > > > > > > > > > > > DQ022078 > > > 16729 > > > DNA > > > linear > > > ENV > > > 15-MAY-2006 > > > 15-MAY-2006 > > > Uncultured bacterium WWRS-2005 putative > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > > class C (estA3), putative permease (a3.005), putative transmembrane > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > > protein (a3.012), putative membrane protease subunit (a3.013), > > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > > hypothetical protein (a3.017) genes, complete cds > > > DQ022078 > > > > > > 
gb|DQ022078.1| > > > gi|71842722 > > > > > > > > > ENV > > > > > > > > > > > > ? > > > 1..16729 > > > > > > Schmeisser,C. > > > Elend,C. > > > Streit,W.R. > > > > > > Isolation and biochemical characterization > > > of two novel metagenome derived esterases > > > Appl. Environ. Microbiol. 0:0-0 > > > (2006) > > > > > > > > > ? > > > 1..16729 > > > > > > Schmeisser,C. > > > Elend,C. > > > Streit,W.R. > > > > > > Submitted (29-APR-2005) to the > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > > Germany > > > > > > > > > > > > So my question is wether the ASN2GB produces output that's > > > incompatible with BioJava parsers or is there a problem with the > > > sequence themselves or the problems with the majority of parsers??? > > > Could it be that I'm using the API wrongly for the above formats, > > > although GenBank parser works as advertised with some exceptions > > > below: > > > > > > ISSUE #2: > > > When I try to parse GenBank files using the following code: > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > > Namespace gbNspace = (Namespace) > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > Object[]{"gbSpace"} ); > > > RichSequenceIterator gbSeqs = > > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > > while (gbSeqs.hasNext()) { > > > try { > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > // Further processing or RichSequence object from here > > > > > > } catch (BioException be){ > > > be.printStackTrace(); > > > } > > > } > > > > > > Genbank file in question: > > > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > > IMAGE:30915482), complete cds. > > > ACCESSION BC074905 > > > VERSION BC074905.2 GI:50959825 > > > KEYWORDS MGC. 
> > > SOURCE Homo sapiens (human) > > > ORGANISM Homo sapiens > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > > Catarrhini; Hominidae; Homo. > > > REFERENCE 1 (bases 1 to 838) > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > > CONSRTM Mammalian Gene Collection Program Team > > > TITLE Generation and initial analysis of more than 15,000 full-length > > > human and mouse cDNA sequences > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 
99 (26), 16899-16903 (2002) > > > PUBMED 12477932 > > > REFERENCE 2 (bases 1 to 838) > > > CONSRTM NIH MGC Project > > > TITLE Direct Submission > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. > > > Contact: MGC help desk > > > Email: cgapbs-r at mail.nih.gov > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > > Center > > > cDNA Library Preparation: British Columbia Cancer Research Center > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > > DNA Sequencing by: Genome Sequence Centre, > > > BC Cancer Agency, Vancouver, BC, Canada > > > info at bcgsc.bc.ca > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > > > Clone distribution: MGC clone distribution information can be found > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > > > Differences found between this sequence and the human reference > > > genome (build 36) are described in misc_difference features below. 
> > > FEATURES Location/Qualifiers > > > source 1..838 > > > /organism="Homo sapiens" > > > /mol_type="mRNA" > > > /db_xref="taxon:9606" > > > /clone="MGC:104038 IMAGE:30915482" > > > /tissue_type="Lung, PCR rescued clones" > > > /clone_lib="NIH_MGC_273" > > > /lab_host="DH10B" > > > /note="Vector: pCR4 Topo TA with reversed insert" > > > gene 1..838 > > > /gene="KLK14" > > > /note="synonym: KLK-L6" > > > /db_xref="GeneID:43847" > > > /db_xref="HGNC:6362" > > > /db_xref="IMGT/GENE-DB:6362" > > > /db_xref="MIM:606135" > > > CDS 49..804 > > > /gene="KLK14" > > > /codon_start=1 > > > /product="KLK14 protein" > > > /protein_id="AAH74905.1" > > > /db_xref="GI:50959826" > > > /db_xref="GeneID:43847" > > > /db_xref="HGNC:6362" > > > /db_xref="IMGT/GENE-DB:6362" > > > /db_xref="MIM:606135" > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > > misc_difference 98 > > > /gene="KLK14" > > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > > difference: 'R' in cDNA, 'Q' in the human genome." > > > misc_difference 133 > > > /gene="KLK14" > > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > > ORIGIN > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > > // > > > > > > I get the following exception: > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > > org.biojava.bio.BioException: Could not read sequence > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > 
----------------------------------------------------------------------- > > > > > > I'm trying to see what could be the problem with this particular > > > sequence. Looks to me like the AUTHORS portion is not getting parsed > > > correctly. Any ideas would be greatly appreciated! > > > > > -- > > Richard Holland (BioMart Team) > > EMBL-EBI > > Wellcome Trust Genome Campus > > Hinxton > > Cambridge CB10 1SD > > UNITED KINGDOM > > Tel: +44-(0)1223-494416 > > > > > > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From richard.holland at ebi.ac.uk Mon Jun 5 08:47:33 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Mon, 05 Jun 2006 09:47:33 +0100 Subject: [Biojava-l] viterbi training in biojava In-Reply-To: <1149065381.3948.2.camel@texas.ebi.ac.uk> References:

<1149065381.3948.2.camel@texas.ebi.ac.uk> Message-ID: <1149497253.3947.13.camel@texas.ebi.ac.uk> I just got a bounce response for this message. So I'm trying again in case you didn't get it the first time... cheers, Richard On Wed, 2006-05-31 at 09:49 +0100, Richard Holland wrote: > I've modified BaumWelchSampler in CVS so that it accepts alternative > score types as an additional parameter to singleSequenceIterator(). > > cheers, > Richard. > > > On Tue, 2006-05-30 at 16:43 +0100, wendy wong wrote: > > thanks! i only need one head so BaumWelchSampler works fine with me. > > The default SCORETYPE is probability and when I tried it the score > > goes back and forth, like + for one time and - for the next time. I > > then changed it to LOGODDS and recompiled biojava and now that the > > score is steadily increasing. I was wondering if the SCORETYPE could > > be passed in as an argument in the next version of biojava? > > > > thanks, > > wendy > > > > > > On 30 May 2006 12:19:15 +0100, David Huen wrote: > > > On May 30 2006, wendy wong wrote: > > > > > > >Hi, > > > > > > > >I was wondering if viterbi training is implemented in biojava, or if > > > >there's any open source version implemented using biojava? > > > > > > > There is one-head viterbi training already I think. The training framework > > > doesn't work for two-head - I wrote a viterbi training API that works for > > > two head but it is not fully compatible with the existing API so I never > > > put it into CVS, plus it doesn't have Baum-Welch implemented either. > > > > > > If it is any use to you you can have it. 
> > > > > > Regards, > > > David > > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From mark.schreiber at novartis.com Mon Jun 5 09:43:14 2006 From: mark.schreiber at novartis.com (mark.schreiber at novartis.com) Date: Mon, 5 Jun 2006 17:43:14 +0800 Subject: [Biojava-l] where is biojava used Message-ID: Hello - I have added a page to the biojava site that talks about the use of biojava in projects and publications. Please feel free to add your own URLS and citations. - Mark From johnson.biotech at gmail.com Mon Jun 5 14:37:31 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Mon, 5 Jun 2006 10:37:31 -0400 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: <1149238900.3948.87.camel@texas.ebi.ac.uk> References: <1149238900.3948.87.camel@texas.ebi.ac.uk> Message-ID: Hell again Richard, No sooner I've said about the fix of the last parsing exception than another one came up with Genbank format: -------------------------------------- org.biojava.bio.seq.io.ParseException: DQ431065 org.biojava.bio.BioException: Could not read sequence at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) at exonhit.parsers.GenBankParser.main(GenBankParser.java:326) Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) ... 
3 more org.biojava.bio.seq.io.ParseException: org.biojava.bio.symbol.IllegalSymbolException: This tokenization doesn't contain character: 't' ---------------------------------------- The Genbank file that caused it is as follows: ========================================= LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006 DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial sequence; mitochondrial. ACCESSION DQ431065 VERSION DQ431065.1 GI:90102206 KEYWORDS . SOURCE Vaccinium corymbosum ORGANISM Vaccinium corymbosum Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae; Vaccinium. ? REFERENCE 2 (bases 1 to 425) AUTHORS Naik,L.D. and Rowland,L.J. TITLE Expressed Sequence Tags of cDNA clones from subtracted library of Vaccinium corymbosum JOURNAL Unpublished (2005) FEATURES Location/Qualifiers source 1..425 /organism="Vaccinium corymbosum" /mol_type="genomic DNA" /cultivar="Bluecrop" /db_xref="taxon:69266" /tissue_type="Flower buds" /clone_lib="Subtracted cDNA library of Vaccinium corymbosum" /dev_stage="399 hour chill unit exposure" /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I" rRNA <1..>425 /product="16S ribosomal RNA" ORIGIN 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag 421 cgtaa // ================================== I think it's the presence of the '?' at the beginning of the line?!?! 
I'm not sure whether the information that was supposed to be present instead of those question marks is absent from the original ASN.1 batch file or it's a bug in the NCBI ASN2GB software. It looks to me that the former is the case since the file from the NCBI website contains much more information than the batch file. Just bringing this to everyone's attention.

--
Best Regards,

Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358

On 6/2/06, Richard Holland wrote:
> Hi Seth.
>
> Your second point, about the authors string not being read correctly in
> Genbank format, has been fixed (or should have been if I got the code
> right!). Could you check the latest version of biojava-live out of CVS
> and give it another go? Basically the parser did not recognise the
> CONSRTM tag, as it is not mentioned in the sample record provided by
> NCBI, which is what I based the parser on.
...
>
> cheers,
> Richard
>
>

From richard.holland at ebi.ac.uk Mon Jun 5 15:11:07 2006
From: richard.holland at ebi.ac.uk (Richard Holland)
Date: Mon, 05 Jun 2006 16:11:07 +0100
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To:
References: <1149238900.3948.87.camel@texas.ebi.ac.uk>
Message-ID: <1149520267.3947.36.camel@texas.ebi.ac.uk>

Hi again.

Could you remove the offending question mark from the GenBank file and try it again to see if that fixes it? The parser should just ignore it but apparently not. The error looks weird to me because the tokenization for a DNA GenBank file _does_ contain the letter 't'! Not sure what's going on here.

With regard to your INSDseqXML problems, the stacktrace pointed to a bug in SimpleRichSequenceBuilder that would actually cause these problems for any file containing no qualifier value for a feature, regardless of format. I think I have fixed this now. Could you test it? (It's in CVS already).
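For anyone wanting to try the question-mark workaround without hand-editing each file, here is a minimal sketch, assuming the stray lines contain nothing but a single '?'. The class `QuestionMarkLineFilter` and its methods are hypothetical helpers, not part of BioJava; only `java.io` is used. Whether removing these lines actually cures the IllegalSymbolException is not confirmed anywhere in this thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical helper (not part of BioJava): drops lines that consist only of '?'. */
public class QuestionMarkLineFilter {

    /**
     * Reads every line from the input and returns the text with each line
     * whose trimmed content is exactly "?" removed. Other lines are kept
     * unchanged and terminated with '\n'.
     */
    public static String strip(Reader in) throws IOException {
        StringBuilder kept = new StringBuilder();
        BufferedReader br = new BufferedReader(in);
        String line;
        while ((line = br.readLine()) != null) {
            if (line.trim().equals("?")) {
                continue; // stray placeholder line; skip it
            }
            kept.append(line).append('\n');
        }
        return kept.toString();
    }

    /** Convenience wrapper: returns a BufferedReader over the filtered text. */
    public static BufferedReader filtered(Reader in) throws IOException {
        return new BufferedReader(new StringReader(strip(in)));
    }

    public static void main(String[] args) throws IOException {
        // Tiny made-up excerpt, only to show the filter in action.
        String sample = "KEYWORDS    .\n?\nREFERENCE   2  (bases 1 to 425)\n";
        System.out.print(strip(new StringReader(sample)));
    }
}
```

The reader returned by `filtered(...)` could then be handed to `RichSequence.IOTools.readGenbankDNA(...)` in place of the plain `BufferedReader` used in the code earlier in this thread. The sketch holds the whole file in memory, which is fine for small test files but may not suit the full daily update batches.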
cheers, Richard On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote: > Hell again Richard, > > No sooner I've said about the fix of the last parsing exception than > another one came up with Genbank format: > -------------------------------------- > org.biojava.bio.seq.io.ParseException: DQ431065 > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:326) > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > ... 3 more > org.biojava.bio.seq.io.ParseException: > org.biojava.bio.symbol.IllegalSymbolException: This tokenization > doesn't contain character: 't' > ---------------------------------------- > The Genbank file that caused it is as follows: > ========================================= > LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006 > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial > sequence; mitochondrial. > ACCESSION DQ431065 > VERSION DQ431065.1 GI:90102206 > KEYWORDS . > SOURCE Vaccinium corymbosum > ORGANISM Vaccinium corymbosum > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; > Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; > asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae; > Vaccinium. > ? > REFERENCE 2 (bases 1 to 425) > AUTHORS Naik,L.D. and Rowland,L.J. 
> TITLE Expressed Sequence Tags of cDNA clones from subtracted library of > Vaccinium corymbosum > JOURNAL Unpublished (2005) > FEATURES Location/Qualifiers > source 1..425 > /organism="Vaccinium corymbosum" > /mol_type="genomic DNA" > /cultivar="Bluecrop" > /db_xref="taxon:69266" > /tissue_type="Flower buds" > /clone_lib="Subtracted cDNA library of Vaccinium > corymbosum" > /dev_stage="399 hour chill unit exposure" > /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I" > rRNA <1..>425 > /product="16S ribosomal RNA" > ORIGIN > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta > 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag > 421 cgtaa > // > ================================== > I think it's the presence of the '?' at the beginning of the line?!?! > I'm not sure wether the information that was supposed to be present > instead of those question marks is absent from the original ASN.1 > batch file or it's a bug in the NCBI ASN2GO software. It looks to me > that the former is the case since the file from NCBI website contains > much more information than the batch file. Just bringing this to > everyone's attention. > > > -- > Best Regards, > > > Seth Johnson > Senior Bioinformatics Associate > > Ph: (202) 470-0900 > Fx: (775) 251-0358 > > On 6/2/06, Richard Holland wrote: > > Hi Seth. > > > > Your second point, about the authors string not being read correctly in > > Genbank format, has been fixed (or should have been if I got the code > > right!). Could you check the latest version of biojava-live out of CVS > > and give it another go? 
Basically the parser did not recognise the > > CONSRTM tag, as it is not mentioned in the sample record provided by > > NCBI, which is what I based the parser on. > ... > > > > cheers, > > Richard > > > > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From richard.holland at ebi.ac.uk Mon Jun 5 15:16:37 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Mon, 05 Jun 2006 16:16:37 +0100 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149497066.3947.12.camel@texas.ebi.ac.uk> Message-ID: <1149520598.3947.38.camel@texas.ebi.ac.uk> Doh! I am in desperate need of coffee methinks... that's the second error in EMBLFormat directly related to me being stupid when I cut-and-pasted the stuff for the new 87+ ID line format... Should be fixed now in CVS (as of about 30 seconds ago). cheers, Richard On Mon, 2006-06-05 at 11:05 -0400, Seth Johnson wrote: > Hi Richard, > > I got another exception with the EMBL format: > ============================= > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:347) > Caused by: java.lang.IllegalStateException: No match found > at java.util.regex.Matcher.group(Matcher.java:461) > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:311) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > ... 1 more > Java Result: -1 > ============================= > I used the same file from the test directory (AY069118.em) > > > Seth > > On 6/5/06, Richard Holland wrote: > > This one should be fixed in CVS now.
Typo on my behalf - I put in code > > to make it work with both 87+ and pre-87 version of EMBL, then got the > > regexes the wrong way round!! > > > ... > > > > cheers, > > Richard > > > > > > On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote: > > > Hi Richard, > > > > > > I made sure I have the latest source code from CVS compiled > > > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy > > > to report that GenBank issue is solved!!!! > > > As far as EMBL parsing, I apologize for not providing the stack dump > > > for ISSUE #1. Here's the dump of the exception: > > > -------------------------------------------------------- > > > org.biojava.bio.BioException: Could not read sequence > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:359) > > > Caused by: java.lang.NumberFormatException: null > > > at java.lang.Integer.parseInt(Integer.java:415) > > > at java.lang.Integer.parseInt(Integer.java:497) > > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299) > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > ... 
1 more > > > Java Result: -1 > > > ------------------------------------------------------- > > > Here, again, is the code that I'm using to to parse: > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > BufferedReader gbBR = null; > > > try { > > > gbBR = new BufferedReader(new > > > FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb")); > > > } catch (FileNotFoundException fnfe) { > > > fnfe.printStackTrace(); > > > System.exit(-1); > > > } > > > Namespace gbNspace = (Namespace) > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > Object[]{"gbSpace"} ); > > > RichSequenceIterator gbSeqs = > > > RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace); > > > while (gbSeqs.hasNext()) { > > > try { > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > NCBITaxon myTaxon = rs.getTaxon(); > > > }catch (BioException be){ > > > be.printStackTrace(); > > > System.exit(-1); > > > } > > > } > > > ~~~~~~~~~~~~~~~~~~~~~~~~~ > > > And here's the EMBL file that I'm trying to parse: > > > +++++++++++++++++++++++++ > > > ID DQ472184 standard; DNA; INV; 546 BP. > > > XX > > > AC DQ472184; > > > XX > > > SV DQ472184.1 > > > DT 15-MAY-2006 > > > XX > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > DE complete cds. > > > XX > > > KW . > > > XX > > > OS Trypanosoma cruzi strain CL Brener > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > OC Schizotrypanum. > > > XX > > > RN [1] > > > RP 1-546 > > > RA De Melo L.D.B.; > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > RL Unpublished. > > > XX > > > RN [2] > > > RP 1-546 > > > RA De Melo L.D.B.; > > > RT ; > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > RL 21949-900, Brazil > > > XX > > > FH Key Location/Qualifiers > > > FH > > > FT source 1..546 > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > FT /mol_type="genomic DNA" > > > FT /strain="CL Brener" > > > FT /db_xref="taxon:353153" > > > FT gene <1..>546 > > > FT /gene="ARC21" > > > FT /note="TcARC21" > > > FT mRNA <1..>546 > > > FT /gene="ARC21" > > > FT /product="actin-related protein 3" > > > FT CDS 1..546 > > > FT /gene="ARC21" > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > FT member of Arp2/3 complex" > > > FT /codon_start=1 > > > FT /product="actin-related protein 3" > > > FT /protein_id="ABF13401.1" > > > FT /db_xref="GI:93360014" > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > agttag 546 > > > // > > > ID DQ472185 standard; DNA; INV; 543 BP. 
> > > XX > > > AC DQ472185; > > > XX > > > SV DQ472185.1 > > > DT 15-MAY-2006 > > > XX > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > DE complete cds. > > > XX > > > KW . > > > XX > > > OS Trypanosoma cruzi strain CL Brener > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > OC Schizotrypanum. > > > XX > > > RN [1] > > > RP 1-543 > > > RA De Melo L.D.B.; > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > RL Unpublished. > > > XX > > > RN [2] > > > RP 1-543 > > > RA De Melo L.D.B.; > > > RT ; > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > RL 21949-900, Brazil > > > XX > > > FH Key Location/Qualifiers > > > FH > > > FT source 1..543 > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > FT /mol_type="genomic DNA" > > > FT /strain="CL Brener" > > > FT /db_xref="taxon:353153" > > > FT gene <1..>543 > > > FT /gene="ARC20" > > > FT /note="TcARC20" > > > FT mRNA <1..>543 > > > FT /gene="ARC20" > > > FT /product="actin-related protein 4" > > > FT CDS 1..543 > > > FT /gene="ARC20" > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > FT member of Arp2/3 complex" > > > FT /codon_start=1 > > > FT /product="actin-related protein 4" > > > FT /protein_id="ABF13402.1" > > > FT /db_xref="GI:93360016" > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > cgcattgtgc 
gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > tga 543 > > > // > > > +++++++++++++++++++++++++++++++++ > > > > > > It looks to me like there's some kind of problem with parsing the > > > sequence version number. I even tried the sequence from test directory > > > (AY069118.em) with same outcome. > > > > > > Regards, > > > > > > Seth > > > > > > On 6/2/06, Richard Holland wrote: > > > > Hi Seth. > > > > > > > > Your second point, about the authors string not being read correctly in > > > > Genbank format, has been fixed (or should have been if I got the code > > > > right!). Could you check the latest version of biojava-live out of CVS > > > > and give it another go? Basically the parser did not recognise the > > > > CONSRTM tag, as it is not mentioned in the sample record provided by > > > > NCBI, which is what I based the parser on. > > > > > > > > I've set it up now so that it reads the CONSRTM tag, but the value is > > > > merged with the authors tag with (consortium) appended. There will still > > > > be problems if the consortium value has commas in it - not sure how to > > > > fix this yet. > > > > > > > > Your first point is harder to solve because you did not provide a > > > > complete stack trace for the exceptions you are getting. The complete > > > > stack trace would enable me to identify exactly where things are going > > > > wrong and give me a better chance of fixing them. Could you send the > > > > stack trace, and I'll see what I can do. 
> > > > > > > > cheers, > > > > Richard > > > > > > > > > > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote: > > > > > Hi All, > > > > > > > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > > > > > clarification on several issues that I'm having. > > > > > I am developing a parser that would take as input "NCBI Incremental > > > > > ASN.1 Sequence Updates to Genbank" files ( > > > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > > > > ASN2GB converter ( > > > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > > > > resulting sequences to a format parsable by BioJava(X) ( > > > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > > > > my problems start. > > > > > > > > > > ISSUE 1: > > > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > > > > > format is recognized by the > > > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > > > > some exceptions that I'll describe in issue #2. 
This is the code that > > > > > I'm using to parse, for example, the EMBL output: > > > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > > > > Namespace gbNspace = (Namespace) > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > Object[]{"gbSpace"} ); > > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > > > > while (gbSeqs.hasNext()) { > > > > > try { > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > // Further processing or RichSequence object from here > > > > > > > > > > } catch (BioException be){ > > > > > be.printStackTrace(); > > > > > } > > > > > } > > > > > > > > > > The multi-sequence EMBL file looks like this: > > > > > --------------------------------------------------------------------------------- > > > > > ID DQ472184 standard; DNA; INV; 546 BP. > > > > > XX > > > > > AC DQ472184; > > > > > XX > > > > > SV DQ472184.1 > > > > > DT 15-MAY-2006 > > > > > XX > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > > > DE complete cds. > > > > > XX > > > > > KW . > > > > > XX > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > OC Schizotrypanum. > > > > > XX > > > > > RN [1] > > > > > RP 1-546 > > > > > RA De Melo L.D.B.; > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > RL Unpublished. > > > > > XX > > > > > RN [2] > > > > > RP 1-546 > > > > > RA De Melo L.D.B.; > > > > > RT ; > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > RL 21949-900, Brazil > > > > > XX > > > > > FH Key Location/Qualifiers > > > > > FH > > > > > FT source 1..546 > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > FT /mol_type="genomic DNA" > > > > > FT /strain="CL Brener" > > > > > FT /db_xref="taxon:353153" > > > > > FT gene <1..>546 > > > > > FT /gene="ARC21" > > > > > FT /note="TcARC21" > > > > > FT mRNA <1..>546 > > > > > FT /gene="ARC21" > > > > > FT /product="actin-related protein 3" > > > > > FT CDS 1..546 > > > > > FT /gene="ARC21" > > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > > FT member of Arp2/3 complex" > > > > > FT /codon_start=1 > > > > > FT /product="actin-related protein 3" > > > > > FT /protein_id="ABF13401.1" > > > > > FT /db_xref="GI:93360014" > > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > > agttag 546 > > > > > // > > > > > ID DQ472185 standard; DNA; INV; 
543 BP. > > > > > XX > > > > > AC DQ472185; > > > > > XX > > > > > SV DQ472185.1 > > > > > DT 15-MAY-2006 > > > > > XX > > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > > DE complete cds. > > > > > XX > > > > > KW . > > > > > XX > > > > > OS Trypanosoma cruzi strain CL Brener > > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > > OC Schizotrypanum. > > > > > XX > > > > > RN [1] > > > > > RP 1-543 > > > > > RA De Melo L.D.B.; > > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > > RL Unpublished. > > > > > XX > > > > > RN [2] > > > > > RP 1-543 > > > > > RA De Melo L.D.B.; > > > > > RT ; > > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > > RL 21949-900, Brazil > > > > > XX > > > > > FH Key Location/Qualifiers > > > > > FH > > > > > FT source 1..543 > > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > > FT /mol_type="genomic DNA" > > > > > FT /strain="CL Brener" > > > > > FT /db_xref="taxon:353153" > > > > > FT gene <1..>543 > > > > > FT /gene="ARC20" > > > > > FT /note="TcARC20" > > > > > FT mRNA <1..>543 > > > > > FT /gene="ARC20" > > > > > FT /product="actin-related protein 4" > > > > > FT CDS 1..543 > > > > > FT /gene="ARC20" > > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > > FT member of Arp2/3 complex" > > > > > FT /codon_start=1 > > > > > FT /product="actin-related protein 4" > > > > > FT /protein_id="ABF13402.1" > > > > > FT /db_xref="GI:93360016" > > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > > 
atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > > > tga 543 > > > > > // > > > > > ----------------------------------------------------------------------- > > > > > I get an exception message "Could Not Read Sequence". Same thing > > > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > > > > with the following INSDset file (beginning of the file): > > > > > > > > > > > > > > > > > > > > > > > > > DQ022078 > > > > > 16729 > > > > > DNA > > > > > linear > > > > > ENV > > > > > 15-MAY-2006 > > > > > 15-MAY-2006 > > > > > Uncultured bacterium WWRS-2005 putative > > > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > > > > class C (estA3), putative permease (a3.005), putative transmembrane > > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > > > > protein (a3.012), putative membrane protease subunit (a3.013), > > > > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > > > > hypothetical protein (a3.017) 
genes, complete cds > > > > > DQ022078 > > > > > > > > > > gb|DQ022078.1| > > > > > gi|71842722 > > > > > > > > > > > > > > > ENV > > > > > > > > > > > > > > > > > > > > ? > > > > > 1..16729 > > > > > > > > > > Schmeisser,C. > > > > > Elend,C. > > > > > Streit,W.R. > > > > > > > > > > Isolation and biochemical characterization > > > > > of two novel metagenome derived esterases > > > > > Appl. Environ. Microbiol. 0:0-0 > > > > > (2006) > > > > > > > > > > > > > > > ? > > > > > 1..16729 > > > > > > > > > > Schmeisser,C. > > > > > Elend,C. > > > > > Streit,W.R. > > > > > > > > > > Submitted (29-APR-2005) to the > > > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University > > > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > > > > Germany > > > > > > > > > > > > > > > > > > > > So my question is wether the ASN2GB produces output that's > > > > > incompatible with BioJava parsers or is there a problem with the > > > > > sequence themselves or the problems with the majority of parsers??? 
> > > > > Could it be that I'm using the API wrongly for the above formats, > > > > > although GenBank parser works as advertised with some exceptions > > > > > below: > > > > > > > > > > ISSUE #2: > > > > > When I try to parse GenBank files using the following code: > > > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > > > > Namespace gbNspace = (Namespace) > > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > > Object[]{"gbSpace"} ); > > > > > RichSequenceIterator gbSeqs = > > > > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > > > > while (gbSeqs.hasNext()) { > > > > > try { > > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > > // Further processing or RichSequence object from here > > > > > > > > > > } catch (BioException be){ > > > > > be.printStackTrace(); > > > > > } > > > > > } > > > > > > > > > > Genbank file in question: > > > > > > > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > > > > IMAGE:30915482), complete cds. > > > > > ACCESSION BC074905 > > > > > VERSION BC074905.2 GI:50959825 > > > > > KEYWORDS MGC. > > > > > SOURCE Homo sapiens (human) > > > > > ORGANISM Homo sapiens > > > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > > > > Catarrhini; Hominidae; Homo. 
> > > > > REFERENCE 1 (bases 1 to 838) > > > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > > > > CONSRTM Mammalian Gene Collection Program Team > > > > > TITLE Generation and initial analysis of more than 15,000 full-length > > > > > human and mouse cDNA sequences > > > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > > > > > PUBMED 12477932 > > > > > REFERENCE 2 (bases 1 to 838) > > > > > CONSRTM NIH MGC Project > > > > > TITLE Direct Submission > > > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. 
> > > > > Contact: MGC help desk > > > > > Email: cgapbs-r at mail.nih.gov > > > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > > > > Center > > > > > cDNA Library Preparation: British Columbia Cancer Research Center > > > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > > > > DNA Sequencing by: Genome Sequence Centre, > > > > > BC Cancer Agency, Vancouver, BC, Canada > > > > > info at bcgsc.bc.ca > > > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > > > > > > > Clone distribution: MGC clone distribution information can be found > > > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > > > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > > > > > > > Differences found between this sequence and the human reference > > > > > genome (build 36) are described in misc_difference features below. 
> > > > > FEATURES Location/Qualifiers > > > > > source 1..838 > > > > > /organism="Homo sapiens" > > > > > /mol_type="mRNA" > > > > > /db_xref="taxon:9606" > > > > > /clone="MGC:104038 IMAGE:30915482" > > > > > /tissue_type="Lung, PCR rescued clones" > > > > > /clone_lib="NIH_MGC_273" > > > > > /lab_host="DH10B" > > > > > /note="Vector: pCR4 Topo TA with reversed insert" > > > > > gene 1..838 > > > > > /gene="KLK14" > > > > > /note="synonym: KLK-L6" > > > > > /db_xref="GeneID:43847" > > > > > /db_xref="HGNC:6362" > > > > > /db_xref="IMGT/GENE-DB:6362" > > > > > /db_xref="MIM:606135" > > > > > CDS 49..804 > > > > > /gene="KLK14" > > > > > /codon_start=1 > > > > > /product="KLK14 protein" > > > > > /protein_id="AAH74905.1" > > > > > /db_xref="GI:50959826" > > > > > /db_xref="GeneID:43847" > > > > > /db_xref="HGNC:6362" > > > > > /db_xref="IMGT/GENE-DB:6362" > > > > > /db_xref="MIM:606135" > > > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > > > > misc_difference 98 > > > > > /gene="KLK14" > > > > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > > > > difference: 'R' in cDNA, 'Q' in the human genome." > > > > > misc_difference 133 > > > > > /gene="KLK14" > > > > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > > > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > > > > ORIGIN > > > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > > > > // > > > > > > > > > > I get the following exception: > > > > > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > > > > org.biojava.bio.BioException: Could not read sequence > > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > > > > at 
org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > > > > > ----------------------------------------------------------------------- > > > > > > > > > > I'm trying to see what could be the problem with this particular > > > > > sequence. Looks to me like the AUTHORS portion is not getting parsed > > > > > correctly. Any ideas would be greatly appreciated! > > > > > > > > > -- > > > > Richard Holland (BioMart Team) > > > > EMBL-EBI > > > > Wellcome Trust Genome Campus > > > > Hinxton > > > > Cambridge CB10 1SD > > > > UNITED KINGDOM > > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > > > > > > -- > > Richard Holland (BioMart Team) > > EMBL-EBI > > Wellcome Trust Genome Campus > > Hinxton > > Cambridge CB10 1SD > > UNITED KINGDOM > > Tel: +44-(0)1223-494416 > > > > > > -- Richard Holland (BioMart Team) EMBL-EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UNITED KINGDOM Tel: +44-(0)1223-494416 From johnson.biotech at gmail.com Mon Jun 5 15:05:21 2006 From: johnson.biotech at gmail.com (Seth Johnson) Date: Mon, 5 Jun 2006 11:05:21 -0400 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: <1149497066.3947.12.camel@texas.ebi.ac.uk> References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149497066.3947.12.camel@texas.ebi.ac.uk> Message-ID: Hi Richard, I got another exception with the EMBL format: ============================= org.biojava.bio.BioException: Could not read sequence at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) at exonhit.parsers.GenBankParser.main(GenBankParser.java:347) Caused by: java.lang.IllegalStateException: No match found at java.util.regex.Matcher.group(Matcher.java:461) at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:311) at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) ...
1 more Java Result: -1 ============================= I used the same file from test directory:(AY069118.em) Seth On 6/5/06, Richard Holland wrote: > This one should be fixed in CVS now. Typo on my behalf - I put in code > to make it work with both 87+ and pre-87 version of EMBL, then got the > regexes the wrong way round!! > ... > > cheers, > Richard > > > On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote: > > Hi Richard, > > > > I made sure I have the latest source code from CVS compiled > > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy > > to report that GenBank issue is solved!!!! > > As far as EMBL parsing, I apologize for not providing the stack dump > > for ISSUE #1. Here's the dump of the exception: > > -------------------------------------------------------- > > org.biojava.bio.BioException: Could not read sequence > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:359) > > Caused by: java.lang.NumberFormatException: null > > at java.lang.Integer.parseInt(Integer.java:415) > > at java.lang.Integer.parseInt(Integer.java:497) > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299) > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ... 
1 more > > Java Result: -1 > > ------------------------------------------------------- > > Here, again, is the code that I'm using to to parse: > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > BufferedReader gbBR = null; > > try { > > gbBR = new BufferedReader(new > > FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb")); > > } catch (FileNotFoundException fnfe) { > > fnfe.printStackTrace(); > > System.exit(-1); > > } > > Namespace gbNspace = (Namespace) > > RichObjectFactory.getObject(SimpleNamespace.class, new > > Object[]{"gbSpace"} ); > > RichSequenceIterator gbSeqs = > > RichSequence.IOTools.readEMBLDNA(gbBR,gbNspace); > > while (gbSeqs.hasNext()) { > > try { > > RichSequence rs = gbSeqs.nextRichSequence(); > > NCBITaxon myTaxon = rs.getTaxon(); > > }catch (BioException be){ > > be.printStackTrace(); > > System.exit(-1); > > } > > } > > ~~~~~~~~~~~~~~~~~~~~~~~~~ > > And here's the EMBL file that I'm trying to parse: > > +++++++++++++++++++++++++ > > ID DQ472184 standard; DNA; INV; 546 BP. > > XX > > AC DQ472184; > > XX > > SV DQ472184.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > DE complete cds. > > XX > > KW . > > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-546 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. > > XX > > RN [2] > > RP 1-546 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..546 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>546 > > FT /gene="ARC21" > > FT /note="TcARC21" > > FT mRNA <1..>546 > > FT /gene="ARC21" > > FT /product="actin-related protein 3" > > FT CDS 1..546 > > FT /gene="ARC21" > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 3" > > FT /protein_id="ABF13401.1" > > FT /db_xref="GI:93360014" > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300 > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > agttag 546 > > // > > ID DQ472185 standard; DNA; INV; 543 BP. > > XX > > AC DQ472185; > > XX > > SV DQ472185.1 > > DT 15-MAY-2006 > > XX > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > DE complete cds. > > XX > > KW . 
> > XX > > OS Trypanosoma cruzi strain CL Brener > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > OC Schizotrypanum. > > XX > > RN [1] > > RP 1-543 > > RA De Melo L.D.B.; > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > RL Unpublished. > > XX > > RN [2] > > RP 1-543 > > RA De Melo L.D.B.; > > RT ; > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > RL 21949-900, Brazil > > XX > > FH Key Location/Qualifiers > > FH > > FT source 1..543 > > FT /organism="Trypanosoma cruzi strain CL Brener" > > FT /mol_type="genomic DNA" > > FT /strain="CL Brener" > > FT /db_xref="taxon:353153" > > FT gene <1..>543 > > FT /gene="ARC20" > > FT /note="TcARC20" > > FT mRNA <1..>543 > > FT /gene="ARC20" > > FT /product="actin-related protein 4" > > FT CDS 1..543 > > FT /gene="ARC20" > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > FT member of Arp2/3 complex" > > FT /codon_start=1 > > FT /product="actin-related protein 4" > > FT /protein_id="ABF13402.1" > > FT /db_xref="GI:93360016" > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > FT MKLNVNQRARRAAMEFFLALNFT" > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > attcaattta taattacttt 
cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > tga 543 > > // > > +++++++++++++++++++++++++++++++++ > > > > It looks to me like there's some kind of problem with parsing the > > sequence version number. I even tried the sequence from test directory > > (AY069118.em) with same outcome. > > > > Regards, > > > > Seth > > > > On 6/2/06, Richard Holland wrote: > > > Hi Seth. > > > > > > Your second point, about the authors string not being read correctly in > > > Genbank format, has been fixed (or should have been if I got the code > > > right!). Could you check the latest version of biojava-live out of CVS > > > and give it another go? Basically the parser did not recognise the > > > CONSRTM tag, as it is not mentioned in the sample record provided by > > > NCBI, which is what I based the parser on. > > > > > > I've set it up now so that it reads the CONSRTM tag, but the value is > > > merged with the authors tag with (consortium) appended. There will still > > > be problems if the consortium value has commas in it - not sure how to > > > fix this yet. > > > > > > Your first point is harder to solve because you did not provide a > > > complete stack trace for the exceptions you are getting. The complete > > > stack trace would enable me to identify exactly where things are going > > > wrong and give me a better chance of fixing them. Could you send the > > > stack trace, and I'll see what I can do. > > > > > > cheers, > > > Richard > > > > > > > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote: > > > > Hi All, > > > > > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some > > > > clarification on several issues that I'm having. 
> > > > I am developing a parser that would take as input "NCBI Incremental > > > > ASN.1 Sequence Updates to Genbank" files ( > > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the > > > > ASN2GB converter ( > > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert > > > > resulting sequences to a format parsable by BioJava(X) ( > > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where > > > > my problems start. > > > > > > > > ISSUE 1: > > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank > > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML), > > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank > > > > format is recognized by the > > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with > > > > some exceptions that I'll describe in issue #2. This is the code that > > > > I'm using to parse, for example, the EMBL output: > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb")); > > > > Namespace gbNspace = (Namespace) > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > Object[]{"gbSpace"} ); > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace); > > > > while (gbSeqs.hasNext()) { > > > > try { > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > // Further processing or RichSequence object from here > > > > > > > > } catch (BioException be){ > > > > be.printStackTrace(); > > > > } > > > > } > > > > > > > > The multi-sequence EMBL file looks like this: > > > > --------------------------------------------------------------------------------- > > > > ID DQ472184 standard; DNA; INV; 546 BP. > > > > XX > > > > AC DQ472184; > > > > XX > > > > SV DQ472184.1 > > > > DT 15-MAY-2006 > > > > XX > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene, > > > > DE complete cds. > > > > XX > > > > KW . 
> > > > XX > > > > OS Trypanosoma cruzi strain CL Brener > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > OC Schizotrypanum. > > > > XX > > > > RN [1] > > > > RP 1-546 > > > > RA De Melo L.D.B.; > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > RL Unpublished. > > > > XX > > > > RN [2] > > > > RP 1-546 > > > > RA De Melo L.D.B.; > > > > RT ; > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > RL 21949-900, Brazil > > > > XX > > > > FH Key Location/Qualifiers > > > > FH > > > > FT source 1..546 > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > FT /mol_type="genomic DNA" > > > > FT /strain="CL Brener" > > > > FT /db_xref="taxon:353153" > > > > FT gene <1..>546 > > > > FT /gene="ARC21" > > > > FT /note="TcARC21" > > > > FT mRNA <1..>546 > > > > FT /gene="ARC21" > > > > FT /product="actin-related protein 3" > > > > FT CDS 1..546 > > > > FT /gene="ARC21" > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative > > > > FT member of Arp2/3 complex" > > > > FT /codon_start=1 > > > > FT /product="actin-related protein 3" > > > > FT /protein_id="ABF13401.1" > > > > FT /db_xref="GI:93360014" > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL > > > > FT FPEKDGTGNKFWMAFAKRPFLASS" > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60 > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120 > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180 > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240 > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt 
tgaagcgtga agaggcccat 300 > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360 > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420 > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480 > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540 > > > > agttag 546 > > > > // > > > > ID DQ472185 standard; DNA; INV; 543 BP. > > > > XX > > > > AC DQ472185; > > > > XX > > > > SV DQ472185.1 > > > > DT 15-MAY-2006 > > > > XX > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene, > > > > DE complete cds. > > > > XX > > > > KW . > > > > XX > > > > OS Trypanosoma cruzi strain CL Brener > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma; > > > > OC Schizotrypanum. > > > > XX > > > > RN [1] > > > > RP 1-543 > > > > RA De Melo L.D.B.; > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins"; > > > > RL Unpublished. > > > > XX > > > > RN [2] > > > > RP 1-543 > > > > RA De Melo L.D.B.; > > > > RT ; > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases. 
> > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ > > > > RL 21949-900, Brazil > > > > XX > > > > FH Key Location/Qualifiers > > > > FH > > > > FT source 1..543 > > > > FT /organism="Trypanosoma cruzi strain CL Brener" > > > > FT /mol_type="genomic DNA" > > > > FT /strain="CL Brener" > > > > FT /db_xref="taxon:353153" > > > > FT gene <1..>543 > > > > FT /gene="ARC20" > > > > FT /note="TcARC20" > > > > FT mRNA <1..>543 > > > > FT /gene="ARC20" > > > > FT /product="actin-related protein 4" > > > > FT CDS 1..543 > > > > FT /gene="ARC20" > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative > > > > FT member of Arp2/3 complex" > > > > FT /codon_start=1 > > > > FT /product="actin-related protein 4" > > > > FT /protein_id="ABF13402.1" > > > > FT /db_xref="GI:93360016" > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA > > > > FT MKLNVNQRARRAAMEFFLALNFT" > > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60 > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120 > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180 > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240 > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300 > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360 > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420 > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480 > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540 > > > > tga 543 > > > > // > > > > ----------------------------------------------------------------------- > > > > I get an exception message "Could Not 
Read Sequence". Same thing > > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one > > > > with the following INSDset file (beginning of the file): > > > > > > > > > > > > > > > > > > > > DQ022078 > > > > 16729 > > > > DNA > > > > linear > > > > ENV > > > > 15-MAY-2006 > > > > 15-MAY-2006 > > > > Uncultured bacterium WWRS-2005 putative > > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase > > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase > > > > class C (estA3), putative permease (a3.005), putative transmembrane > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative > > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical > > > > protein (a3.012), putative membrane protease subunit (a3.013), > > > > putative haloalkane dehalogenase (a3.014), putative transcriptional > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and > > > > hypothetical protein (a3.017) genes, complete cds > > > > DQ022078 > > > > > > > > gb|DQ022078.1| > > > > gi|71842722 > > > > > > > > > > > > ENV > > > > > > > > > > > > > > > > ? > > > > 1..16729 > > > > > > > > Schmeisser,C. > > > > Elend,C. > > > > Streit,W.R. > > > > > > > > Isolation and biochemical characterization > > > > of two novel metagenome derived esterases > > > > Appl. Environ. Microbiol. 0:0-0 > > > > (2006) > > > > > > > > > > > > ? > > > > 1..16729 > > > > > > > > Schmeisser,C. > > > > Elend,C. > > > > Streit,W.R. > > > > > > > > Submitted (29-APR-2005) to the > > > > EMBL/GenBank/DDBJ databases. 
Molekulare Enzymtechnologie, University > > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057, > > > > Germany > > > > > > > > > > > > > > > > So my question is wether the ASN2GB produces output that's > > > > incompatible with BioJava parsers or is there a problem with the > > > > sequence themselves or the problems with the majority of parsers??? > > > > Could it be that I'm using the API wrongly for the above formats, > > > > although GenBank parser works as advertised with some exceptions > > > > below: > > > > > > > > ISSUE #2: > > > > When I try to parse GenBank files using the following code: > > > > > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb")); > > > > Namespace gbNspace = (Namespace) > > > > RichObjectFactory.getObject(SimpleNamespace.class, new > > > > Object[]{"gbSpace"} ); > > > > RichSequenceIterator gbSeqs = > > > > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace); > > > > while (gbSeqs.hasNext()) { > > > > try { > > > > RichSequence rs = gbSeqs.nextRichSequence(); > > > > // Further processing or RichSequence object from here > > > > > > > > } catch (BioException be){ > > > > be.printStackTrace(); > > > > } > > > > } > > > > > > > > Genbank file in question: > > > > > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006 > > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038 > > > > IMAGE:30915482), complete cds. > > > > ACCESSION BC074905 > > > > VERSION BC074905.2 GI:50959825 > > > > KEYWORDS MGC. > > > > SOURCE Homo sapiens (human) > > > > ORGANISM Homo sapiens > > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; > > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; > > > > Catarrhini; Hominidae; Homo. 
> > > > REFERENCE 1 (bases 1 to 838) > > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., > > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., > > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K., > > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F., > > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L., > > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L., > > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S., > > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J., > > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J., > > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S., > > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W., > > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A., > > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S., > > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y., > > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D., > > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M., > > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E., > > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A. > > > > CONSRTM Mammalian Gene Collection Program Team > > > > TITLE Generation and initial analysis of more than 15,000 full-length > > > > human and mouse cDNA sequences > > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002) > > > > PUBMED 12477932 > > > > REFERENCE 2 (bases 1 to 838) > > > > CONSRTM NIH MGC Project > > > > TITLE Direct Submission > > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian > > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA > > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov > > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832. 
> > > > Contact: MGC help desk > > > > Email: cgapbs-r at mail.nih.gov > > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer > > > > Center > > > > cDNA Library Preparation: British Columbia Cancer Research Center > > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL) > > > > DNA Sequencing by: Genome Sequence Centre, > > > > BC Cancer Agency, Vancouver, BC, Canada > > > > info at bcgsc.bc.ca > > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson > > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen > > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel > > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave > > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth > > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao, > > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR > > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang, > > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra. > > > > > > > > Clone distribution: MGC clone distribution information can be found > > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov > > > > Series: IRBU Plate: 4 Row: C Column: 3. > > > > > > > > Differences found between this sequence and the human reference > > > > genome (build 36) are described in misc_difference features below. 
> > > > FEATURES Location/Qualifiers > > > > source 1..838 > > > > /organism="Homo sapiens" > > > > /mol_type="mRNA" > > > > /db_xref="taxon:9606" > > > > /clone="MGC:104038 IMAGE:30915482" > > > > /tissue_type="Lung, PCR rescued clones" > > > > /clone_lib="NIH_MGC_273" > > > > /lab_host="DH10B" > > > > /note="Vector: pCR4 Topo TA with reversed insert" > > > > gene 1..838 > > > > /gene="KLK14" > > > > /note="synonym: KLK-L6" > > > > /db_xref="GeneID:43847" > > > > /db_xref="HGNC:6362" > > > > /db_xref="IMGT/GENE-DB:6362" > > > > /db_xref="MIM:606135" > > > > CDS 49..804 > > > > /gene="KLK14" > > > > /codon_start=1 > > > > /product="KLK14 protein" > > > > /protein_id="AAH74905.1" > > > > /db_xref="GI:50959826" > > > > /db_xref="GeneID:43847" > > > > /db_xref="HGNC:6362" > > > > /db_xref="IMGT/GENE-DB:6362" > > > > /db_xref="MIM:606135" > > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA > > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN > > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA > > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV > > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK" > > > > misc_difference 98 > > > > /gene="KLK14" > > > > /note="'G' in cDNA is 'A' in the human genome; amino acid > > > > difference: 'R' in cDNA, 'Q' in the human genome." > > > > misc_difference 133 > > > > /gene="KLK14" > > > > /note="'T' in cDNA is 'C' in the human genome; amino acid > > > > difference: 'Y' in cDNA, 'H' in the human genome." 
> > > > ORIGIN > > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg > > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag > > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg > > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact > > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg > > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac > > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg > > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga > > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc > > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg > > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct > > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc > > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt > > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc > > > > // > > > > > > > > I get the following exception: > > > > > > > > java.lang.IllegalArgumentException: Authors string cannot be null > > > > org.biojava.bio.BioException: Could not read sequence > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107) > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258) > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341) > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null > > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76) > > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356) > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > > > > 
----------------------------------------------------------------------- > > > > > > > > I'm trying to see what could be the problem with this particular > > > > sequence. Looks to me like the AUTHORS portion is not getting parsed > > > > correctly. Any ideas would be greatly appreciated! > > > > > > > -- > > > Richard Holland (BioMart Team) > > > EMBL-EBI > > > Wellcome Trust Genome Campus > > > Hinxton > > > Cambridge CB10 1SD > > > UNITED KINGDOM > > > Tel: +44-(0)1223-494416 > > > > > > > > > > > -- > Richard Holland (BioMart Team) > EMBL-EBI > Wellcome Trust Genome Campus > Hinxton > Cambridge CB10 1SD > UNITED KINGDOM > Tel: +44-(0)1223-494416 > > -- Best Regards, Seth Johnson Senior Bioinformatics Associate Ph: (202) 470-0900 Fx: (775) 251-0358 From richard.holland at ebi.ac.uk Mon Jun 5 15:45:13 2006 From: richard.holland at ebi.ac.uk (Richard Holland) Date: Mon, 05 Jun 2006 16:45:13 +0100 Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files In-Reply-To: References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149520267.3947.36.camel@texas.ebi.ac.uk> Message-ID: <1149522313.3947.48.camel@texas.ebi.ac.uk> Hmmm... interesting. I _could_ put in a special case that ignores the question marks, but that wouldn't be 'nice' really - this is more of a problem with the program that is producing the Genbank files than a problem with the parser trying to read them. '?' is not a valid tag in the official Genbank format, and has no meaning attached to it that I can work out, so I'm reluctant to make the parser recognise it. I'd suggest you contact the people who write the software you are using to produce the Genbank files and ask them if they could stick to the rules! In the meantime you could work around the problem by stripping the question marks in some kind of pre-processor before passing it onto BioJavaX for parsing. cheers, Richard On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote: > Removing '?' 
(or several of them in my case) avoids the following exception: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > org.biojava.bio.BioException: Could not read sequence > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > at exonhit.parsers.GenBankParser.main(GenBankParser.java:348) > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957 > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > ... 1 more > Java Result: -1 > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > I don't know where that previous tokenization problem came from since > I can no longer reproduce it. This time it's more or less straight > forward. > Here's the original file with question marks: > ============================ > LOCUS DQ415957 1437 bp mRNA linear VRT 01-JUN-2006 > DEFINITION Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA, > complete cds. > ACCESSION DQ415957 > VERSION DQ415957.1 GI:89513612 > KEYWORDS . > SOURCE Unknown. > ORGANISM Unknown. > Unclassified. > ? > ? > FEATURES Location/Qualifiers > ? 
> gene 1..1437 > /gene="cmg2a" > CDS 1..1437 > /gene="cmg2a" > /note="cell surface receptor; similar to anthrax toxin > receptor 2 (ANTXR2, ATR2, CMG2)" > /codon_start=1 > /product="capillary morphogenesis protein 2A" > /protein_id="ABD74633.1" > /db_xref="GI:89513613" > /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS > GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS > EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY > GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS > SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT > VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC > CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS > TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL > RRQYDRVSVMRPTSADKGRCMNFSRTQH" > ORIGIN > 1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc > 61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg > 121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat > 181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga > 241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc > 301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact > 361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga > 421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat > 481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg > 541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc > 601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc > 661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa > 721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc > 781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa > 841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc > 901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc > 961 ctgttggctt tggctctgat gtggtggttc tggcctctat 
gctgcactgt cgttattaaa > 1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc > 1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc > 1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag > 1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag > 1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga > 1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg > 1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa > // > > ============================ > > > On 6/5/06, Richard Holland wrote: > > Hi again. > > > > Could you remove the offending question mark from the GenBank file and > > try it again to see if that fixes it? The parser should just ignore it > > but apparently not. The error looks weird to me because the tokenization > > for a DNA GenBank file _does_ contain the letter 't'! Not sure what's > > going on here. > ... > > > > cheers, > > Richard > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote: > > > Hello again Richard, > > > > > > No sooner had I reported the fix for the last parsing exception than > > > another one came up with the Genbank format: > > > -------------------------------------- > > > org.biojava.bio.seq.io.ParseException: DQ431065 > > > org.biojava.bio.BioException: Could not read sequence > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > > at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:326) > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > ... 
3 more > > > org.biojava.bio.seq.io.ParseException: > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization > > > doesn't contain character: 't' > > > ---------------------------------------- > > > The Genbank file that caused it is as follows: > > > ========================================= > > > LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006 > > > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial > > > sequence; mitochondrial. > > > ACCESSION DQ431065 > > > VERSION DQ431065.1 GI:90102206 > > > KEYWORDS . > > > SOURCE Vaccinium corymbosum > > > ORGANISM Vaccinium corymbosum > > > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; > > > Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; > > > asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae; > > > Vaccinium. > > > ? > > > REFERENCE 2 (bases 1 to 425) > > > AUTHORS Naik,L.D. and Rowland,L.J. > > > TITLE Expressed Sequence Tags of cDNA clones from subtracted library of > > > Vaccinium corymbosum > > > JOURNAL Unpublished (2005) > > > FEATURES Location/Qualifiers > > > source 1..425 > > > /organism="Vaccinium corymbosum" > > > /mol_type="genomic DNA" > > > /cultivar="Bluecrop" > > > /db_xref="taxon:69266" > > > /tissue_type="Flower buds" > > > /clone_lib="Subtracted cDNA library of Vaccinium > > > corymbosum" > > > /dev_stage="399 hour chill unit exposure" > > > /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I" > > > rRNA <1..>425 > > > /product="16S ribosomal RNA" > > > ORIGIN > > > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac > > > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt > > > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt > > > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta > > > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat > > > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta > > > 361 
tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
> > > 421 cgtaa
> > > //
> > > ==================================
> > > I think it's the presence of the '?' at the beginning of the line?!?!
> > > I'm not sure whether the information that was supposed to be present
> > > instead of those question marks is absent from the original ASN.1
> > > batch file or it's a bug in the NCBI ASN2GO software. It looks to me
> > > that the former is the case since the file from NCBI website contains
> > > much more information than the batch file. Just bringing this to
> > > everyone's attention.
> > >
> > >
> > > --
> > > Best Regards,
> > >
> > >
> > > Seth Johnson
> > > Senior Bioinformatics Associate
> > >
> > > Ph: (202) 470-0900
> > > Fx: (775) 251-0358
> > >
> > > On 6/2/06, Richard Holland wrote:
> > > > Hi Seth.
> > > >
> > > > Your second point, about the authors string not being read correctly in
> > > > Genbank format, has been fixed (or should have been if I got the code
> > > > right!). Could you check the latest version of biojava-live out of CVS
> > > > and give it another go? Basically the parser did not recognise the
> > > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > > NCBI, which is what I based the parser on.
> > > ...
> > > >
> > > > cheers,
> > > > Richard
> > > >
> > > >
> > --
> > Richard Holland (BioMart Team)
> > EMBL-EBI
> > Wellcome Trust Genome Campus
> > Hinxton
> > Cambridge CB10 1SD
> > UNITED KINGDOM
> > Tel: +44-(0)1223-494416
> >
> >
> >

--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416

From johnson.biotech at gmail.com Mon Jun 5 15:39:40 2006
From: johnson.biotech at gmail.com (Seth Johnson)
Date: Mon, 5 Jun 2006 11:39:40 -0400
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To: <1149520267.3947.36.camel@texas.ebi.ac.uk>
References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149520267.3947.36.camel@texas.ebi.ac.uk>
Message-ID:

Removing '?' (or several of them in my case) avoids the following exception:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
        at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
Caused by: org.biojava.bio.seq.io.ParseException: DQ415957
        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
        ... 1 more
Java Result: -1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I don't know where that previous tokenization problem came from since
I can no longer reproduce it. This time it's more or less
straightforward. Here's the original file with question marks:
============================
LOCUS DQ415957 1437 bp mRNA linear VRT 01-JUN-2006
DEFINITION Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA, complete cds.
ACCESSION DQ415957
VERSION DQ415957.1 GI:89513612
KEYWORDS .
SOURCE Unknown.
  ORGANISM Unknown.
    Unclassified.
?
?
FEATURES Location/Qualifiers
?
gene 1..1437 /gene="cmg2a" CDS 1..1437 /gene="cmg2a" /note="cell surface receptor; similar to anthrax toxin receptor 2 (ANTXR2, ATR2, CMG2)" /codon_start=1 /product="capillary morphogenesis protein 2A" /protein_id="ABD74633.1" /db_xref="GI:89513613" /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL RRQYDRVSVMRPTSADKGRCMNFSRTQH" ORIGIN 1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc 61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg 121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat 181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga 241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc 301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact 361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga 421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat 481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg 541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc 601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc 661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa 721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc 781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa 841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc 901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc 961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa 1021 gacccacctc cacaaagacc tcctccacct ccacctaagc 
tagagccaga cccggaaccc 1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc 1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag 1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag 1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga 1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg 1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa // ============================ On 6/5/06, Richard Holland wrote: > Hi again. > > Could you remove the offending question mark from the GenBank file and > try it again to see if that fixes it? The parser should just ignore it > but apparently not. The error looks weird to me because the tokenization > for a DNA GenBank file _does_ contain the letter 't'! Not sure what's > going on here. ... > > cheers, > Richard > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote: > > Hell again Richard, > > > > No sooner I've said about the fix of the last parsing exception than > > another one came up with Genbank format: > > -------------------------------------- > > org.biojava.bio.seq.io.ParseException: DQ431065 > > org.biojava.bio.BioException: Could not read sequence > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112) > > at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151) > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246) > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:326) > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065 > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245) > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > ... 
3 more > > org.biojava.bio.seq.io.ParseException: > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization > > doesn't contain character: 't' > > ---------------------------------------- > > The Genbank file that caused it is as follows: > > ========================================= > > LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006 > > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial > > sequence; mitochondrial. > > ACCESSION DQ431065 > > VERSION DQ431065.1 GI:90102206 > > KEYWORDS . > > SOURCE Vaccinium corymbosum > > ORGANISM Vaccinium corymbosum > > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; > > Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; > > asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae; > > Vaccinium. > > ? > > REFERENCE 2 (bases 1 to 425) > > AUTHORS Naik,L.D. and Rowland,L.J. > > TITLE Expressed Sequence Tags of cDNA clones from subtracted library of > > Vaccinium corymbosum > > JOURNAL Unpublished (2005) > > FEATURES Location/Qualifiers > > source 1..425 > > /organism="Vaccinium corymbosum" > > /mol_type="genomic DNA" > > /cultivar="Bluecrop" > > /db_xref="taxon:69266" > > /tissue_type="Flower buds" > > /clone_lib="Subtracted cDNA library of Vaccinium > > corymbosum" > > /dev_stage="399 hour chill unit exposure" > > /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I" > > rRNA <1..>425 > > /product="16S ribosomal RNA" > > ORIGIN > > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac > > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt > > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt > > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta > > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat > > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta > > 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag > > 421 cgtaa > > // > 
> ==================================
> > I think it's the presence of the '?' at the beginning of the line?!?!
> > I'm not sure whether the information that was supposed to be present
> > instead of those question marks is absent from the original ASN.1
> > batch file or it's a bug in the NCBI ASN2GO software. It looks to me
> > that the former is the case since the file from NCBI website contains
> > much more information than the batch file. Just bringing this to
> > everyone's attention.
> >
> >
> > --
> > Best Regards,
> >
> >
> > Seth Johnson
> > Senior Bioinformatics Associate
> >
> > Ph: (202) 470-0900
> > Fx: (775) 251-0358
> >
> > On 6/2/06, Richard Holland wrote:
> > > Hi Seth.
> > >
> > > Your second point, about the authors string not being read correctly in
> > > Genbank format, has been fixed (or should have been if I got the code
> > > right!). Could you check the latest version of biojava-live out of CVS
> > > and give it another go? Basically the parser did not recognise the
> > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > NCBI, which is what I based the parser on.
> > ...
> > >
> > > cheers,
> > > Richard
> > >
> > >
> --
> Richard Holland (BioMart Team)
> EMBL-EBI
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> UNITED KINGDOM
> Tel: +44-(0)1223-494416
>
>

--
Best Regards,

Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358

From johnson.biotech at gmail.com Mon Jun 5 14:22:57 2006
From: johnson.biotech at gmail.com (Seth Johnson)
Date: Mon, 5 Jun 2006 10:22:57 -0400
Subject: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
In-Reply-To: <1149497066.3947.12.camel@texas.ebi.ac.uk>
References: <1149238900.3948.87.camel@texas.ebi.ac.uk> <1149497066.3947.12.camel@texas.ebi.ac.uk>
Message-ID:

I apologize again for not posting the stacktrace.
Here it is:
==========================
org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
        at exonhit.parsers.GenBankParser.main(GenBankParser.java:347)
Caused by: java.lang.NullPointerException
        at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addFeatureProperty(SimpleRichSequenceBuilder.java:356)
        at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:853)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
        at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97)
        at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
        ... 1 more
Java Result: -1
============================
Here's the XML that causes that exception (taken out of a bigger file
of several hundred sequences):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DQ485973 1356 DNA linear ENV

08-MAY-2006

Uncultured Mollicutes bacterium clone P7 16S ribosomal RNA gene, partial sequence

DQ485973

DQ485973.1

gb|DQ485973.1| gi|94482885