From andreas at sdsc.edu Wed Jul 8 02:20:59 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 7 Jul 2009 23:20:59 -0700
Subject: [Biojava-l] summary biojava user meeting
Message-ID: <59a41c430907072320k3d5a4415u962d59a10d286beb@mail.gmail.com>

Hi,

Here is a quick summary of the BioJava user meeting we had last week at the BOSC conference:

The following people were present:
Mattias Piipari
Martijn Devisscher
Frederik Decouttere
Richard Holland
Andreas Prlic

The new modularized code base will allow individual people to take over responsibility for some of the sub-modules, as well as to contribute new modules, both of which I welcome greatly. As such it was great to have Mattias, Martijn and Frederik there, expressing their interest in this. Mattias is interested in contributing a new module related to machine learning. Martijn and Frederik are interested in providing a new GUI module (seqpad).

Because of this, our discussions were mainly about how to organize the contribution of new modules and their maintenance:

* Before starting a new module, the code should undergo public code review.
* New modules need documentation (wiki cookbook) and JUnit tests.
* A Module Maintainer (MM) is the main person responsible for everything related to the module.
* The MM coordinates patches and other user contributions for the module.
* The MM can write papers related to the code in the module without having to cite all of the other BioJava contributors.
* An MM volunteers to support the module for (at least) a year.
* All MMs will be listed by name on a wiki page in order to clarify responsibilities.

Andreas

From andrew at nervechannel.com Wed Jul 8 14:38:53 2009
From: andrew at nervechannel.com (Andrew Clegg)
Date: Wed, 8 Jul 2009 19:38:53 +0100
Subject: [Biojava-l] Retrieving AUTHOR information from PDB file
Message-ID:

Hopefully this isn't a FAQ but I couldn't find it via Google.

I'm parsing PDB files with PDBFileReader. I want to extract the AUTHOR line(s) but I can't find a way to do this.

You can get the corresponding JRNL field with Structure#getJournalArticle() but the authors of this aren't necessarily the same as the authors of the structure itself.

Any ideas? I'm probably just being shortsighted and overlooking something...

Thanks!

Andrew.

--
:: http://biotext.org.uk/ ::

From andreas at sdsc.edu Thu Jul 9 00:21:49 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 8 Jul 2009 21:21:49 -0700
Subject: [Biojava-l] Retrieving AUTHOR information from PDB file
In-Reply-To:
References:
Message-ID: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com>

Hi Andrew,

The PdbFileParser at the present does not process the AUTHOR lines, yet, but should be easy to add.. If you need this urgently, you could quickly patch the PdbFileParser, otherwise I'll add it to SVN in the next couple of days...

Andreas

On Wed, Jul 8, 2009 at 11:38 AM, Andrew Clegg wrote:
> Hopefully this isn't a FAQ but I couldn't find it via Google.
>
> I'm parsing PDB files with PDBFileReader. I want to extract the AUTHOR
> line(s) but I can't find a way to do this.
>
> You can get the corresponding JRNL field with
> Structure#getJournalArticle() but the authors of this aren't
> necessarily the same as the authors of the structure itself.
>
> Any ideas? I'm probably just being shortsighted and overlooking
> something...
>
> Thanks!
>
> Andrew.
>
> --
> :: http://biotext.org.uk/ ::
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From andrew at nervechannel.com Thu Jul 9 04:38:08 2009
From: andrew at nervechannel.com (Andrew Clegg)
Date: Thu, 9 Jul 2009 09:38:08 +0100
Subject: [Biojava-l] Retrieving AUTHOR information from PDB file
In-Reply-To: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com>
References: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com>
Message-ID:

2009/7/9 Andreas Prlic :
> Hi Andrew,
>
> The PdbFileParser at the present does not process the AUTHOR lines, yet, but
> should be easy to add.. If you need this urgently, you could quickly patch
> the PdbFileParser, otherwise I'll add it to SVN in the next couple of
> days...

No, it's not urgent, I can wait til you have a chance to do it rather than trying to figure it out myself :-)

Could you let me know when you've added it in though please?

Many thanks!

Andrew.

From andrew at nervechannel.com Thu Jul 9 07:12:16 2009
From: andrew at nervechannel.com (Andrew Clegg)
Date: Thu, 9 Jul 2009 12:12:16 +0100
Subject: [Biojava-l] Retrieving AUTHOR information from PDB file
In-Reply-To:
References: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com>
Message-ID:

By the way -- does anyone have any documentation on which PDB fields correspond to which methods on the Structure and PDBHeader objects?

In some cases, this is obvious, but in others it's not so clear -- for example PDBHeader#getDescription(), since there isn't a DESCRIPTION field in a PDB file.

If not, I can start putting one together for the wiki, but I don't want to duplicate work...

Andrew.

2009/7/9 Andrew Clegg :
> 2009/7/9 Andreas Prlic :
>> Hi Andrew,
>>
>> The PdbFileParser at the present does not process the AUTHOR lines, yet, but
>> should be easy to add..
>> If you need this urgently, you could quickly patch
>> the PdbFileParser, otherwise I'll add it to SVN in the next couple of
>> days...
>
> No, it's not urgent, I can wait til you have a chance to do it rather
> than trying to figure it out myself :-)
>
> Could you let me know when you've added it in though please?
>
> Many thanks!
>
> Andrew.
>

--
:: http://biotext.org.uk/ ::

From paolo.pavan at gmail.com Thu Jul 9 11:58:07 2009
From: paolo.pavan at gmail.com (Paolo Pavan)
Date: Thu, 9 Jul 2009 17:58:07 +0200
Subject: [Biojava-l] Assembly data reading
Message-ID: <56be91b60907090858t41f2c72cwf7db057e6390d6db@mail.gmail.com>

Hi everybody,

I'm almost new to this topic. I would like to know if there is something that can help me load data from a large 454 contig into my Java program. I also need to keep in memory, and have access to, the data from the single reads forming the contig. If it is not possible to load a *.gff data file, loading a *.ace data file would be fine too.

Many thanks for any suggestion you can give me!

Greetings,
Paolo

From wzhao6898 at gmail.com Thu Jul 9 13:16:37 2009
From: wzhao6898 at gmail.com (David Zhao)
Date: Thu, 9 Jul 2009 17:16:37 +0000 (UTC)
Subject: [Biojava-l] Pairwise alignment of protein sequences
Message-ID:

Hi there,

I'm new to biojava, and trying to generate a pairwise alignment of 2 protein sequences following the example here (http://www.biojava.org/wiki/BioJava:CookBook:DP:PairWise2). However, I got:

java.lang.StringIndexOutOfBoundsException: String index out of range: 0
    at java.lang.String.charAt(Unknown Source)
    at org.biojava.bio.alignment.SubstitutionMatrix.parseMatrix(SubstitutionMatrix.java:304)
    at org.biojava.bio.alignment.SubstitutionMatrix.<init>(SubstitutionMatrix.java:100)
    at com.activx.lims.util.ms.tests.TargetListGenerationUtilsTest.testBioJava(TargetListGenerationUtilsTest.java:94)
    ...

error, and here is my code:

import ...
FiniteAlphabet alphabet = (FiniteAlphabet) AlphabetManager.alphabetForName("PROTEIN");
File matrixFile = new File(FULL_PATH_NUC44);
SubstitutionMatrix matrix = new SubstitutionMatrix(alphabet, matrixFile);
SequenceAlignment aligner = new NeedlemanWunsch(new Short("0"), new Short("3"), new Short("2"), new Short("2"), new Short("1"), matrix);
Sequence query = ProteinTools.createProteinSequence(PeptidePeer.retrieveByPK(10126404).getSequence(), "query");
Sequence target = ProteinTools.createProteinSequence(PeptidePeer.retrieveByPK(10109235).getSequence(), "target");
// Perform an alignment and save the results.
aligner.pairwiseAlignment(
    query, // first sequence
    target // second one
);
// Print the alignment to the screen
System.out.println("Global alignment with Needleman-Wunsch:\n" + aligner.getAlignmentString());

Thanks in advance for any help!
David

From wzhao6898 at gmail.com Thu Jul 9 14:13:45 2009
From: wzhao6898 at gmail.com (David Zhao)
Date: Thu, 9 Jul 2009 18:13:45 +0000 (UTC)
Subject: [Biojava-l] Pairwise alignment of protein sequences
References:
Message-ID:

David Zhao <wzhao6898 at gmail.com> writes:

I found the answer in a post here (http://portal.open-bio.org/pipermail/biojava-l/2007-February.txt#). In short, I need to use
AlphabetManager.alphabetForName("PROTEIN-TERM");
instead of
AlphabetManager.alphabetForName("PROTEIN");
to parse the matrix file.

Another question though: now that I have the alignment, how do I retrieve the score and the mismatch and gap information from it?

Time (ms): 62
Length: 25
Score: 180
Query: query, Length: 25
Target: target, Length: 25

Query:  1 KLFVGGIKEDTEEHHLRDYFEEYGK 25
          | ||||||||||||||||||| |||
Target: 1 KIFVGGIKEDTEEHHLRDYFEQYGK 25

Thanks!
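
Regarding the mismatch and gap question, one option that needs nothing beyond the printed report is to count the columns of the two aligned rows directly. The following is only a minimal sketch in plain Java: it assumes the two rows have already been cut out of the "Query:" and "Target:" lines and that gaps are written as '-'. The class AlignmentStats and its method names are made up for illustration and are not part of BioJava.

```java
/** Hypothetical helper (not a BioJava class): column statistics for two aligned rows. */
final class AlignmentStats {
    final int identities;
    final int mismatches;
    final int gaps;

    private AlignmentStats(int identities, int mismatches, int gaps) {
        this.identities = identities;
        this.mismatches = mismatches;
        this.gaps = gaps;
    }

    /**
     * Compares the rows column by column.
     * Assumption: both rows have equal length and use '-' as the gap character.
     */
    static AlignmentStats of(String queryRow, String targetRow) {
        if (queryRow.length() != targetRow.length()) {
            throw new IllegalArgumentException("aligned rows must have the same length");
        }
        int identities = 0;
        int mismatches = 0;
        int gaps = 0;
        for (int i = 0; i < queryRow.length(); i++) {
            char q = queryRow.charAt(i);
            char t = targetRow.charAt(i);
            if (q == '-' || t == '-') {
                gaps++;          // a gap in either row
            } else if (q == t) {
                identities++;    // same residue in both rows
            } else {
                mismatches++;    // two different residues
            }
        }
        return new AlignmentStats(identities, mismatches, gaps);
    }

    public static void main(String[] args) {
        // The two rows from the report above.
        AlignmentStats s = AlignmentStats.of(
                "KLFVGGIKEDTEEHHLRDYFEEYGK",
                "KIFVGGIKEDTEEHHLRDYFEQYGK");
        System.out.println("identities=" + s.identities
                + " mismatches=" + s.mismatches
                + " gaps=" + s.gaps);
    }
}
```

The score itself is not recomputed here; it would still have to be read from the "Score:" line of the report.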
David

From wzhao6898 at gmail.com Thu Jul 9 16:22:47 2009
From: wzhao6898 at gmail.com (David Zhao)
Date: Thu, 9 Jul 2009 20:22:47 +0000 (UTC)
Subject: [Biojava-l] How to parse pairwise alignment result string
Message-ID:

Hi there,

I've successfully aligned 2 peptide sequences using the Needleman-Wunsch algorithm:

// aligner is a SequenceAlignment
aligner.pairwiseAlignment(
    query, // first sequence
    target // second one
);

Now, the only output I can get is the string returned by aligner.getAlignmentString(), shown below:

Time (ms): 63
Length: 25
Score: 180
Query: query, Length: 25
Target: target, Length: 25

Query:  1 KLFVGGIKEDTEEHHLRDYFEEYGK 25
          | ||||||||||||||||||| |||
Target: 1 KIFVGGIKEDTEEHHLRDYFEQYGK 25

How can I create a gapped alignment object in BioJava from this, so that I can retrieve the score, gap information, etc. from the object?

Thanks in advance!
David

From andreas at sdsc.edu Thu Jul 9 20:16:19 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 9 Jul 2009 17:16:19 -0700
Subject: [Biojava-l] Retrieving AUTHOR information from PDB file
In-Reply-To:
References: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com>
Message-ID: <59a41c430907091716m30251d7as30ca73b8c620bb61@mail.gmail.com>

Such documentation does not exist yet, so please go ahead and add something to the wiki. Since the data model works with both PDB files and mmCIF files, ideally the documentation would cover both ;-)

A

On Thu, Jul 9, 2009 at 4:12 AM, Andrew Clegg wrote:
> By the way -- does anyone have any documentation on which PDB fields
> correspond to which methods on the Structure and PDBHeader objects?
>
> In some cases, this is obvious, but in others it's not so clear -- for
> example PDBHeader#getDescription(), since there isn't a DESCRIPTION
> field in a PDB file.
>
> If not, I can start putting one together for the wiki, but I don't
> want to duplicate work...
>
> Andrew.
>
> 2009/7/9 Andrew Clegg :
>> 2009/7/9 Andreas Prlic :
>>> Hi Andrew,
>>>
>>> The PdbFileParser at the present does not process the AUTHOR lines, yet, but
>>> should be easy to add.. If you need this urgently, you could quickly patch
>>> the PdbFileParser, otherwise I'll add it to SVN in the next couple of
>>> days...
>>
>> No, it's not urgent, I can wait til you have a chance to do it rather
>> than trying to figure it out myself :-)
>>
>> Could you let me know when you've added it in though please?
>>
>> Many thanks!
>>
>> Andrew.
>>
>
>
>
> --
> :: http://biotext.org.uk/ ::
>

From nathan.genome at gmail.com Thu Jul 16 11:14:09 2009
From: nathan.genome at gmail.com (nathan genome)
Date: Thu, 16 Jul 2009 17:14:09 +0200
Subject: [Biojava-l] visualizing pair alignment to a genome
Message-ID:

Hi,

I am working on genomic variations. I am interested in initially visualizing mappings from clone-ends to a reference sequence. I have data in the following tab-delimited format:

CAX100A01FOR1 1 748 scaffold_116 507411 508161 CAX100A01REV1 1 702 scaffold_116 512322 511611

The 1st & 7th cols. represent clone-ends, the numbers are positions, and the 4th col. indicates the reference. How do I visualize this pair alignment in a window with BioJava?

Thanks,
Natthan

From florian.mittag at uni-tuebingen.de Thu Jul 16 11:38:25 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Thu, 16 Jul 2009 17:38:25 +0200
Subject: [Biojava-l] Load Genbank files takes ages
Message-ID: <200907161738.29913.florian.mittag@uni-tuebingen.de>

Hi all!

We try to load Genbank files into our bioseqdb database using BioJava. I copy-pasted the code together from tutorials and previous posts on this mailing list. My problems:

1) It eats huge amounts of memory, so that I needed to increase the heap size to 2GB.

2) Loading the first two files works great, but the third one ran for one to two hours without completion.
Here is my code:

--- snip ---
// loop over all downloaded *.gbk files starting with the highest number
System.out.println("Updating chromosome " + chrNo[j] + " ...");

BufferedReader fileIn = new BufferedReader(new FileReader(localFile));

tx = session.beginTransaction();
GenbankFormat gf = new GenbankFormat();
SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
RichSequence seq = null;

gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
seq = listener.makeRichSequence();

if( seq != null ) {
    // check, if a sequence with this identifier is already in the DB
    Query q = session.createQuery(
        "select be from BioEntry as be where identifier=:identifier");
    q.setString("identifier",seq.getIdentifier());
    List entries = q.list();
    for( Object o : entries ) {
        // delete the old sequence in the DB
        BioEntry oldSeq = (BioEntry)o;
        session.delete("BioEntry", oldSeq);
    }
    tx.commit();

    tx = session.beginTransaction();
    session.save("Sequence", seq);

    System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
} else {
    System.out.println("Chromosome " + chrNo[j] + " was NOT updated.\n");
}

tx.commit();
--- snap ---

This is the generated output:
--- snip ---
Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
Updating chromosome 001807 ...
Chromosome 001807 was updated.
Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
Updating chromosome 000024 ...
Chromosome 000024 was updated.
Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
Updating chromosome 000023 ...
--- snap ---

The files for this are downloaded from Genbank and the file sizes are:
NC_001807.gbk   58.4 KB
NC_000024.gbk   70.8 MB
NC_000023.gbk   190.1 MB

So, I don't see why loading a 70.8 MB file took less than 2 minutes while a 190.1 MB file isn't completed after 2 hours. During this time, the CPU load is almost 100% and there is no significant network or hard disk activity.

When I paused the program (I'm using Eclipse) and looked, where the whole processing power is going to, I ended up with the following stacktrace (sorry for the unreadable format):

CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolList) line: 1460
SimpleSymbolList(AbstractSymbolList).seqString() line: 102
BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequence) line: 115
BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
SimpleRichSequence(ThinRichSequence).seqString() line: 203
SimpleRichSequence.getStringSequence() line: 77
GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
Method.invoke(Object, Object...) line: 597
BasicPropertyAccessor$BasicGetter.get(Object) line: 145
PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object) line: 249
PojoEntityTuplizer.getPropertyValues(Object) line: 244
JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(Object, EntityMode) line: 3567
DefaultFlushEntityEventListener.getValues(Object, EntityEntry, EntityMode, boolean, SessionImplementor) line: 167
DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntities(FlushEvent) line: 196
DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEverythingToExecutions(FlushEvent) line: 76
DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
SessionImpl.autoFlushIfRequired(Set) line: 970
SessionImpl.list(String, QueryParameters) line: 1115
QueryImpl.list() line: 79
QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
Method.invoke(Object, Object...) line: 597
BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
RichObjectFactory.getObject(Class, Object[]) line: 107
GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization, RichSeqIOListener, Namespace) line: 450
UpdateDB_Main.updateChromosome() line: 542

Now we go to GenbankFormat.readRichSequence(). It hangs at about line 450, the line where it loads a CrossRef object, so I added debug output:

--- snip ---
// parameter on old feature
if (key.equals("db_xref")) {
    Matcher m = dbxp.matcher(val);
    if (m.matches()) {
        String dbname = m.group(1);
        String raccession = m.group(2);
        if (dbname.equalsIgnoreCase("taxon")) {
            [...]
        } else {
            try {
                long starttime = System.currentTimeMillis();
                CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[] {dbname, raccession, new Integer(0)});
                long duration = System.currentTimeMillis() - starttime;
                if( duration > 100 ) {
                    System.out.println("dbname: " + dbname + ", raccession: " + raccession);
                    System.out.println("  took " + duration + "ms");
                }
                RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
                rlistener.getCurrentFeature().addRankedCrossRef(rcr);
--- snap ---

Which leads to:

--- snip ---
dbname: GeneID, raccession: 677739
  took 3291ms
dbname: HGNC, raccession: 31847
  took 2427ms
dbname: GeneID, raccession: 55344
  took 2932ms
dbname: HGNC, raccession: 23148
  took 2339ms
dbname: GI, raccession: 94158612
  took 2418ms
dbname: GI, raccession: 8922995
  took 2920ms
[...]
--- snap ---

Which are all /db_xref properties of the NC_000023.gbk file. Searching deeper, it looks like for every CrossRef object loaded, the whole BioEntry object is built and the sequence parsed. But remember, this only happens on chromosome 23, not on 24, which has /db_xref, too.

I already spent some time on this, but I can't figure out, what could be the cause.

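
One possible reading of the stack trace above, offered as a hedged sketch and not as a confirmed diagnosis: each CrossRef lookup runs a Hibernate query, the query triggers an automatic flush (SessionImpl.autoFlushIfRequired), and the flush reads every property of the entities still attached to the session, including getStringSequence() on a large sequence. If a previously saved chromosome is still attached, every lookup would pay for rebuilding its sequence string. The toy model below uses only the JDK; ToySession and ToySequence are hypothetical stand-ins, not Hibernate or BioJava classes, and the model merely counts how many characters get rebuilt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: no Hibernate or BioJava code is involved. */
final class AutoFlushCostDemo {

    /** Stand-in for a persistent entity whose getter rebuilds a large string. */
    static final class ToySequence {
        private final char[] symbols;
        long charsSerialized = 0;

        ToySequence(int length) {
            symbols = new char[length];
            Arrays.fill(symbols, 'a');
        }

        String getStringSequence() {
            charsSerialized += symbols.length; // cost grows with sequence length
            return new String(symbols);
        }
    }

    /** Stand-in for a session that dirty-checks all attached entities before a query. */
    static final class ToySession {
        private final List<ToySequence> attached = new ArrayList<>();

        void save(ToySequence s) { attached.add(s); }

        void clear() { attached.clear(); }

        void query() {
            // Simulated auto-flush: read the property of every attached entity.
            for (ToySequence s : attached) {
                s.getStringSequence();
            }
        }
    }

    /** Returns how many characters were rebuilt while running the given number of lookups. */
    static long simulate(int sequenceLength, int lookups, boolean clearAfterSave) {
        ToySession session = new ToySession();
        ToySequence previous = new ToySequence(sequenceLength);
        session.save(previous);
        if (clearAfterSave) {
            session.clear(); // detach the saved sequence before the lookups start
        }
        for (int i = 0; i < lookups; i++) {
            session.query();
        }
        return previous.charsSerialized;
    }

    public static void main(String[] args) {
        System.out.println("kept in session: " + simulate(1000, 50, false));
        System.out.println("session cleared: " + simulate(1000, 50, true));
    }
}
```

If this reading is right, detaching saved sequences between files (Hibernate's Session.clear() or Session.evict(Object)) or setting the session's flush mode to FlushMode.COMMIT might be worth trying; neither has been tested against this setup.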
Thanks
  Florian

From markjschreiber at gmail.com Thu Jul 16 21:33:58 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 17 Jul 2009 09:33:58 +0800
Subject: [Biojava-l] Load Genbank files takes ages
In-Reply-To: <200907161738.29913.florian.mittag@uni-tuebingen.de>
References: <200907161738.29913.florian.mittag@uni-tuebingen.de>
Message-ID: <93b45ca50907161833v1d6fed1fyd0030bf37271889d@mail.gmail.com>

I wonder if there is some kind of memory leak or infinite loop?

Have you considered running a profiler?

Also, are you able to parse that sequence when you don't put it into BioSQL. It could be the parser not the BioSQL binding.

- Mark

On Thu, Jul 16, 2009 at 11:38 PM, Florian Mittag wrote:
>
> Hi all!
>
> We try to load Genbank files into our bioseqdb database using BioJava. I
> copy-pasted the code together from tutorials and previous posts on this
> mailinglist. My problems:
>
> 1) It eats huge amounts of memory, so that I needed to increase the heap size
> to 2GB.
>
> 2) Loading the first two files works great, but the third one ran for one two
> hours without completion. Here is my code:
>
> --- snip ---
> // loop over all downloaded *.gbk files starting with the highest number
> System.out.println("Updating chromosome " + chrNo[j] + " ...");
>
> BufferedReader fileIn = new BufferedReader(new FileReader(localFile));
>
> tx = session.beginTransaction();
> GenbankFormat gf = new GenbankFormat();
> SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
> RichSequence seq = null;
>
> gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
> seq = listener.makeRichSequence();
>
> if( seq != null ) {
>     // check, if a sequence with this identifier is already in the DB
>     Query q = session.createQuery(
>         "select be from BioEntry as be where identifier=:identifier");
>     q.setString("identifier",seq.getIdentifier());
>     List entries = q.list();
>     for( Object o : entries ) {
>         // delete the old sequence in the DB
>         BioEntry oldSeq = (BioEntry)o;
>         session.delete("BioEntry", oldSeq);
>     }
>     tx.commit();
>
>     tx = session.beginTransaction();
>     session.save("Sequence", seq);
>
>     System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
> } else {
>     System.out.println("Chromosome " + chrNo[j] + " was NOT updated.\n");
> }
>
> tx.commit();
> --- snap ---
>
>
> This is the generated output:
> --- snip ---
> Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
> Updating chromosome 001807 ...
> Chromosome 001807 was updated.
> Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
> Updating chromosome 000024 ...
> Chromosome 000024 was updated.
> Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
> Updating chromosome 000023 ...
> --- snap ---
>
>
> The files for this are downloaded from Genbank and the file sizes are:
> NC_001807.gbk   58.4 KB
> NC_000024.gbk   70.8 MB
> NC_000023.gbk   190.1 MB
>
> So, I don't see, why loading a 70.8 MB file took less than 2 minutes and a
> 190.1 MB file isn't completed after 2 hours. But during this time, the CPU
> load is almost 100% and there is no significant network or harddisk activity.
>
> When I paused the program (I'm using Eclipse) and looked, where the whole
> processing power is going to, I ended up with the following stacktrace (sorry
> for the unreadable format):
>
> CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
> AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolList)
> line: 1460
> SimpleSymbolList(AbstractSymbolList).seqString() line: 102
> BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequence)
> line: 115
> BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
> SimpleRichSequence(ThinRichSequence).seqString() line: 203
> SimpleRichSequence.getStringSequence() line: 77
> GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
> DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> Method.invoke(Object, Object...) line: 597
> BasicPropertyAccessor$BasicGetter.get(Object) line: 145
> PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object) line: 249
> PojoEntityTuplizer.getPropertyValues(Object) line: 244
> JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(Object,
> EntityMode) line: 3567
> DefaultFlushEntityEventListener.getValues(Object, EntityEntry, EntityMode,
> boolean, SessionImplementor) line: 167
> DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
> DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntities(FlushEvent)
> line: 196
> DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEverythingToExecutions(FlushEvent)
> line: 76
> DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
> SessionImpl.autoFlushIfRequired(Set) line: 970
> SessionImpl.list(String, QueryParameters) line: 1115
> QueryImpl.list() line: 79
> QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
> GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
> DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> Method.invoke(Object, Object...) line: 597
> BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
> RichObjectFactory.getObject(Class, Object[]) line: 107
> GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization,
> RichSeqIOListener, Namespace) line: 450
> UpdateDB_Main.updateChromosome() line: 542
>
>
> Now we go to GenbankFormat.readRichSequence(). It hangs at about line 450, the
> line where it loads a CrossRef object, so I added debug output:
>
> --- snip ---
> // parameter on old feature
> if (key.equals("db_xref")) {
>     Matcher m = dbxp.matcher(val);
>     if (m.matches()) {
>         String dbname = m.group(1);
>         String raccession = m.group(2);
>         if (dbname.equalsIgnoreCase("taxon")) {
>             [...]
>         } else {
>             try {
>                 long starttime = System.currentTimeMillis();
>                 CrossRef cr =
> (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]
> {dbname, raccession, new Integer(0)});
>                 long duration = System.currentTimeMillis() - starttime;
>                 if( duration > 100 ) {
>                     System.out.println("dbname: " + dbname + ", raccession: " + raccession);
>                     System.out.println("  took " + duration + "ms");
>                 }
>                 RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
>                 rlistener.getCurrentFeature().addRankedCrossRef(rcr);
> --- snap ---
>
> Which leads to:
>
> --- snip ---
> dbname: GeneID, raccession: 677739
>   took 3291ms
> dbname: HGNC, raccession: 31847
>   took 2427ms
> dbname: GeneID, raccession: 55344
>   took 2932ms
> dbname: HGNC, raccession: 23148
>   took 2339ms
> dbname: GI, raccession: 94158612
>   took 2418ms
> dbname: GI, raccession: 8922995
>   took 2920ms
> [...]
> --- snap ---
>
> Which are all /db_xref properties of the NC_000023.gbk file. Searching deeper,
> it looks like for every CrossRef object loaded, the whole BioEntry object is
> built and the sequence parsed. But remember, this only happens on chromosome
> 23, not on 24, which has /db_xref, too.
>
> I already spent some time on this, but I can't figure out, what could be the
> cause.
>
>
> Thanks
>   Florian
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From florian.mittag at uni-tuebingen.de Fri Jul 17 08:03:54 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Fri, 17 Jul 2009 14:03:54 +0200
Subject: [Biojava-l] Load Genbank files takes ages
In-Reply-To: <93b45ca50907161833v1d6fed1fyd0030bf37271889d@mail.gmail.com>
References: <200907161738.29913.florian.mittag@uni-tuebingen.de> <93b45ca50907161833v1d6fed1fyd0030bf37271889d@mail.gmail.com>
Message-ID: <200907171403.55631.florian.mittag@uni-tuebingen.de>

Thanks for the quick answer!

On Friday 17 July 2009 03:33, Mark Schreiber wrote:
> I wonder if there is some kind of memory leak or infinite loop?

As far as I can tell, there is no infinite loop, just a loop that takes very, very long because for every parsed feature the whole RichSequence object is reprocessed (especially the conversion of the sequence to characters), which takes about 3 seconds.

> Have you considered running a profiler?

Yes, I have considered this, but the profilers I know for Eclipse are a pain in the a** and don't work, so I will have to use NetBeans or something to do profiling.

I noticed another funny thing: When I let our program skip the first two chromosomes (1807 and 24), then this is the output:

Jul 17, 2009 1:50:36 PM - FINE: Starting update of chromosome 000023
dbname: GeneID, raccession: 100132775
  took 273ms
dbname: CCDS, raccession: CCDS35344.1
  took 452ms
dbname: GeneID, raccession: 644403
  took 283ms
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:216)
    at java.lang.StringBuffer.toString(StringBuffer.java:585)
    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:526)
    at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:542)
    at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:473)
    at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:169)

Harddisk activity is pretty high (probably reading the sequence) and the OutOfMemoryError occurs after about 2 minutes. It seems like loading the other chromosomes before this one somehow changes the behavior.

> Also, are you able to parse that sequence when you don't put it into
> BioSQL. It could be the parser not the BioSQL binding.
>
> - Mark

That's a good idea, I will try this. I don't know if I will have time for this today, but I should be able to give an update next week.

Florian

> On Thu, Jul 16, 2009 at 11:38 PM, Florian Mittag wrote:
> > Hi all!
> >
> > We try to load Genbank files into our bioseqdb database using BioJava. I
> > copy-pasted the code together from tutorials and previous posts on this
> > mailinglist. My problems:
> >
> > 1) It eats huge amounts of memory, so that I needed to increase the heap
> > size to 2GB.
> >
> > 2) Loading the first two files works great, but the third one ran for one
> > two hours without completion.
> > Here is my code:
> >
> > --- snip ---
> > // loop over all downloaded *.gbk files starting with the highest number
> > System.out.println("Updating chromosome " + chrNo[j] + " ...");
> >
> > BufferedReader fileIn = new BufferedReader(new FileReader(localFile));
> >
> > tx = session.beginTransaction();
> > GenbankFormat gf = new GenbankFormat();
> > SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
> > RichSequence seq = null;
> >
> > gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
> > seq = listener.makeRichSequence();
> >
> > if( seq != null ) {
> >     // check, if a sequence with this identifier is already in the DB
> >     Query q = session.createQuery(
> >         "select be from BioEntry as be where
> > identifier=:identifier"); q.setString("identifier",seq.getIdentifier());
> >     List entries = q.list();
> >     for( Object o : entries ) {
> >         // delete the old sequence in the DB
> >         BioEntry oldSeq = (BioEntry)o;
> >         session.delete("BioEntry", oldSeq);
> >     }
> >     tx.commit();
> >
> >     tx = session.beginTransaction();
> >     session.save("Sequence", seq);
> >
> >     System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
> > } else {
> >     System.out.println("Chromosome " + chrNo[j] + " was NOT
> > updated.\n"); }
> >
> > tx.commit();
> > --- snap ---
> >
> >
> > This is the generated output:
> > --- snip ---
> > Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
> > Updating chromosome 001807 ...
> > Chromosome 001807 was updated.
> > Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
> > Updating chromosome 000024 ...
> > Chromosome 000024 was updated.
> > Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
> > Updating chromosome 000023 ...
> > --- snap ---
> >
> >
> > The files for this are downloaded from Genbank and the file sizes are:
> > NC_001807.gbk   58.4 KB
> > NC_000024.gbk   70.8 MB
> > NC_000023.gbk   190.1 MB
> >
> > So, I don't see, why loading a 70.8 MB file took less than 2 minutes and
> > a 190.1 MB file isn't completed after 2 hours. But during this time, the
> > CPU load is almost 100% and there is no significant network or harddisk
> > activity.
> >
> > When I paused the program (I'm using Eclipse) and looked, where the whole
> > processing power is going to, I ended up with the following stacktrace
> > (sorry for the unreadable format):
> >
> > CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
> > AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolLis
> >t) line: 1460
> > SimpleSymbolList(AbstractSymbolList).seqString() line: 102
> > BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequenc
> >e) line: 115
> > BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
> > SimpleRichSequence(ThinRichSequence).seqString() line: 203
> > SimpleRichSequence.getStringSequence() line: 77
> > GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
> > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > Method.invoke(Object, Object...) line: 597
> > BasicPropertyAccessor$BasicGetter.get(Object) line: 145
> > PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object)
> > line: 249 PojoEntityTuplizer.getPropertyValues(Object) line: 244
> > JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(
> >Object, EntityMode) line: 3567
> > DefaultFlushEntityEventListener.getValues(Object, EntityEntry,
> > EntityMode, boolean, SessionImplementor) line: 167
> > DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
> > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntitie
> >s(FlushEvent) line: 196
> > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEveryth
> >ingToExecutions(FlushEvent) line: 76
> > DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
> > SessionImpl.autoFlushIfRequired(Set) line: 970
> > SessionImpl.list(String, QueryParameters) line: 1115
> > QueryImpl.list() line: 79
> > QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
> > GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
> > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > Method.invoke(Object, Object...) line: 597
> > BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
> > RichObjectFactory.getObject(Class, Object[]) line: 107
> > GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization,
> > RichSeqIOListener, Namespace) line: 450
> > UpdateDB_Main.updateChromosome() line: 542
> >
> >
> > Now we go to GenbankFormat.readRichSequence(). It hangs at about line
> > 450, the line where it loads a CrossRef object, so I added debug output:
> >
> > --- snip ---
> > // parameter on old feature
> > if (key.equals("db_xref")) {
> >     Matcher m = dbxp.matcher(val);
> >     if (m.matches()) {
> >         String dbname = m.group(1);
> >         String raccession = m.group(2);
> >         if (dbname.equalsIgnoreCase("taxon")) {
> >             [...]
> >
? ? ? ? ? ? ?} else { > > ? ? ? ? ? ? ? ? ? ? ? ?try { > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?long starttime = > > System.currentTimeMillis(); CrossRef cr = > > (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[] > > {dbname, raccession, new Integer(0)}); > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?long duration = System.currentTimeMillis() > > - starttime; if( duration > 100 ) { > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("dbname: " + > > dbname + ", raccession: " + raccession); System.out.println(" ?took " + > > duration + "ms"); } > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?RankedCrossRef rcr = new > > SimpleRankedCrossRef(cr, ++rcrossrefCount); > > rlistener.getCurrentFeature().addRankedCrossRef(rcr); --- snap --- > > > > Which leads to: > > > > --- snip --- > > dbname: GeneID, raccession: 677739 > > ?took 3291ms > > dbname: HGNC, raccession: 31847 > > ?took 2427ms > > dbname: GeneID, raccession: 55344 > > ?took 2932ms > > dbname: HGNC, raccession: 23148 > > ?took 2339ms > > dbname: GI, raccession: 94158612 > > ?took 2418ms > > dbname: GI, raccession: 8922995 > > ?took 2920ms > > [...] > > --- snap --- > > > > Which are all /db_xref properties of the NC_000023.gbk file. Searching > > deeper, it looks like for every CrossRef object loaded, the whole > > BioEntry object is built and the sequence parsed. But remember, this only > > happens on chromosome 23, not on 24, which has /db_xref, too. > > > > I already spent some time on this, but I can't figure out, what could be > > the cause. > > > > > > Thanks > > ? Florian > > _______________________________________________ > > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Dipl. Inf. 
Florian Mittag Universität Tuebingen WSI-RA, Sand 1 72076 Tuebingen, Germany Phone: +49 7071 / 29 78985 Fax: +49 7071 / 29 5091 From ola.spjuth at farmbio.uu.se Sat Jul 18 10:22:03 2009 From: ola.spjuth at farmbio.uu.se (Ola Spjuth) Date: Sat, 18 Jul 2009 16:22:03 +0200 Subject: [Biojava-l] Load Genbank files takes ages In-Reply-To: <200907171403.55631.florian.mittag@uni-tuebingen.de> References: <200907161738.29913.florian.mittag@uni-tuebingen.de> <93b45ca50907161833v1d6fed1fyd0030bf37271889d@mail.gmail.com> <200907171403.55631.florian.mittag@uni-tuebingen.de> Message-ID: On 17 jul 2009, at 14.03, Florian Mittag wrote: > Thanks for the quick answer! > > On Friday 17 July 2009 03:33, Mark Schreiber wrote: >> Have you considered running a profiler? > > Yes, I have considered this, but the profilers I know for Eclipse > are a pain > in the a** and don't work, so I will have to use NetBeans or > something to do > profiling. I just wanted to comment on this since I had the same experience, and just a few days ago I tried YourKit (http://yourkit.com/). It integrated very nicely with Eclipse and worked extremely well. Best is that they offer free licenses for open source projects. Recommended! /Ola From koen.bruynseels at cropdesign.com Sat Jul 18 12:14:23 2009 From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com) Date: Sat, 18 Jul 2009 18:14:23 +0200 Subject: [Biojava-l] Koen Bruynseels is out of the office. Message-ID: I will be out of the office starting 07/17/2009 and will not return until 07/23/2009. I will respond to your message when I return. From pzgyuanf at gmail.com Sat Jul 18 23:41:27 2009 From: pzgyuanf at gmail.com (pprun) Date: Sun, 19 Jul 2009 11:41:27 +0800 Subject: [Biojava-l] Can I get the additional information, such as chromosome number from "source" feature from genbank file?
Message-ID: Hi, Given a genbank file with feature as this: FEATURES Location/Qualifiers source 1..412 /chromosome="12" /map="12q22-qter" /organism="Homo sapiens" /db_xref="taxon:9606" How can I get the chromosome="12" information? I need it to sort out the sequence by chromosome. Appreciate your help. Pprun From holland at eaglegenomics.com Sun Jul 19 06:16:11 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Sun, 19 Jul 2009 11:16:11 +0100 Subject: [Biojava-l] Can I get the additional information, such as chromosome number from "source" feature from genbank file? In-Reply-To: References: Message-ID: <1247998571.28340.1.camel@buzzybee> It's in the RichAnnotation object associated with the RichFeature inside the parsed RichSequence object (if you're using the BioJavaX GenbankFormat parser). The RichAnnotation is a key/value map - the keys are term objects, which you can find by requesting the term for "chromosome" from the default ontology. You can then search the map for the matching key/value pair. cheers, Richard On Sun, 2009-07-19 at 11:41 +0800, pprun wrote: > Hi, > > Given a genbank file with feature as this: > > FEATURES Location/Qualifiers > source 1..412 > /chromosome="12" > /map="12q22-qter" > /organism="Homo sapiens" > /db_xref="taxon:9606" > > > How can I get the chromosome="12" information? I need it to sort out the > sequence by chromosome. > > Appreciate your help. 
> Pprun > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From pzgyuanf at gmail.com Sun Jul 19 10:26:05 2009 From: pzgyuanf at gmail.com (pprun) Date: Sun, 19 Jul 2009 22:26:05 +0800 Subject: [Biojava-l] Can I get the additional information, such as chromosome number from "source" feature from genbank file? In-Reply-To: <1247998571.28340.1.camel@buzzybee> References: <1247998571.28340.1.camel@buzzybee> Message-ID: <4A632CFD.70200@gmail.com> Thanks Richard for supplying this detail process. Also nice to know you are still working back-stage of Biojava. :) Pprun Richard Holland wrote: > It's in the RichAnnotation object associated with the RichFeature inside > the parsed RichSequence object (if you're using the BioJavaX > GenbankFormat parser). The RichAnnotation is a key/value map - the keys > are term objects, which you can find by requesting the term for > "chromosome" from the default ontology. You can then search the map for > the matching key/value pair. > > cheers, > Richard > > On Sun, 2009-07-19 at 11:41 +0800, pprun wrote: > >>Hi, >> >>Given a genbank file with feature as this: >> >>FEATURES Location/Qualifiers >> source 1..412 >> /chromosome="12" >> /map="12q22-qter" >> /organism="Homo sapiens" >> /db_xref="taxon:9606" >> >> >>How can I get the chromosome="12" information? I need it to sort out the >>sequence by chromosome. >> >>Appreciate your help. 
>>Pprun >> >>_______________________________________________ >>Biojava-l mailing list - Biojava-l at lists.open-bio.org >>http://lists.open-bio.org/mailman/listinfo/biojava-l From andreas at sdsc.edu Mon Jul 20 22:20:35 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Mon, 20 Jul 2009 19:20:35 -0700 Subject: [Biojava-l] Pairwise alignment of protein sequences In-Reply-To: References: Message-ID: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com> Hi David, I patched the SequenceAlignent class in svn. It now displays more scores in the produced alignment image. Also you can now request the strings for the aligned sequences from the outside. Alignment score is the return value from the pairwiseAlignment method.... Hope that helps... Andreas On Thu, Jul 9, 2009 at 11:13 AM, David Zhao wrote: > David Zhao gmail.com> writes: > I found the answer from a post here(http://portal.open- > bio.org/pipermail/biojava-l/2007-February.txt#). In short, I need to use > AlphabetManager.alphabetForName("PROTEIN-TERM"); instead of > AlphabetManager.alphabetForName("PROTEIN"); to parse the matrix file. > Another question though, now I have the alignment, > how do I retrieve score and > mismatch, gap information from it? > ?Time (ms): ? ? 62 > ?Length: ? ? ? ?25 > ?Score: ? ? ? ?180 > ?Query: ? ? ? ?query, ?Length: 25 > ?Target: ? ? ? target, Length: 25 > > > Query: ? ?1 KLFVGGIKEDTEEHHLRDYFEEYGK 25 > ? ? ? ? ? ?| ||||||||||||||||||| ||| > Target: ? 1 KIFVGGIKEDTEEHHLRDYFEQYGK 25 > > Thanks! 
> > David > > > _______________________________________________ > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From andreas at sdsc.edu Tue Jul 21 00:26:40 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Mon, 20 Jul 2009 21:26:40 -0700 Subject: [Biojava-l] feature request Sequence Alignment Message-ID: <59a41c430907202126y66716314n5d82b5a47890c9a2@mail.gmail.com> Hi Andreas, I was working with the Sequence Alignment package today. In particular I would be interested to have an alignment display that looks nice in HTML. Playing around with the code it seems the alignment image generation is closely tied to the actual alignment implementation. Do you think it would be possible to change this a bit and provide a way so there could be multiple ways to print out (display) an alignment? Ideally the core of an alignment would be just a data-container (a bean?) and the alignment calculation would operate on this bean. Then after the alignment has been calculated other objects could be used to provide a print out based on the data in the container-bean. Would that make sense and do you think this would be difficult to implement? Thanks, Andreas (the other) From andreas.draeger at uni-tuebingen.de Tue Jul 21 01:35:22 2009 From: andreas.draeger at uni-tuebingen.de (Andreas Draeger) Date: Tue, 21 Jul 2009 07:35:22 +0200 Subject: [Biojava-l] feature request Sequence Alignment In-Reply-To: <59a41c430907202126y66716314n5d82b5a47890c9a2@mail.gmail.com> References: <59a41c430907202126y66716314n5d82b5a47890c9a2@mail.gmail.com> Message-ID: <4A65539A.4010805@uni-tuebingen.de> Hi Andreas, What you suggest makes sense. I put it on my todo list. I think some derivatives of the existing Alignment interface should be used as objects that can be displayed in HTML. I'll play around with it a bit. Cheers Andreas (the other) -- Dipl.-Bioinform. 
Andreas Dräger Eberhard Karls University Tübingen Center for Bioinformatics (ZBIT) Sand 1 72076 Tübingen Germany Phone: +49-7071-29-70436 Fax: +49-7071-29-5091 From andreas.draeger at uni-tuebingen.de Tue Jul 21 04:25:27 2009 From: andreas.draeger at uni-tuebingen.de (Andreas Draeger) Date: Tue, 21 Jul 2009 10:25:27 +0200 Subject: [Biojava-l] How to parse pairwise alignment result string In-Reply-To: References: Message-ID: <4A657B77.5030904@uni-tuebingen.de> Hi David! Yes, the NeedlemanWunsch class and also the SmithWaterman class should not produce these Strings but rather dedicated Alignment objects that can be further processed more easily. I think they already do produce some alignment object but this seems to be sub-optimal. I am going to improve this but this will take some time. Cheers Andreas -- Dipl.-Bioinform. Andreas Dräger Eberhard Karls University Tübingen Center for Bioinformatics (ZBIT) Sand 1 72076 Tübingen Germany Phone: +49-7071-29-70436 Fax: +49-7071-29-5091 From pzgyuanf at gmail.com Wed Jul 22 10:03:04 2009 From: pzgyuanf at gmail.com (pprun) Date: Wed, 22 Jul 2009 22:03:04 +0800 Subject: [Biojava-l] Request For Feature: let SimpleNamespace or Namespace be Serializable Message-ID: Hi, You know all the rich sequence format parsers, such as readGenbankDNA, has a Namespace parameter. Currently it prevents the related code to be used in RMI framework. What do you think about it? Thanks, Pprun From pzgyuanf at gmail.com Wed Jul 22 10:25:59 2009 From: pzgyuanf at gmail.com (pprun) Date: Wed, 22 Jul 2009 22:25:59 +0800 Subject: [Biojava-l] Pairwise alignment of protein sequences In-Reply-To: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com> References: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com> Message-ID: <4A672177.5040505@gmail.com> Andreas, How about also add the 'quality', 'Percent Identity' and 'Percent Similarity' values into these alignment result as the GAP does?
Thanks, Pprun Andreas Prlic wrote: > Hi David, > > I patched the SequenceAlignent class in svn. It now displays more > scores in the produced alignment image. Also you can now request the > strings for the aligned sequences from the outside. Alignment score is > the return value from the pairwiseAlignment method.... > > Hope that helps... > Andreas > > > On Thu, Jul 9, 2009 at 11:13 AM, David Zhao wrote: > >>David Zhao gmail.com> writes: >>I found the answer from a post here(http://portal.open- >>bio.org/pipermail/biojava-l/2007-February.txt#). In short, I need to use >>AlphabetManager.alphabetForName("PROTEIN-TERM"); instead of >>AlphabetManager.alphabetForName("PROTEIN"); to parse the matrix file. >>Another question though, now I have the alignment, >>how do I retrieve score and >>mismatch, gap information from it? >> Time (ms): 62 >> Length: 25 >> Score: 180 >> Query: query, Length: 25 >> Target: target, Length: 25 >> >> >>Query: 1 KLFVGGIKEDTEEHHLRDYFEEYGK 25 >> | ||||||||||||||||||| ||| >>Target: 1 KIFVGGIKEDTEEHHLRDYFEQYGK 25 >> >>Thanks! 
>> >>David >> >> >>_______________________________________________ >>Biojava-l mailing list - Biojava-l at lists.open-bio.org >>http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From pzgyuanf at gmail.com Wed Jul 22 10:34:49 2009 From: pzgyuanf at gmail.com (pprun) Date: Wed, 22 Jul 2009 22:34:49 +0800 Subject: [Biojava-l] Compile error: unmappable character for encoding UTF-8 Message-ID: Hi there, It has been a long time(years), I got this compile error when I trying to compile source code: GUITools.java:14: unmappable character for encoding UTF-8 * @author Kalle N?slund StructureException.java:29: unmappable character for encoding UTF-8 * @author Andreas Prlic, Thomas Down, Benjamin Schuster-B?ckler I trust you are not UTF-8 for your development environments, I'm also aware that other global open source projects are adopting a convention to solve this problem: by using the '\uxxxx' escape to US-ASCII characters. Sorry! N?slund and Schuster-B?ckler. Pprun From andreas at sdsc.edu Wed Jul 22 11:16:05 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 22 Jul 2009 08:16:05 -0700 Subject: [Biojava-l] Pairwise alignment of protein sequences In-Reply-To: <4A672177.5040505@gmail.com> References: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com> <4A672177.5040505@gmail.com> Message-ID: <59a41c430907220816y33336d8bx76d4f999101476df@mail.gmail.com> Hi PPrun, not sure about how to calculate quality, but the other scores are there now. Andreas On Wed, Jul 22, 2009 at 7:25 AM, pprun wrote: > Andreas, > > How about also add the 'quality', 'Percent Identity' and 'Percent > Similarity' values into these alignment result as the GAP does? > > Thanks, > Pprun > > > Andreas Prlic wrote: > >> Hi David, >> >> I patched the SequenceAlignent class in svn.
It now displays more >> scores in the produced alignment image. Also you can now request the >> strings for the aligned sequences from the outside. Alignment score is >> the return value from the pairwiseAlignment method.... >> >> Hope that helps... >> Andreas >> >> >> On Thu, Jul 9, 2009 at 11:13 AM, David Zhao wrote: >> >>> David Zhao gmail.com> writes: >>> I found the answer from a post here(http://portal.open- >>> bio.org/pipermail/biojava-l/2007-February.txt#). In short, I need to use >>> AlphabetManager.alphabetForName("PROTEIN-TERM"); instead of >>> AlphabetManager.alphabetForName("PROTEIN"); to parse the matrix file. >>> Another question though, now I have the alignment, >>> how do I retrieve score and >>> mismatch, gap information from it? >>> Time (ms): ? ? 62 >>> Length: ? ? ? ?25 >>> Score: ? ? ? ?180 >>> Query: ? ? ? ?query, ?Length: 25 >>> Target: ? ? ? target, Length: 25 >>> >>> >>> Query: ? ?1 KLFVGGIKEDTEEHHLRDYFEEYGK 25 >>> ? ? ? ? ?| ||||||||||||||||||| ||| >>> Target: ? 1 KIFVGGIKEDTEEHHLRDYFEQYGK 25 >>> >>> Thanks! >>> >>> David >>> >>> >>> _______________________________________________ >>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>> >> >> >> _______________________________________________ >> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > From holland at eaglegenomics.com Wed Jul 22 11:20:43 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 22 Jul 2009 16:20:43 +0100 Subject: [Biojava-l] Request For Feature: let SimpleNamespace or Namespace be Serializable In-Reply-To: References: Message-ID: <1248276043.28124.1.camel@buzzybee> It's there because all sequences have to belong to a namespace (to prevent duplicate identifiers from different sources from clashing). 
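To illustrate the request in this thread's subject (letting Namespace be Serializable so that it can be passed through RMI), here is a minimal, self-contained Java sketch. It does not use BioJava: PlainNamespace and SerializableNamespace are hypothetical stand-ins, not the real Namespace or SimpleNamespace classes. The only assumption is that RMI copies non-remote arguments with standard Java object serialization, so the same failure can be reproduced with an ObjectOutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class NamespaceSerializationSketch {

    // Hypothetical stand-in for a namespace class that does NOT implement Serializable.
    public static class PlainNamespace {
        public final String name;
        public PlainNamespace(String name) { this.name = name; }
    }

    // Hypothetical stand-in for the proposed change: same data, but Serializable.
    public static class SerializableNamespace implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String name;
        public SerializableNamespace(String name) { this.name = name; }
    }

    // Writes the object with standard Java serialization and returns the bytes.
    public static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    // Reads a single object back from serialized bytes.
    public static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    // Returns false if the object's class is rejected with NotSerializableException.
    public static boolean canSerialize(Object o) {
        try {
            serialize(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Serializes and deserializes a SerializableNamespace, returning the name of the copy.
    public static String roundTripName(String name) {
        try {
            Object copy = deserialize(serialize(new SerializableNamespace(name)));
            return ((SerializableNamespace) copy).name;
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new PlainNamespace("genbank")));        // prints false
        System.out.println(canSerialize(new SerializableNamespace("genbank"))); // prints true
        System.out.println(roundTripName("genbank"));                           // prints genbank
    }
}
```

In a real class every non-transient field would also have to be serializable, so marking the interface alone may not be enough if an implementation holds references to non-serializable objects.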
On Wed, 2009-07-22 at 22:03 +0800, pprun wrote: > Hi, > > You know all the rich sequence format parsers, such as readGenbankDNA, > has a Namespace parameter. Currently it prevents the related code to be > used in RMI framework. > > What do you think about it? > > Thanks, > Pprun > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From pzgyuanf at gmail.com Wed Jul 22 11:36:49 2009 From: pzgyuanf at gmail.com (pprun) Date: Wed, 22 Jul 2009 23:36:49 +0800 Subject: [Biojava-l] Pairwise alignment of protein sequences In-Reply-To: <59a41c430907220816y33336d8bx76d4f999101476df@mail.gmail.com> References: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com> <4A672177.5040505@gmail.com> <59a41c430907220816y33336d8bx76d4f999101476df@mail.gmail.com> Message-ID: <4A673211.20503@gmail.com> Great! Just saw the new add-in after using the latest code in Trunk. Cheers Pprun Andreas Prlic wrote: > Hi PPrun, > > not sure about how to calculate quality, but the other scores are there now. > > Andreas > > On Wed, Jul 22, 2009 at 7:25 AM, pprun wrote: > >>Andreas, >> >>How about also add the 'quality', 'Percent Identity' and 'Percent >>Similarity' values into these alignment result as the GAP does? >> >>Thanks, >>Pprun >> >> >>Andreas Prlic wrote: >> >> >>>Hi David, >>> >>>I patched the SequenceAlignent class in svn. It now displays more >>>scores in the produced alignment image. Also you can now request the >>>strings for the aligned sequences from the outside. Alignment score is >>>the return value from the pairwiseAlignment method.... >>> >>>Hope that helps... 
>>>Andreas >>> >>> >>>On Thu, Jul 9, 2009 at 11:13 AM, David Zhao wrote: >>> >>> >>>>David Zhao gmail.com> writes: >>>>I found the answer from a post here(http://portal.open- >>>>bio.org/pipermail/biojava-l/2007-February.txt#). In short, I need to use >>>>AlphabetManager.alphabetForName("PROTEIN-TERM"); instead of >>>>AlphabetManager.alphabetForName("PROTEIN"); to parse the matrix file. >>>>Another question though, now I have the alignment, >>>>how do I retrieve score and >>>>mismatch, gap information from it? >>>>Time (ms): 62 >>>>Length: 25 >>>>Score: 180 >>>>Query: query, Length: 25 >>>>Target: target, Length: 25 >>>> >>>> >>>>Query: 1 KLFVGGIKEDTEEHHLRDYFEEYGK 25 >>>> | ||||||||||||||||||| ||| >>>>Target: 1 KIFVGGIKEDTEEHHLRDYFEQYGK 25 >>>> >>>>Thanks!
>>>> >>>>David >>>> >>>> >>>>_______________________________________________ >>>>Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> >>> >>> >>>_______________________________________________ >>>Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>http://lists.open-bio.org/mailman/listinfo/biojava-l >>> >> >> > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From pzgyuanf at gmail.com Wed Jul 22 11:39:15 2009 From: pzgyuanf at gmail.com (pprun) Date: Wed, 22 Jul 2009 23:39:15 +0800 Subject: [Biojava-l] Request For Feature: let SimpleNamespace or Namespace be Serializable In-Reply-To: <1248276043.28124.1.camel@buzzybee> References: <1248276043.28124.1.camel@buzzybee> Message-ID: <4A6732A3.7040203@gmail.com> I don't want to remove this parameter from the API, I proposed the Namespace implements Serializable interface. Then the code can be used in RMI framework. Pprun Richard Holland wrote: > It's there because all sequences have to belong to a namespace (to > prevent duplicate identifiers from different sources from clashing). > > On Wed, 2009-07-22 at 22:03 +0800, pprun wrote: > >>Hi, >> >>You know all the rich sequence format parsers, such as readGenbankDNA, >>has a Namespace parameter. Currently it prevents the related code to be >>used in RMI framework. >> >>What do you think about it? 
>> >>Thanks, >>Pprun >> >>_______________________________________________ >>Biojava-l mailing list - Biojava-l at lists.open-bio.org >>http://lists.open-bio.org/mailman/listinfo/biojava-l From holland at eaglegenomics.com Wed Jul 22 11:53:01 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 22 Jul 2009 15:53:01 +0000 Subject: [Biojava-l] Request For Feature: let SimpleNamespace or Namespace be Serializable In-Reply-To: <4A6732A3.7040203@gmail.com> References: <1248276043.28124.1.camel@buzzybee> <4A6732A3.7040203@gmail.com> Message-ID: <1248277981.28124.18.camel@buzzybee> OK sounds good. On Wed, 2009-07-22 at 23:39 +0800, pprun wrote: > I don't want to remove this parameter from the API, > I proposed the Namespace implements Serializable interface. > Then the code can be used in RMI framework.
> > Pprun > > > Richard Holland wrote: > > > It's there because all sequences have to belong to a namespace (to > > prevent duplicate identifiers from different sources from clashing). > > > > On Wed, 2009-07-22 at 22:03 +0800, pprun wrote: > > > >>Hi, > >> > >>You know all the rich sequence format parsers, such as readGenbankDNA, > >>has a Namespace parameter. Currently it prevents the related code to be > >>used in RMI framework. > >> > >>What do you think about it? > >> > >>Thanks, > >>Pprun > >> > >>_______________________________________________ > >>Biojava-l mailing list - Biojava-l at lists.open-bio.org > >>http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From andreas at sdsc.edu Wed Jul 22 13:25:12 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 22 Jul 2009 10:25:12 -0700 Subject: [Biojava-l] Compile error: unmappable character for encoding UTF-8 In-Reply-To: References: Message-ID: <59a41c430907221025vc6163c0mf39e0fb4276a3d2@mail.gmail.com> Hi, I have not encountered this problem before, I suppose this is because my standard encoding is ISO-8859-1. Any suggestions for how to set the default encoding for the files? Andreas On Wed, Jul 22, 2009 at 7:34 AM, pprun wrote: > Hi there, > > It has been a long time(years), I got this compile error when I trying to > compile source code: > > GUITools.java:14: unmappable character for encoding UTF-8 > * @author Kalle N?slund > > StructureException.java:29: unmappable character for encoding UTF-8 > * @author Andreas Prlic, Thomas Down, Benjamin Schuster-B?ckler > > > I trust you are not UTF-8 for your development environments, > I'm also aware that other global open source projects are adopting a > convention to solve this problem: by using the '\uxxxx' escape to US-ASCII > characters. > > > Sorry! N?slund and Schuster-B?ckler.
> > Pprun > > _______________________________________________ > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From holland at eaglegenomics.com Wed Jul 22 13:32:46 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 22 Jul 2009 18:32:46 +0100 Subject: [Biojava-l] Compile error: unmappable character for encoding UTF-8 In-Reply-To: <59a41c430907221025vc6163c0mf39e0fb4276a3d2@mail.gmail.com> References: <59a41c430907221025vc6163c0mf39e0fb4276a3d2@mail.gmail.com> Message-ID: <1248283966.28124.41.camel@buzzybee> It's usually an operating system thing. The encoding used relates entirely to the way the user has chosen to save/read the file on disk after they've transferred it from our repository (which is encoded correctly). In this case, I expect the user's OS default is UTF-8 and unless they specify otherwise, all files get saved in that encoding. Having said that, the \uxxxx suggestion is not a bad idea. On Wed, 2009-07-22 at 10:25 -0700, Andreas Prlic wrote: > Hi, > > I have not encountered this problem before, I suppose this is because > my standard encoding is ISO-8859-1. Any suggestions for how to set the > default encoding for the files? > > Andreas > > > On Wed, Jul 22, 2009 at 7:34 AM, pprun wrote: > > Hi there, > > > > It has been a long time(years), I got this compile error when I trying to > > compile source code: > > > > GUITools.java:14: unmappable character for encoding UTF-8 > > * @author Kalle N?slund > > > > StructureException.java:29: unmappable character for encoding UTF-8 > > * @author Andreas Prlic, Thomas Down, Benjamin Schuster-B?ckler > > > > > > I trust you are not UTF-8 for your develpment environments, > > I'm also awared that other global opern source projects are adopting a > > convention to solve this problem: by using the '\uxxxx' escape to US-ASCII > > characters. > > > > > > Sorry! N?slund and Schuster-B?ckler. 
> > > > Pprun > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From andreas.draeger at uni-tuebingen.de Wed Jul 22 17:25:25 2009 From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=) Date: Wed, 22 Jul 2009 23:25:25 +0200 Subject: [Biojava-l] Pairwise alignment of protein sequences In-Reply-To: <59a41c430907220816y33336d8bx76d4f999101476df@mail.gmail.com> References: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com> <4A672177.5040505@gmail.com> <59a41c430907220816y33336d8bx76d4f999101476df@mail.gmail.com> Message-ID: <4A6783C5.7040409@uni-tuebingen.de> Hi guys, > not sure about how to calculate quality, but the other scores are there now. > >> How about also add the 'quality', 'Percent Identity' and 'Percent >> Similarity' values into these alignment result as the GAP does >> Yes, calculating "quality" or "similarity" is a bit unclear. The score is actually the measurement for similarity and therefore also kind of quality. What can be done is to add a feature for the percent identity. However, we have to distinguish between two different things: An alphabet may contain matching symbols, i.e., symbols that are considered equivalent, and identical symbols. This fact makes it a bit harder to calculate the identity because there is the question if we should consider matching symbols identical. Cheers, Andreas -- Dipl.-Bioinform. 
Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany
Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From andreas.draeger at uni-tuebingen.de Wed Jul 22 17:30:12 2009
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Wed, 22 Jul 2009 23:30:12 +0200
Subject: [Biojava-l] Compile error: unmappable character for encoding UTF-8
In-Reply-To: <1248283966.28124.41.camel@buzzybee>
References: <59a41c430907221025vc6163c0mf39e0fb4276a3d2@mail.gmail.com> <1248283966.28124.41.camel@buzzybee>
Message-ID: <4A6784E4.3010105@uni-tuebingen.de>

Richard Holland wrote:
> It's usually an operating system thing. The encoding used relates
> entirely to the way the user has chosen to save/read the file on disk
> after they've transferred it from our repository (which is encoded
> correctly). In this case, I expect the user's OS default is UTF-8 and
> unless they specify otherwise, all files get saved in that encoding.
>
> Having said that, the \uxxxx suggestion is not a bad idea.
>
>
>>> StructureException.java:29: unmappable character for encoding UTF-8
>>> * @author Andreas Prlic, Thomas Down, Benjamin Schuster-B?ckler
>>>

Hi guys,

Actually, special characters in the comment fields should be encoded with the appropriate HTML codes. In this case I guess it should read "Benjamin Schuster-B&uuml;ckler" to obtain the German u-umlaut. To avoid such problems, people should not simply insert any special character but look it up in designated HTML code tables and use the correct code.

Cheers,
Andreas

--
Dipl.-Bioinform.
Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany
Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From jp at javaclass.co.uk Thu Jul 23 07:15:47 2009
From: jp at javaclass.co.uk (JP)
Date: Thu, 23 Jul 2009 12:15:47 +0100
Subject: [Biojava-l] Ontology OBO is_a TERMs
Message-ID: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com>

Hi there at Biojava,

I have an ontology file (from www.geneontology.org, gene_ontology.1_2.obo).
A typical entry for a term is:

[Term]
id: GO:0000025
name: maltose catabolic process
namespace: biological_process
def: "The chemical reactions and pathways resulting in the breakdown of the disaccharide maltose (4-O-alpha-D-glucopyranosyl-D-glucopyranose)." [GOC:jl, ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"]
subset: gosubset_prok
synonym: "malt sugar catabolic process" EXACT []
synonym: "malt sugar catabolism" EXACT []
synonym: "maltose breakdown" EXACT []
synonym: "maltose degradation" EXACT []
synonym: "maltose hydrolysis" NARROW []
xref: MetaCyc:MALTOSECAT-PWY
is_a: GO:0000023 ! maltose metabolic process
is_a: GO:0046352 ! disaccharide catabolic process

I am reading this with the code suggested in:
http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse
I would like to get the is_a entries (as Term) - is this possible? I tried to find this everywhere (annotations?) but could not find it (Google searches included).

Many Thanks
JP

From peter.midford at gmail.com Thu Jul 23 11:19:10 2009
From: peter.midford at gmail.com (Peter Midford)
Date: Thu, 23 Jul 2009 11:19:10 -0400
Subject: [Biojava-l] Ontology OBO is_a TERMs
In-Reply-To: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com>
References: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com>
Message-ID:

JP,
Looking at the code for OboFileHandler.java, fresh from svn, it looks like they're presently being dropped on the floor.
Perhaps someone should either implement obo restrictions or links or build some triples here, as they seem to be used for the rest of the ontology code.

            } else if (key.equals(IS_A) ||
                    key.equals(RELATIONSHIP) ||
                    key.equals(DISJOINT_FROM) ||
                    key.equals(INTERSECTION_OF) ||
                    key.equals(SUBSET)) {
                //TODO: deal with relationships

            } else if (key.equals(COMMENT)){

Peter

On Jul 23, 2009, at 7:15, JP wrote:

> Hi there at Biojava,
>
> I have an ontology file (from www.geneontology.org, gene_ontology.1_2.obo).
> A typical entry for a term is:
>
> [Term]
> id: GO:0000025
> name: maltose catabolic process
> namespace: biological_process
> def: "The chemical reactions and pathways resulting in the breakdown
> of the
> disaccharide maltose (4-O-alpha-D-glucopyranosyl-D-glucopyranose)." [GOC:jl,
> ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular
> Biology"]
> subset: gosubset_prok
> synonym: "malt sugar catabolic process" EXACT []
> synonym: "malt sugar catabolism" EXACT []
> synonym: "maltose breakdown" EXACT []
> synonym: "maltose degradation" EXACT []
> synonym: "maltose hydrolysis" NARROW []
> xref: MetaCyc:MALTOSECAT-PWY
> is_a: GO:0000023 ! maltose metabolic process
> is_a: GO:0046352 ! disaccharide catabolic process
>
> I am reading this with the code suggested in:
> http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse
> I would like to get the is_a entries (as Term) - is this possible ?
> I tried
> to find this everywhere (annotations?) but find it (google searches
> included).
>
> Many Thanks
> JP
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

Peter E.
Midford Mesquite Developer Peter.Midford at gmail.com From jp at javaclass.co.uk Thu Jul 23 11:29:02 2009 From: jp at javaclass.co.uk (JP) Date: Thu, 23 Jul 2009 16:29:02 +0100 Subject: [Biojava-l] Ontology OBO is_a TERMs In-Reply-To: References: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com> Message-ID: <4adc29060907230829t60ec715h31b7b8b22c041e94@mail.gmail.com> I never quite got this Peter, what is a triple ? Could these simply be considered as annotations ? Or are you thinking in the lines of building hierarchies out of these (I take it this is the most common task). These relationships are *fundamental* for any work of ontology. 2009/7/23 Peter Midford > JP, Looking at the code for OboFileHandler.java, fresh from svn, it > looks like they're presently being dropped on the floor. Perhaps someone > should either implement obo restrictions or links or build some triples > here, as they seem to be used for the rest of the ontology code. > > > > } else if (key.equals(IS_A) || > key.equals(RELATIONSHIP) || > key.equals(DISJOINT_FROM) || > key.equals(INTERSECTION_OF) || > key.equals(SUBSET)) { > //TODO: deal with relationships > > > > } else if (key.equals(COMMENT)){ > > > Peter > > On Jul 23, 2009, at 7:15, JP wrote: > > Hi there at Biojava, > > I have an ontology file (from www.geneontology.org, > gene_ontology.1_2.obo). > A typical entry for a term is: > > [Term] > id: GO:0000025 > name: maltose catabolic process > namespace: biological_process > def: "The chemical reactions and pathways resulting in the breakdown of the > disaccharide maltose (4-O-alpha-D-glucopyranosyl-D-glucopyranose)." 
> [GOC:jl, > ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"] > subset: gosubset_prok > synonym: "malt sugar catabolic process" EXACT [] > synonym: "malt sugar catabolism" EXACT [] > synonym: "maltose breakdown" EXACT [] > synonym: "maltose degradation" EXACT [] > synonym: "maltose hydrolysis" NARROW [] > xref: MetaCyc:MALTOSECAT-PWY > is_a: GO:0000023 ! maltose metabolic process > is_a: GO:0046352 ! disaccharide catabolic process > > I am reading this with the code suggested in: > http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse > I would like to get the is_a entries (as Term) - is this possible ? I tried > to find this everywhere (annotations?) but find it (google searches > included). > > Many Thanks > JP > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > Peter E. Midford > Mesquite Developer > Peter.Midford at gmail.com > > > > > From peter.midford at gmail.com Thu Jul 23 11:39:56 2009 From: peter.midford at gmail.com (Peter Midford) Date: Thu, 23 Jul 2009 11:39:56 -0400 Subject: [Biojava-l] Ontology OBO is_a TERMs In-Reply-To: <4adc29060907230829t60ec715h31b7b8b22c041e94@mail.gmail.com> References: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com> <4adc29060907230829t60ec715h31b7b8b22c041e94@mail.gmail.com> Message-ID: <0903E83C-81BA-427C-9176-0554E9642094@gmail.com> JP, No, these are not just annotations to terms, and the code jumbles together several things that should be separated. To properly handle the IS_A key, you will have to build the hierarchy, which you can do OBO style using links or restrictions (which I believe are a subclass of links in OBO, rather than adding an intermediate class to the ontology) or OWL style using triples (Subject, Predicate, Object) where is_a would be your predicate. I assume the other key values that look like set operations are for building restrictions. 
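[Editor's sketch] The triple idea described above can be illustrated with a few lines of plain Java. This is only a minimal sketch under stated assumptions: the class IsATriples, the nested Triple type and the helper methods are hypothetical names made up for illustration, and none of this is the BioJava ontology API or the OboFileHandler code. It assumes the raw is_a values have already been read as strings of the form "GO:0000023 ! maltose metabolic process".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of BioJava.
public class IsATriples {

    /** One (subject, predicate, object) statement, e.g. GO:0000025 is_a GO:0000023. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    /** Strips the trailing "! human readable name" comment from an is_a value. */
    static String targetId(String value) {
        int bang = value.indexOf('!');
        String id = (bang >= 0) ? value.substring(0, bang) : value;
        return id.trim();
    }

    /** Builds one is_a triple per raw is_a value of the given term. */
    static List<Triple> isATriples(String termId, List<String> isAValues) {
        List<Triple> triples = new ArrayList<Triple>();
        for (String value : isAValues) {
            triples.add(new Triple(termId, "is_a", targetId(value)));
        }
        return triples;
    }

    /** Returns the direct parents of a term, i.e. the objects of its is_a triples. */
    static List<String> parentsOf(String termId, List<Triple> triples) {
        List<String> parents = new ArrayList<String>();
        for (Triple t : triples) {
            if (t.subject.equals(termId) && t.predicate.equals("is_a")) {
                parents.add(t.object);
            }
        }
        return parents;
    }

    public static void main(String[] args) {
        List<Triple> triples = isATriples("GO:0000025", Arrays.asList(
                "GO:0000023 ! maltose metabolic process",
                "GO:0046352 ! disaccharide catabolic process"));
        for (Triple t : triples) {
            System.out.println(t);
        }
        // prints [GO:0000023, GO:0046352]
        System.out.println(parentsOf("GO:0000025", triples));
    }
}
```

Walking parentsOf repeatedly from a term up to the root would give the hierarchy; whether a flat list of triples is memory-efficient enough for the full Gene Ontology is a separate question that this sketch does not address.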
Peter On Jul 23, 2009, at 11:29, JP wrote: > I never quite got this Peter, what is a triple ? > Could these simply be considered as annotations ? Or are you > thinking in the lines of building hierarchies out of these (I take > it this is the most common task). > > These relationships are *fundamental* for any work of ontology. > > 2009/7/23 Peter Midford > JP, > Looking at the code for OboFileHandler.java, fresh from svn, > it looks like they're presently being dropped on the floor. Perhaps > someone should either implement obo restrictions or links or build > some triples here, as they seem to be used for the rest of the > ontology code. > > > > } else if (key.equals(IS_A) || > key.equals(RELATIONSHIP) || > key.equals(DISJOINT_FROM) || > key.equals(INTERSECTION_OF) || > key.equals(SUBSET)) { > //TODO: deal with relationships > > > } else if (key.equals(COMMENT)){ > > > Peter > > On Jul 23, 2009, at 7:15, JP wrote: > >> Hi there at Biojava, >> >> I have an ontology file (from www.geneontology.org, gene_ontology. >> 1_2.obo). >> A typical entry for a term is: >> >> [Term] >> id: GO:0000025 >> name: maltose catabolic process >> namespace: biological_process >> def: "The chemical reactions and pathways resulting in the >> breakdown of the >> disaccharide maltose (4-O-alpha-D-glucopyranosyl-D- >> glucopyranose)." [GOC:jl, >> ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular >> Biology"] >> subset: gosubset_prok >> synonym: "malt sugar catabolic process" EXACT [] >> synonym: "malt sugar catabolism" EXACT [] >> synonym: "maltose breakdown" EXACT [] >> synonym: "maltose degradation" EXACT [] >> synonym: "maltose hydrolysis" NARROW [] >> xref: MetaCyc:MALTOSECAT-PWY >> is_a: GO:0000023 ! maltose metabolic process >> is_a: GO:0046352 ! disaccharide catabolic process >> >> I am reading this with the code suggested in: >> http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse >> I would like to get the is_a entries (as Term) - is this possible ? 
>> I tried
>> to find this everywhere (annotations?) but find it (google searches
>> included).
>>
>> Many Thanks
>> JP
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> Peter E. Midford
> Mesquite Developer
> Peter.Midford at gmail.com
>

Peter E. Midford
Mesquite Developer
Peter.Midford at gmail.com

From andreas at sdsc.edu Thu Jul 23 13:37:27 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 23 Jul 2009 10:37:27 -0700
Subject: [Biojava-l] Ontology OBO is_a TERMs
In-Reply-To: <0903E83C-81BA-427C-9176-0554E9642094@gmail.com>
References: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com> <4adc29060907230829t60ec715h31b7b8b22c041e94@mail.gmail.com> <0903E83C-81BA-427C-9176-0554E9642094@gmail.com>
Message-ID: <59a41c430907231037n26d3f760v1642c69ba660a987@mail.gmail.com>

I am the one who added the OboFileHandler (based on some original code from obo-edit). At that time I was not sure how best to build up the data structure representing the relationships in a memory-efficient way, so I left it out. Does anybody already have a solution from another project that we could use here?

Both links and triples sound like reasonable approaches to me. I think the original idea in the Ontology framework was to support triples. I can have a look at how difficult it would be to build up the hierarchy using that (after I have added some feature requests for the structure modules...).

Andreas

On Thu, Jul 23, 2009 at 8:39 AM, Peter Midford wrote:
> JP,
>     No, these are not just annotations to terms, and the code jumbles
> together several things that should be separated.  To properly handle the
> IS_A key, you will have to build the hierarchy, which you can do OBO style
> using links or restrictions (which I believe are a subclass of links in OBO,
> rather than adding an intermediate class to the ontology) or OWL style using
> triples (Subject, Predicate, Object) where is_a would be your predicate.  I
> assume the other key values that look like set operations are for building
> restrictions.
>
> Peter
>
>
> On Jul 23, 2009, at 11:29, JP wrote:
>
>> I never quite got this Peter, what is a triple ?
>> Could these simply be considered as annotations ?  Or are you thinking in
>> the lines of building hierarchies out of these (I take it this is the most
>> common task).
>>
>> These relationships are *fundamental* for any work of ontology.
>>
>> 2009/7/23 Peter Midford
>> JP,
>>     Looking at the code for OboFileHandler.java, fresh from svn, it
>> looks like they're presently being dropped on the floor.  Perhaps someone
>> should either implement obo restrictions or links or build some triples
>> here, as they seem to be used for the rest of the ontology code.
>>
>>
>>
>>             } else if (key.equals(IS_A) ||
>>                     key.equals(RELATIONSHIP) ||
>>                     key.equals(DISJOINT_FROM) ||
>>                     key.equals(INTERSECTION_OF) ||
>>                     key.equals(SUBSET)) {
>>                 //TODO: deal with relationships
>>
>>
>>             } else if (key.equals(COMMENT)){
>>
>>
>> Peter
>>
>> On Jul 23, 2009, at 7:15, JP wrote:
>>
>>> Hi there at Biojava,
>>>
>>> I have an ontology file (from www.geneontology.org,
>>> gene_ontology.1_2.obo).
>>> A typical entry for a term is:
>>>
>>> [Term]
>>> id: GO:0000025
>>> name: maltose catabolic process
>>> namespace: biological_process
>>> def: "The chemical reactions and pathways resulting in the breakdown of
>>> the
>>> disaccharide maltose (4-O-alpha-D-glucopyranosyl-D-glucopyranose)."
>>> [GOC:jl,
>>> ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular
>>> Biology"]
>>> subset: gosubset_prok
>>> synonym: "malt sugar catabolic process" EXACT []
>>> synonym: "malt sugar catabolism" EXACT []
>>> synonym: "maltose breakdown" EXACT []
>>> synonym: "maltose degradation" EXACT []
>>> synonym: "maltose hydrolysis" NARROW []
>>> xref: MetaCyc:MALTOSECAT-PWY
>>> is_a: GO:0000023 ! maltose metabolic process
>>> is_a: GO:0046352 ! disaccharide catabolic process
>>>
>>> I am reading this with the code suggested in:
>>> http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse
>>> I would like to get the is_a entries (as Term) - is this possible ? I
>>> tried
>>> to find this everywhere (annotations?) but find it (google searches
>>> included).
>>>
>>> Many Thanks
>>> JP
>>> _______________________________________________
>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>> Peter E. Midford
>> Mesquite Developer
>> Peter.Midford at gmail.com
>>
>
> Peter E. Midford
> Mesquite Developer
> Peter.Midford at gmail.com
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From florian.mittag at uni-tuebingen.de Fri Jul 24 05:58:30 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Fri, 24 Jul 2009 11:58:30 +0200
Subject: [Biojava-l] Ontology OBO is_a TERMs
In-Reply-To: <0903E83C-81BA-427C-9176-0554E9642094@gmail.com>
References: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com> <4adc29060907230829t60ec715h31b7b8b22c041e94@mail.gmail.com> <0903E83C-81BA-427C-9176-0554E9642094@gmail.com>
Message-ID: <200907241158.30719.florian.mittag@uni-tuebingen.de>

On Thursday, 23. July 2009 17:39, Peter Midford wrote:
> [...] To properly handle the IS_A key, you will have to build the hierarchy,
> which you can do OBO style using links or restrictions (which I believe are
> a subclass of links in OBO, rather than adding an intermediate class to
> the ontology) or OWL style using triples (Subject, Predicate, Object)
> where is_a would be your predicate. [...]

I think you mean RDF-style triples; no need to make it more complicated than necessary ;-) That said, there are some restrictions that can only be expressed in OWL.

Regards,
Florian

> On Jul 23, 2009, at 11:29, JP wrote:
> > I never quite got this Peter, what is a triple ?
> > Could these simply be considered as annotations ? Or are you
> > thinking in the lines of building hierarchies out of these (I take
> > it this is the most common task).
> >
> > These relationships are *fundamental* for any work of ontology.
> >
> > 2009/7/23 Peter Midford
> > JP,
> > Looking at the code for OboFileHandler.java, fresh from svn,
> > it looks like they're presently being dropped on the floor.
Perhaps > > someone should either implement obo restrictions or links or build > > some triples here, as they seem to be used for the rest of the > > ontology code. > > > > > > > > } else if (key.equals(IS_A) || > > key.equals(RELATIONSHIP) || > > key.equals(DISJOINT_FROM) || > > key.equals(INTERSECTION_OF) || > > key.equals(SUBSET)) { > > //TODO: deal with relationships > > > > > > } else if (key.equals(COMMENT)){ > > > > > > Peter > > > > On Jul 23, 2009, at 7:15, JP wrote: > >> Hi there at Biojava, > >> > >> I have an ontology file (from www.geneontology.org, gene_ontology. > >> 1_2.obo). > >> A typical entry for a term is: > >> > >> [Term] > >> id: GO:0000025 > >> name: maltose catabolic process > >> namespace: biological_process > >> def: "The chemical reactions and pathways resulting in the > >> breakdown of the > >> disaccharide maltose (4-O-alpha-D-glucopyranosyl-D- > >> glucopyranose)." [GOC:jl, > >> ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular > >> Biology"] > >> subset: gosubset_prok > >> synonym: "malt sugar catabolic process" EXACT [] > >> synonym: "malt sugar catabolism" EXACT [] > >> synonym: "maltose breakdown" EXACT [] > >> synonym: "maltose degradation" EXACT [] > >> synonym: "maltose hydrolysis" NARROW [] > >> xref: MetaCyc:MALTOSECAT-PWY > >> is_a: GO:0000023 ! maltose metabolic process > >> is_a: GO:0046352 ! disaccharide catabolic process > >> > >> I am reading this with the code suggested in: > >> http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse > >> I would like to get the is_a entries (as Term) - is this possible ? > >> I tried > >> to find this everywhere (annotations?) but find it (google searches > >> included). > >> > >> Many Thanks > >> JP > >> _______________________________________________ > >> Biojava-l mailing list - Biojava-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > Peter E. Midford > > Mesquite Developer > > Peter.Midford at gmail.com > > Peter E. 
Midford
> Mesquite Developer
> Peter.Midford at gmail.com
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985
Fax: +49 7071 / 29 5091

From koen.bruynseels at cropdesign.com Fri Jul 24 12:49:34 2009
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Fri, 24 Jul 2009 18:49:34 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID:

I will be out of the office starting 07/24/2009 and will not return until 08/09/2009. I will respond to your message when I return.

From florian.mittag at uni-tuebingen.de Fri Jul 24 13:05:22 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Fri, 24 Jul 2009 19:05:22 +0200
Subject: [Biojava-l] Load Genbank files takes ages
In-Reply-To: <200907171403.55631.florian.mittag@uni-tuebingen.de>
References: <200907161738.29913.florian.mittag@uni-tuebingen.de> <93b45ca50907161833v1d6fed1fyd0030bf37271889d@mail.gmail.com> <200907171403.55631.florian.mittag@uni-tuebingen.de>
Message-ID: <200907241905.22390.florian.mittag@uni-tuebingen.de>

Hi all,

this topic gets a little bit complicated, so I will try to summarize the status quo:

If I parse the .gbk files without storing the resulting RichSequence objects into the BioSQL database, the program will crash with an OutOfMemory exception at chromosome 23, but this will happen fast, so no problem with the parsing itself.

If I parse the .gbk files and store them in the DB, the program will enter an almost-infinite loop, where it rebuilds objects over and over again (without reading any files). Since this process is likely to consume more memory than without storing it in the DB, I expect it to crash after a long time with the same OutOfMemory exception.

The profiler didn't reveal anything new or helpful.
It confirmed my observation that the method to construct sequence objects is called over and over again, but it didn't reveal why. The memory profiling showed nothing either, but since it only occurs when two other chromosomes were parsed and stored in the DB before that, I assume it is a problem with Hibernate and its caching behavior.

Because of the memory problems, I'll postpone the investigation of the almost-infinite loop until I have resolved the memory problem (for which I will open a new thread). Unless anybody has another idea ;-)

Florian

On Friday, 17. July 2009 14:03, Florian Mittag wrote:
> On Friday 17 July 2009 03:33, Mark Schreiber wrote:
> > Have you considered running a profiler?
>
> Yes, I have considered this, but the profilers I know for Eclipse are a
> pain in the a** and don't work, so I will have to use NetBeans or something
> to do profiling.
>
> I noticed another funny thing:
> When I let our program skip the first two chromosomes (1807 and 24),
> then this is the output:
>
> Jul 17, 2009 1:50:36 PM - FINE: Starting update of chromosome 000023
> dbname: GeneID, raccession: 100132775
>  took 273ms
> dbname: CCDS, raccession: CCDS35344.1
>  took 452ms
> dbname: GeneID, raccession: 644403
>  took 283ms
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOfRange(Arrays.java:3209)
>     at java.lang.String.<init>(String.java:216)
>     at java.lang.StringBuffer.toString(StringBuffer.java:585)
>     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:526)
>     at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:542)
>     at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:473)
>     at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:169)
>
> Harddisk activity is pretty high (probably reading the sequence) and the
> OutOfMemoryError occurs after about 2 minutes. It seems like loading the
> other chromosomes before this one somehow changes the behavior.
>
> > Also, are you able to parse that sequence when you don't put it into
> > BioSQL. It could be the parser not the BioSQL binding.
> >
> > - Mark
>
> That's a good idea, I will try this. I don't know if I will have time for
> this today, but I should be able to give an update next week.
>
>
> Florian
>
> > On Thu, Jul 16, 2009 at 11:38 PM, Florian Mittag
> >
> > wrote:
> > > Hi all!
> > >
> > > We try to load Genbank files into our bioseqdb database using BioJava.
> > > I copy-pasted the code together from tutorials and previous posts on
> > > this mailing list. My problems:
> > >
> > > 1) It eats huge amounts of memory, so that I needed to increase the
> > > heap size to 2GB.
> > >
> > > 2) Loading the first two files works great, but the third one ran for
> > > one to two hours without completion. Here is my code:
> > >
> > > --- snip ---
> > > // loop over all downloaded *.gbk files starting with the highest number
> > > System.out.println("Updating chromosome " + chrNo[j] + " ...");
> > >
> > > BufferedReader fileIn = new BufferedReader(new FileReader(localFile));
> > >
> > > tx = session.beginTransaction();
> > > GenbankFormat gf = new GenbankFormat();
> > > SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
> > > RichSequence seq = null;
> > >
> > > gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
> > > seq = listener.makeRichSequence();
> > >
> > > if( seq != null ) {
> > >     // check, if a sequence with this identifier is already in the DB
> > >     Query q = session.createQuery(
> > >         "select be from BioEntry as be where identifier=:identifier");
> > >     q.setString("identifier",seq.getIdentifier());
> > >     List entries = q.list();
> > >     for( Object o : entries ) {
> > >         // delete the old sequence in the DB
> > >         BioEntry oldSeq = (BioEntry)o;
> > >         session.delete("BioEntry", oldSeq);
> > >     }
> > >     tx.commit();
> > >
> > >     tx = session.beginTransaction();
> > >     session.save("Sequence", seq);
> > >
> > >     System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
> > > } else {
> > >     System.out.println("Chromosome " + chrNo[j] + " was NOT updated.\n");
> > > }
> > >
> > > tx.commit();
> > > --- snap ---
> > >
> > >
> > > This is the generated output:
> > > --- snip ---
> > > Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
> > > Updating chromosome 001807 ...
> > > Chromosome 001807 was updated.
> > > Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
> > > Updating chromosome 000024 ...
> > > Chromosome 000024 was updated.
> > > Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
> > > Updating chromosome 000023 ...
> > > --- snap ---
> > >
> > >
> > > The files for this are downloaded from Genbank and the file sizes are:
> > > NC_001807.gbk   58.4 KB
> > > NC_000024.gbk   70.8 MB
> > > NC_000023.gbk   190.1 MB
> > >
> > > So, I don't see why loading a 70.8 MB file took less than 2 minutes
> > > and a 190.1 MB file isn't completed after 2 hours. But during this
> > > time, the CPU load is almost 100% and there is no significant network
> > > or harddisk activity.
> > >
> > > When I paused the program (I'm using Eclipse) and looked where the
> > > whole processing power is going to, I ended up with the following
> > > stacktrace (sorry for the unreadable format):
> > >
> > > CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
> > > AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolList) line: 1460
> > > SimpleSymbolList(AbstractSymbolList).seqString() line: 102
> > > BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequence) line: 115
> > > BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
> > > SimpleRichSequence(ThinRichSequence).seqString() line: 203
> > > SimpleRichSequence.getStringSequence() line: 77
> > > GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
> > > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > > Method.invoke(Object, Object...) line: 597
> > > BasicPropertyAccessor$BasicGetter.get(Object) line: 145
> > > PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object) line: 249
> > > PojoEntityTuplizer.getPropertyValues(Object) line: 244
> > > JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(Object, EntityMode) line: 3567
> > > DefaultFlushEntityEventListener.getValues(Object, EntityEntry, EntityMode, boolean, SessionImplementor) line: 167
> > > DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
> > > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntities(FlushEvent) line: 196
> > > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEverythingToExecutions(FlushEvent) line: 76
> > > DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
> > > SessionImpl.autoFlushIfRequired(Set) line: 970
> > > SessionImpl.list(String, QueryParameters) line: 1115
> > > QueryImpl.list() line: 79
> > > QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
> > > GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
> > > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > > Method.invoke(Object, Object...) line: 597
> > > BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
> > > RichObjectFactory.getObject(Class, Object[]) line: 107
> > > GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization, RichSeqIOListener, Namespace) line: 450
> > > UpdateDB_Main.updateChromosome() line: 542
> > >
> > >
> > > Now we go to GenbankFormat.readRichSequence(). It hangs at about line
> > > 450, the line where it loads a CrossRef object, so I added debug
> > > output:
> > >
> > > --- snip ---
> > > // parameter on old feature
> > > if (key.equals("db_xref")) {
> > >     Matcher m = dbxp.matcher(val);
> > >     if (m.matches()) {
> > >         String dbname = m.group(1);
> > >         String raccession = m.group(2);
> > >         if (dbname.equalsIgnoreCase("taxon")) {
> > >             [...]
> > >         } else {
> > >             try {
> > >                 long starttime = System.currentTimeMillis();
> > >                 CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]{dbname, raccession, new Integer(0)});
> > >                 long duration = System.currentTimeMillis() - starttime;
> > >                 if( duration > 100 ) {
> > >                     System.out.println("dbname: " + dbname + ", raccession: " + raccession);
> > >                     System.out.println("  took " + duration + "ms");
> > >                 }
> > >                 RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
> > >                 rlistener.getCurrentFeature().addRankedCrossRef(rcr);
> > > --- snap ---
> > >
> > > Which leads to:
> > >
> > > --- snip ---
> > > dbname: GeneID, raccession: 677739
> > >  took 3291ms
> > > dbname: HGNC, raccession: 31847
> > >  took 2427ms
> > > dbname: GeneID, raccession: 55344
> > >  took 2932ms
> > > dbname: HGNC, raccession: 23148
> > >  took 2339ms
> > > dbname: GI, raccession: 94158612
> > >  took 2418ms
> > > dbname: GI, raccession: 8922995
> > >  took 2920ms
> > > [...]
> > > --- snap ---
> > >
> > > Which are all /db_xref properties of the NC_000023.gbk file. Searching
> > > deeper, it looks like for every CrossRef object loaded, the whole
> > > BioEntry object is built and the sequence parsed. But remember, this
> > > only happens on chromosome 23, not on 24, which has /db_xref, too.
> > >
> > > I already spent some time on this, but I can't figure out what could
> > > be the cause.
> > >
> > >
> > > Thanks
> > >   Florian
> > > _______________________________________________
> > > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l

From florian.mittag at uni-tuebingen.de Fri Jul 24 13:29:08 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Fri, 24 Jul 2009 19:29:08 +0200
Subject: [Biojava-l] How to parse large Genbank files?
Message-ID: <200907241929.08768.florian.mittag@uni-tuebingen.de>

Hi!

I think this is a problem worthy of its own thread, so I'll start one:

I want to store all human chromosomes in a BioSQL database after loading the information from .gbk files.
The files come from NCBI via the following URIs, where the id ranges from nc_000001 to nc_000024 plus nc_001804:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text

I then try to parse the files as described in
http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
but it won't work. While there are no problems parsing 1804 and 24, chromosome 23 leads to an OutOfMemoryError although I gave it 2GB of heap space.

Here is a stack trace (the line numbers might differ, because I already tried to improve the memory efficiency of GenbankFormat.java):

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
        at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
        at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
        at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
        at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)

The line in GenbankFormat.java is:

rlistener.addSymbols(
        symParser.getAlphabet(),
        (Symbol[])(sl.toList().toArray(new Symbol[0])),
        0, sl.length());

Sometimes it fails at the sl.toList().toArray() part, sometimes it fails later inside the addSymbols method, but it always fails.

How can this be? I mean, the file is only 190MB in size, so 2GB of memory should be more than enough.
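
A rough back-of-envelope estimate may help explain how a 190MB file could exhaust a 2GB heap. The sketch below is not BioJava code. The class name HeapEstimate, the assumed sequence length of 150 million bases, the 4-byte reference size and the number of intermediate copies alive at once are all illustrative assumptions, not measurements from this thread. The only fixed facts used are that a Java char occupies 2 bytes and that an object reference typically occupies 4 or 8 bytes depending on the JVM.

```java
/** Back-of-envelope heap estimate; every input in main() is an assumption. */
final class HeapEstimate {

    /** Bytes used by the characters of one in-memory text copy (2 bytes per Java char). */
    static long textCopyBytes(long bases) {
        return bases * 2L;
    }

    /** Bytes used by one array or list that holds one object reference per base. */
    static long referenceCopyBytes(long bases, int bytesPerReference) {
        return bases * bytesPerReference;
    }

    /** Total bytes if all text copies and reference copies are alive at the same time. */
    static long totalBytes(long bases, int textCopies, int referenceCopies, int bytesPerReference) {
        return textCopies * textCopyBytes(bases)
                + referenceCopies * referenceCopyBytes(bases, bytesPerReference);
    }

    public static void main(String[] args) {
        long bases = 150_000_000L;  // assumed sequence length, not read from the file
        int textCopies = 2;         // assumed: read buffer plus cleaned-up string
        int referenceCopies = 3;    // assumed: symbol list, list copy, Symbol[] array
        int bytesPerReference = 4;  // assumed: 32-bit or compressed references

        long total = totalBytes(bases, textCopies, referenceCopies, bytesPerReference);
        System.out.println("Estimated live bytes: " + total);
        System.out.println("Estimated live MiB:   " + total / (1024 * 1024));
    }
}
```

With 8-byte references the same assumptions give a larger figure still, so if several full copies really are alive at once, hitting a 2GB limit is plausible. A profiler run is needed to confirm which copies actually exist.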
Browsing through the source code, I discovered what I think is very inefficient handling of sequences:

1) the sequence string is read from file into a StringBuffer
2) it is converted to a string (with whitespace removed)
3) a SimpleSymbolList is created out of the string
4) the SymbolList is converted to a List of Symbols
5) the List is converted to an array of Symbols
6) the array is passed to addSymbols
7) there it is added to a ChunkedSymbolListFactory
8) if at some point the sequence is requested, a SymbolList is created and then converted to a string.

You see, there is a lot of copying and converting, but in the end I have the same string I started with. Or rather, I would have it if the process ever reached the end, because it crashes before completing.

Am I doing something wrong, or is there great potential for improving the parsing of Genbank files?

Regards,
Florian

From markjschreiber at gmail.com Fri Jul 24 22:20:14 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 25 Jul 2009 10:20:14 +0800
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To: <200907241929.08768.florian.mittag@uni-tuebingen.de>
References: <200907241929.08768.florian.mittag@uni-tuebingen.de>
Message-ID: <93b45ca50907241920r60c28931p1b43bf6b6a101b46@mail.gmail.com>

Hi -

I don't think anyone has done much, if anything, to optimize these parsers. The process you outline sounds extremely inefficient. It is also likely to lead to memory leaks due to the number of copy operations.

As always with Java, don't try to optimize without a profiler, which will tell you which methods are taking a long time and which objects take the most memory.

- Mark

From andreas at sdsc.edu Sun Jul 26 23:56:50 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 26 Jul 2009 20:56:50 -0700
Subject: [Biojava-l] Retrieving AUTHOR information from PDB file
In-Reply-To:
References: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com>
Message-ID: <59a41c430907262056o1491175an22c4a82ef0d35d11@mail.gmail.com>

Hi Andrew,

the PDBHeader class now contains a field for the authors listed in the AUTHOR records. This works for PDB and mmCIF files (where the corresponding field is audit_author). Available from SVN trunk.

Andreas

On Thu, Jul 9, 2009 at 1:38 AM, Andrew Clegg wrote:
> 2009/7/9 Andreas Prlic :
>> Hi Andrew,
>>
>> The PdbFileParser at the present does not process the AUTHOR lines, yet, but
>> should be easy to add.. If you need this urgently, you could quickly patch
>> the PdbFileParser, otherwise I'll add it to SVN in the next couple of
>> days...
>
> No, it's not urgent, I can wait til you have a chance to do it rather
> than trying to figure it out myself :-)
>
> Could you let me know when you've added it in though please?
>
> Many thanks!
>
> Andrew.
>

From florian.mittag at uni-tuebingen.de Mon Jul 27 08:16:33 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Mon, 27 Jul 2009 14:16:33 +0200
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To: <93b45ca50907241920r60c28931p1b43bf6b6a101b46@mail.gmail.com>
References: <200907241929.08768.florian.mittag@uni-tuebingen.de> <93b45ca50907241920r60c28931p1b43bf6b6a101b46@mail.gmail.com>
Message-ID: <200907271416.33485.florian.mittag@uni-tuebingen.de>

Hi Mark!

On Saturday, 25. July 2009 04:20, Mark Schreiber wrote:
> I don't think anyone has done much or anything to optimize these parsers.
> The process you outline sounds extremely inefficient. It is also likely to
> lead to memory leaks due to the number of copy operations.

I wouldn't necessarily say that it leads to memory leaks, but it definitely leads to high memory consumption (2GB is not enough for a 200MB file). Also, my outline of the process is based on only 2 hours of reading the code, so I actually expected to be corrected on this. Unfortunately, it seems I did get the right idea and it IS extremely inefficient.

I mean, I understand that this is a high level of abstraction that might come in handy in many situations, but it certainly is more of an obstacle in my specific case.

> As always with java, don't try and optimize without a profiler which will
> tell you which methods are taking a long time and which objects take the
> most memory.

I think we should continue this discussion on the biojava-dev list or in a private conversation, as it will probably get very detailed and technical.

My question to this list again:
Is there a way to achieve my goal of parsing a 200MB Genbank file with the current BioJava version without code changes?

- Florian

--
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091

From markjschreiber at gmail.com Mon Jul 27 23:05:55 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 28 Jul 2009 11:05:55 +0800
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To: <200907271416.33485.florian.mittag@uni-tuebingen.de>
References: <200907241929.08768.florian.mittag@uni-tuebingen.de> <93b45ca50907241920r60c28931p1b43bf6b6a101b46@mail.gmail.com> <200907271416.33485.florian.mittag@uni-tuebingen.de>
Message-ID: <93b45ca50907272005r7c98c02ycb14e5b000d1aff0@mail.gmail.com>

Hi -

While you maybe can't do it without code changes, you can probably do it within the existing framework.
If you look at the readGenbank() code in RichSequence.IOTools you will find that BioJava file parsing consists of many pluggable components, all of which are defined by interfaces. Anything that implements one of those interfaces can be plugged into the parsing framework. So if you want, you can change the Format object to one of your own design (which implements Format); you can also change the event listeners and the SequenceBuilders. In your case the SequenceBuilder might be something to look at: it sounds like you don't need to create all the extra Sequence objects for every feature, so you could modify that part.

Also, in the Format objects there are often methods called elideXXX() which let you tell the Format object to skip over bits that you don't want.

Finally, I suspect the problem with memory use is that the String, char[], SymbolList, Sequence copying is both inefficient and, worse still, probably not releasing resources in a timely fashion. E.g. once the parser framework converts a char[] to a SymbolList, it probably no longer needs that char[] reference and might be able to null it. Then when memory gets low the GC can clean out all the cruft.

If I have a chance I will run a profiler to see what is sucking up the memory (and what can be released), and also see if all that copying is making a significant impact on CPU cycles (if not, it's probably more effort than it's worth to change). The memory thing definitely needs to change though.

- Mark

From sheoran143 at gmail.com Mon Jul 27 23:50:34 2009
From: sheoran143 at gmail.com (Deepak sheoran)
Date: Mon, 27 Jul 2009 22:50:34 -0500
Subject: [Biojava-l] Issues with taxon table not having taxon_id and ncbi_id equal under biosql schema
Message-ID: <4A6E758A.2030200@gmail.com>

Hi,

I am new to BioJava and have built a few applications with it. I am trying to build a database that keeps itself up to date with NCBI and lags behind by at most a day. I have almost succeeded, but one problem remains.

When I run my Genbank loader program to upload a Genbank file into a BioSQL database that has an updated taxon table, and the file contains a record whose ncbi_taxon_id is not in the taxon table (because that taxon has been replaced by another one in an NCBI update), BioJava inserts a new record into the taxon table. In that record ncbi_taxon_id is taken from the dbxref field in the file, but taxon_id is a seemingly random value, so the two are not equal (taxon_id != ncbi_taxon_id).

Is there any way to force Hibernate or RichSequence to insert the record so that taxon_id and ncbi_taxon_id are equal in the table?

Thanks,
Deepak Sheoran
North Dakota State University (Student)

From florian.mittag at uni-tuebingen.de Tue Jul 28 08:14:54 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Tue, 28 Jul 2009 14:14:54 +0200
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To: <93b45ca50907272005r7c98c02ycb14e5b000d1aff0@mail.gmail.com>
References: <200907241929.08768.florian.mittag@uni-tuebingen.de> <200907271416.33485.florian.mittag@uni-tuebingen.de> <93b45ca50907272005r7c98c02ycb14e5b000d1aff0@mail.gmail.com>
Message-ID: <200907281414.55156.florian.mittag@uni-tuebingen.de>

Hi!

On Tuesday, 28. July 2009 05:05, you wrote:
> While you maybe can't do it without code changes you can probably do
> it within the existing framework. If you look at the readGenbank()
> code in RichSequence.IOTools you will find that the BioJava file
> parsing consists of many pluggable components which are all defined by
> interfaces. Anything that implements one of those interfaces can be
> plugged into the parsing frame work. So if you want you can change
> the Format object to one of your custom design (which implements
> Format), you can also change the event listeners and the
> SequenceBuilders. In your case the SequenceBuilder might be something
> to look at, it sounds like you don't need to create all the extra
> Sequence objects for every feature so you could modify that part.

Yeah, I see what you mean. I wanted to start with something simple because I didn't want to code everything myself, but it seems I won't get around it if I want to optimize it.

> Also, in the Format objects there are often methods called elideXXX()
> which let you tell the Format object to skip over bits that you don't
> want.

I think I want everything, since I want to store everything in the BioSQL db afterwards. I don't think I can skip anything.

> Finally, I suspect the problem with memory use is that the String,
> char[], SymbolList, Sequence copying is both inefficient and worse
> still is probably not releasing resources in a timely fashion. Eg once
> the parser framework converts a char[] to a SymbolList is probably no
> longer needs that char[] reference and might be able to null it. Then
> when memory gets low the GC can clean out all the cruft.
>
> If I have a chance I will run a profiler to see what is sucking up the
> memory (and what can be released) and also see if all that copying is
> making a significant impact on CPU cycles (if not it's probably more
> effort than it's worth to change). The memory thing definitely needs
> to change though.

It turned out our workgroup has a floating JProfiler license, so I did some tests and got some clues on where to optimize further. The NetBeans profiler reported that most of the memory was consumed by char[], but it only showed about 130MB of usage, contrary to the 2GB of heap being full. So our idea was that maybe the memory management overhead was another sink where all the memory vanished. JProfiler then returned more plausible results, with nearly 800MB used by char[].

I tweaked both the way I call the parser and the GenbankFormat itself, and now all files except chromosome 1 (300MB) parse successfully. To reduce the memory for SymbolLists, I did:

PackedSymbolListFactory pslf = new PackedSymbolListFactory();
SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder(pslf);
GenbankFormat gf = new GenbankFormat();
gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
RichSequence seq = listener.makeRichSequence();

The PackedSymbolListFactory seemed to help save some memory, but it still wasn't enough. I then modified the readSection() method of GenbankFormat. What it usually does is put each single line of nucleotide sequence into a String[], which it then puts into the ArrayList returned by the method. Since there are 60 nucleotides (so 60 bytes + whitespace) per line, this was a big array. I modified it to build one large string containing only the nucleotide characters, instead of returning the array and then having the readRichSequence() method build this large String.
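
The readSection() change described above might be sketched roughly as follows. This is not the actual BioJava code or the actual patch: the class SequenceLineJoiner and its method joinBases are hypothetical names, and only the standard library is used. The sketch assumes the usual GenBank sequence layout, in which each line starts with a position number followed by whitespace-separated blocks of bases, and the record ends with a line starting with "//".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical helper: joins GenBank sequence lines into one string of bases. */
final class SequenceLineJoiner {

    /**
     * Reads lines until end of input or a line starting with "//" and keeps
     * only the letters, dropping position numbers and whitespace.
     * expectedLength is just a capacity hint to avoid repeated buffer growth.
     */
    static String joinBases(BufferedReader in, int expectedLength) throws IOException {
        StringBuilder bases = new StringBuilder(expectedLength);
        String line;
        while ((line = in.readLine()) != null && !line.startsWith("//")) {
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (Character.isLetter(c)) {
                    bases.append(c);
                }
            }
        }
        // toString() still makes one final copy of the characters.
        return bases.toString();
    }

    public static void main(String[] args) throws IOException {
        String lines = "        1 gatcctccat atacaacggt\n"
                + "       21 atctccacct\n"
                + "//\n";
        String bases = joinBases(new BufferedReader(new StringReader(lines)), 30);
        System.out.println(bases.length() + " bases read");
    }
}
```

Appending straight into one pre-sized buffer avoids holding a separate String[] per 60-base line, which is the per-line overhead the paragraph above is trying to remove.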

This all still isn't enough: the program exits at sl.toArray(). So I agree with Richard here to keep the sequence as a String (maybe using the Symbol(List) mechanisms to check for invalid characters) and only convert it to Symbol objects if really necessary.

Btw: Should we move this to Biojava-dev?
And where do I sign up for BioJava3 development? ;-)

- Florian

--
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091

From holland at eaglegenomics.com Tue Jul 28 08:52:00 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Tue, 28 Jul 2009 13:52:00 +0100
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To: <200907281414.55156.florian.mittag@uni-tuebingen.de>
References: <200907241929.08768.florian.mittag@uni-tuebingen.de>
 <200907271416.33485.florian.mittag@uni-tuebingen.de>
 <93b45ca50907272005r7c98c02ycb14e5b000d1aff0@mail.gmail.com>
 <200907281414.55156.florian.mittag@uni-tuebingen.de>
Message-ID:

> Btw: Should we move this to Biojava-dev?

Probably, yes! :)

> And where do I sign up for BioJava3 development? ;-)

Andreas Prlic has the keys to the project these days. BJ3 does already
have some new code in place for handling sequences as strings, but it's
in an out-of-the-way bit of the repository and is not part of the main
roadmap for the project at present. The current focus is on modularising
the existing bits, so that individual components can be refactored to
behave better at a future date.

If you want to explore my ideas for a replacement Sequence model, the
code and docs are here (sequence handling is in the 'core' module with
DNA-specifics in the 'dna' module):

http://biojava.org/wiki/BioJava3:HowTo
http://www.biojava.org/wiki/BioJava3_project

(Methods such as file parsers would request Strings (or ideally
CharSequence, which is more flexible and which String implements) as
parameters whenever they don't care about content. If they care about
content but don't care in advance about size or random access, they
should request an Iterator, which can wrap a String and parse on demand.
If they need full functionality, they should request a List; the default
implementation uses ArrayLists, but there's no reason a String-backed
one couldn't be written as well.)

cheers,
Richard

>
> - Florian
>
>> On Mon, Jul 27, 2009 at 8:16 PM, Florian Mittag wrote:
>>> Hi Mark!
>>>
>>> On Saturday, 25. July 2009 04:20, Mark Schreiber wrote:
>>>> I don't think anyone has done much or anything to optimize these
>>>> parsers. The process you outline sounds extremely inefficient.
It >>>> is >>>> also likely to lead to memory leaks due to the number of copy >>>> operations. >>> >>> I wouldn't necessarily say that it leads to memory leaks, but it >>> definitively leads to a high memory consumption (2GB are not >>> enough for a >>> 200MB file). Also, my outline of the process is based on only 2 >>> hours of >>> viewing the code, so actually I expected to be corrected on this. >>> Unfortunately, it seems like I did get the right idea and it IS >>> extremely >>> inefficient. >>> >>> I mean, I understand that this is a high level of abstraction that >>> might >>> come in handy in many situations, but it certainly is more of an >>> obstacle >>> in my specific case. >>> >>>> As always with java, don't try and optimize without a profiler >>>> which >>>> will tell you which methods are taking a long time and which >>>> objects >>>> take the most memory. >>> >>> I think we should continue this discussion on the biojava-dev list >>> or in >>> a private conversation, as it will probably get very detailed and >>> technical. >>> >>> >>> My question to this list again: >>> Is there a way to achieve my goal of parsing a 200MB Genbank file >>> with >>> the current biojava version without code changes? >>> >>> >>> - Florian >>> >>>> On 25 Jul 2009, 1:33 AM, "Florian Mittag" >>>> wrote: >>>> >>>> Hi! >>>> >>>> I think this is a problem worth of its own thread, so I'll start >>>> one: >>>> >>>> I want to store all human chromosomes in a BioSQL database after I >>>> loaded the >>>> information from .gbk files. The files I get from NCBI with the >>>> following URIs, where the id ranges from nc_000001 to nc_000024 >>>> plus >>>> nc_001804: >>>> >>>> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=n >>>> c_0 00023&rettype=gbwithparts&retmode=text >>>> >>>> I then try to parse the files as described in >>>> http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting >>>> _fi les but it wont work. 
While there are no problems parsing >>>> 1804 and >>>> 24, chromosome >>>> 23 leads to a OutOfMemory exception although I gave it 2GB of heap >>>> space. >>>> >>>> Here is a stack trace (the line numbers might differ, because I >>>> already >>>> tried >>>> to improve GenbankFormat.java in memory efficiency): >>>> >>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap >>>> space >>>> at >>>> org >>>> .biojava >>>> .bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbol >>>> Lis tFactory.java:222) at >>>> org >>>> .biojavax >>>> .bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichS >>>> equ enceBuilder.java:256) at >>>> org >>>> .biojavax >>>> .bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.jav >>>> a:5 35) at >>>> org >>>> .biojavax >>>> .bio.seq.io.RichStreamReader.nextRichSequence(RichStreamRead >>>> er. java:110) at >>>> org >>>> .prodge >>>> .sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Ma >>>> in. java:537) at >>>> org >>>> .prodge >>>> .sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java >>>> :46 8) at >>>> org >>>> .prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java: >>>> 164) >>>> >>>> The line in GenbankFormat.java is: >>>> >>>> rlistener.addSymbols( >>>> symParser.getAlphabet(), >>>> (Symbol[])(sl.toList().toArray(new Symbol[0])), >>>> 0, sl.length()); >>>> >>>> Sometimes it fails at the sl.toList().toArray()-part, sometimes >>>> it fails >>>> later >>>> inside the addSymbols method, but it always fails. >>>> >>>> How can this be? I mean, the file is only 190MB in size, so 2GB of >>>> memory should be more than enough. 
Browsing through the source >>>> code, I >>>> discovered what I think of as very inefficient handling of >>>> sequences: >>>> >>>> 1) the sequence string is read from file into a StringBuffer >>>> 2) it is converted to a string (with whitespaces removed) >>>> 3) a SimpleSymbolList is created out of the string >>>> 4) the SymbolList is converted to a List of Symbols >>>> 5) the List is converted to an array of Symbols >>>> 6) the array is passed to addSymbols >>>> 7) there it is added to a ChunkedSymbolListFactory >>>> 8) if at some point the sequence is requested, a SymbolList is >>>> created >>>> and then converted to a string. >>>> >>>> You see, there is a lot of copying and converting, but in the end >>>> I have >>>> the same string I started with. Well, I had the string, if it ever >>>> reached the end, because it will crash before completing this >>>> process. >>>> >>>> >>>> Am I doing something wrong or is there a great potential of >>>> improving >>>> parsing >>>> of Genbank files? >>>> >>>> >>>> Regards, >>>> Florian >>>> _______________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>> >>> -- >>> Dipl. Inf. Florian Mittag >>> Universit?t Tuebingen >>> WSI-RA, Sand 1 >>> 72076 Tuebingen, Germany >>> Phone: +49 7071 / 29 78985 Fax: +49 7071 / 29 5091 > > -- > Dipl. Inf. Florian Mittag > Universit?t Tuebingen > WSI-RA, Sand 1 > 72076 Tuebingen, Germany > Phone: +49 7071 / 29 78985 Fax: +49 7071 / 29 5091 From andreas at sdsc.edu Tue Jul 28 13:28:30 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 28 Jul 2009 10:28:30 -0700 Subject: [Biojava-l] How to parse large Genbank files? 
In-Reply-To:
References: <200907241929.08768.florian.mittag@uni-tuebingen.de>
 <200907271416.33485.florian.mittag@uni-tuebingen.de>
 <93b45ca50907272005r7c98c02ycb14e5b000d1aff0@mail.gmail.com>
 <200907281414.55156.florian.mittag@uni-tuebingen.de>
Message-ID: <59a41c430907281028j5bc42c26p86fe64dd0dad14bb@mail.gmail.com>

>> And where do I sign up for BioJava3 development? ;-)

I think we still need a module-lead for the biojava-biosql module...
you are welcome to volunteer!

> Andreas Prlic has the keys to the project these days. BJ3 does already have
> some new code in place for handling sequences as strings but it's in an
> out-of-the-way bit of the repository and is not part of the main roadmap for
> the project at present. The current focus is on modularising the existing
> bits, so that individual components can be refactored to behave better at a
> future date.

Just to add to this: I think the BJ3 and sequence related work would fit
nicely as a new biojava-sequence module...

Andreas

From bernd.jagla at pasteur.fr  Wed Jul 29 09:16:04 2009
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Wed, 29 Jul 2009 15:16:04 +0200
Subject: [Biojava-l] blast parsing question
Message-ID: <3DD69071F4A8490D9D1D7EEB172FEFC0@zillumina>

Hi,

I am new to BioJava. I want to test what is going on here in order to
potentially integrate it with KNIME.

My first project is parsing BLAST output for large files. The example in
the cookbook is very good and I had no problems integrating everything in
Eclipse and getting it to work.

Now here is my problem: I am interested in parsing the summary table at
the beginning of the BLAST output, and I haven't found a way to get at
this information.

I am blasting short sequences (20nt - 300nt) against genomic databases
(mouse/human/refseq/miRBase). I want to know if a given sequence (out of
a set of sequences) aligns to a specific genome with high identity.
I then want to split the input FASTA file into a set that aligns to the
genome and a set that doesn't (and potentially a third list of dubious
sequences where there is no clear answer). For this I only need the
length of the query sequence, the score, and the first few characters of
the header line. At least that's the way I am currently doing it. I have
set the BLAST parameters to give me only the first alignment, but the
first 50 or so hits in the summary.

Any help or comments are appreciated.

Thanks,

Bernd

Bernd Jagla
Bioinformatician
Institut Pasteur
Plate-forme puces à ADN
Genopole / Institut Pasteur
28 rue du Docteur Roux
75724 Paris Cedex 15
France
bernd.jagla at pasteur.fr
tel: +33 (0) 140 61 35 13

From andreas at sdsc.edu  Wed Jul 29 15:00:25 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 29 Jul 2009 12:00:25 -0700
Subject: [Biojava-l] FASTQ in BioJava, BioRuby (and Biopython, BioPerl & EMBOSS)
In-Reply-To: <320fb6e00907290310k16b78e72iae34f01de680ca76@mail.gmail.com>
References: <320fb6e00907290310k16b78e72iae34f01de680ca76@mail.gmail.com>
Message-ID: <59a41c430907291200o3ace6faj4b3455b8f3237cad@mail.gmail.com>

Hi Peter,

I would be happy to have support for the FASTQ file format in BioJava.
We have had an increased number of requests for parsing the output of
sequencing machines in the last few weeks, but nobody has stepped up to
be module-lead for this yet. I am currently not working with sequencers
myself, so I can't really provide support for this. I am happy to help
anybody who wants to be module-lead for this to get it going.

Andreas

On Wed, Jul 29, 2009 at 3:10 AM, Peter Cock wrote:
> Dear Andreas (& Richard) and Goto-san,
>
> Are BioJava or BioRuby interested in supporting the FASTQ file format
> used in next generation sequencing for storing sequencing reads with
> associated quality scores?
>
> I have been working on FASTQ support in Biopython, and coordinating
> with Peter Rice at EMBOSS and Chris Fields at BioPerl to ensure we
> are consistent on our interpretation of these files, interconversion, and
> naming. We'd like to get BioJava and BioRuby involved too.
>
> Please could you (or whomever at BioJava/BioRuby would be doing
> FASTQ code) please sign up to the cross-project OBF mailing list?
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>
> Thank you,
>
> Peter
> --
> Dr Peter Cock, Biopython project
>

From andreas at sdsc.edu  Wed Jul  8 06:20:59 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 7 Jul 2009 23:20:59 -0700
Subject: [Biojava-l] summary biojava user meeting
Message-ID: <59a41c430907072320k3d5a4415u962d59a10d286beb@mail.gmail.com>

Hi,

Here is a quick summary of the BioJava user meeting we had last week at
the BOSC conference:

The following people were present:

Mattias Piipari
Martijn Devisscher
Frederik Decouttere
Richard Holland
Andreas Prlic

The new modularized code base will allow individual people to take over
responsibility for some of the sub-modules and to contribute new modules,
both of which I welcome greatly. As such it was great to have Mattias,
Martijn and Frederik there and expressing their interest in this. Mattias
is interested in contributing a new module related to machine learning.
Martijn and Frederik are interested in providing a new GUI module
(seqpad). Because of this, our discussions were mainly about how to
organize the contribution of new modules and their maintenance:

* Before starting a new module the code should undergo public code review
* New modules need documentation (wiki cookbook) and junit tests.
* A Module Maintainer (MM) is the main person responsible for everything
  related to the module.
* MM coordinates patches and other user contributions for the module
* MM can write papers related to the code in the module without having to
  cite all of the other BioJava contributors.
* A MM volunteers to support the module for (at least) a year.
* All MMs will be listed by name on a wiki page in order to clarify
  responsibilities

Andreas

From andrew at nervechannel.com  Wed Jul  8 18:38:53 2009
From: andrew at nervechannel.com (Andrew Clegg)
Date: Wed, 8 Jul 2009 19:38:53 +0100
Subject: [Biojava-l] Retrieving AUTHOR information from PDB file
Message-ID:

Hopefully this isn't a FAQ but I couldn't find it via Google.

I'm parsing PDB files with PDBFileReader. I want to extract the AUTHOR
line(s) but I can't find a way to do this.

You can get the corresponding JRNL field with
Structure#getJournalArticle() but the authors of this aren't
necessarily the same as the authors of the structure itself.

Any ideas? I'm probably just being shortsighted and overlooking
something...

Thanks!

Andrew.

--
:: http://biotext.org.uk/ ::

From andreas at sdsc.edu  Thu Jul  9 04:21:49 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 8 Jul 2009 21:21:49 -0700
Subject: [Biojava-l] Retrieving AUTHOR information from PDB file
In-Reply-To:
References:
Message-ID: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com>

Hi Andrew,

The PdbFileParser does not process the AUTHOR lines at present, but this
should be easy to add. If you need this urgently, you could quickly patch
the PdbFileParser, otherwise I'll add it to SVN in the next couple of
days...

Andreas

On Wed, Jul 8, 2009 at 11:38 AM, Andrew Clegg wrote:
> Hopefully this isn't a FAQ but I couldn't find it via Google.
>
> I'm parsing PDB files with PDBFileReader. I want to extract the AUTHOR
> line(s) but I can't find a way to do this.
>
> You can get the corresponding JRNL field with
> Structure#getJournalArticle() but the authors of this aren't
> necessarily the same as the authors of the structure itself.
>
> Any ideas? I'm probably just being shortsighted and overlooking
> something...
>
> Thanks!
>
> Andrew.
> > -- > :: http://biotext.org.uk/ :: > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From andrew at nervechannel.com Thu Jul 9 08:38:08 2009 From: andrew at nervechannel.com (Andrew Clegg) Date: Thu, 9 Jul 2009 09:38:08 +0100 Subject: [Biojava-l] Retrieving AUTHOR information from PDB file In-Reply-To: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com> References: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com> Message-ID: 2009/7/9 Andreas Prlic : > Hi Andrew, > > The PdbFileParser at the present does not process the AUTHOR lines, yet, but > should be easy to add.. If you need this urgently, you could quickly patch > the PdbFileParser, otherwise I'll add it to SVN in the next couple of > days... No, it's not urgent, I can wait til you have a chance to do it rather than trying to figure it out myself :-) Could you let me know when you've added it in though please? Many thanks! Andrew. From andrew at nervechannel.com Thu Jul 9 11:12:16 2009 From: andrew at nervechannel.com (Andrew Clegg) Date: Thu, 9 Jul 2009 12:12:16 +0100 Subject: [Biojava-l] Retrieving AUTHOR information from PDB file In-Reply-To: References: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com> Message-ID: By the way -- does anyone have any documentation on which PDB fields correspond to which methods on the Structure and PDBHeader objects? In some cases, this is obvious, but in others it's not so clear -- for example PDBHeader#getDescription(), since there isn't a DESCRIPTION field in a PDB file. If not, I can start putting one together for the wiki, but I don't want to duplicate work... Andrew. 2009/7/9 Andrew Clegg : > 2009/7/9 Andreas Prlic : >> Hi Andrew, >> >> The PdbFileParser at the present does not process the AUTHOR lines, yet, but >> should be easy to add.. 
If you need this urgently, you could quickly patch
>> the PdbFileParser, otherwise I'll add it to SVN in the next couple of
>> days...
>
> No, it's not urgent, I can wait til you have a chance to do it rather
> than trying to figure it out myself :-)
>
> Could you let me know when you've added it in though please?
>
> Many thanks!
>
> Andrew.
>

--
:: http://biotext.org.uk/ ::

From paolo.pavan at gmail.com  Thu Jul  9 15:58:07 2009
From: paolo.pavan at gmail.com (Paolo Pavan)
Date: Thu, 9 Jul 2009 17:58:07 +0200
Subject: [Biojava-l] Assembly data reading
Message-ID: <56be91b60907090858t41f2c72cwf7db057e6390d6db@mail.gmail.com>

Hi everybody,

I'm fairly new to this topic and would like to know whether there is
anything that can help me load data from a large 454 contig into my Java
program. I also need to keep the individual reads that form the contig in
memory and access their data. If loading a *.gff data file is not
possible, loading a *.ace data file would be fine too.

Many thanks for any suggestion you can give me!

Greetings,
Paolo

From wzhao6898 at gmail.com  Thu Jul  9 17:16:37 2009
From: wzhao6898 at gmail.com (David Zhao)
Date: Thu, 9 Jul 2009 17:16:37 +0000 (UTC)
Subject: [Biojava-l] Pairwise alignment of protein sequences
Message-ID:

Hi there,

I'm new to biojava, and trying to generate a pairwise alignment of 2
protein sequences following the example here
(http://www.biojava.org/wiki/BioJava:CookBook:DP:PairWise2).
However, I got this error:

java.lang.StringIndexOutOfBoundsException: String index out of range: 0
    at java.lang.String.charAt(Unknown Source)
    at org.biojava.bio.alignment.SubstitutionMatrix.parseMatrix(SubstitutionMatrix.java:304)
    at org.biojava.bio.alignment.SubstitutionMatrix.<init>(SubstitutionMatrix.java:100)
    at com.activx.lims.util.ms.tests.TargetListGenerationUtilsTest.testBioJava(TargetListGenerationUtilsTest.java:94)
    ...

Here is my code:

import ...
FiniteAlphabet alphabet = (FiniteAlphabet) AlphabetManager.alphabetForName("PROTEIN");
File matrixFile = new File(FULL_PATH_NUC44);
SubstitutionMatrix matrix = new SubstitutionMatrix(alphabet, matrixFile);
SequenceAlignment aligner = new NeedlemanWunsch(
        new Short("0"), new Short("3"), new Short("2"), new Short("2"), new Short("1"),
        matrix);
Sequence query = ProteinTools.createProteinSequence(
        PeptidePeer.retrieveByPK(10126404).getSequence(), "query");
Sequence target = ProteinTools.createProteinSequence(
        PeptidePeer.retrieveByPK(10109235).getSequence(), "target");
// Perform an alignment and save the results.
aligner.pairwiseAlignment(
        query,  // first sequence
        target  // second one
);
// Print the alignment to the screen
System.out.println("Global alignment with Needleman-Wunsch:\n"
        + aligner.getAlignmentString());

Thanks in advance for any help!

David

From wzhao6898 at gmail.com  Thu Jul  9 18:13:45 2009
From: wzhao6898 at gmail.com (David Zhao)
Date: Thu, 9 Jul 2009 18:13:45 +0000 (UTC)
Subject: [Biojava-l] Pairwise alignment of protein sequences
References:
Message-ID:

David Zhao <wzhao6898 at gmail.com> writes:

I found the answer from a post here
(http://portal.open-bio.org/pipermail/biojava-l/2007-February.txt#).
In short, I need to use
AlphabetManager.alphabetForName("PROTEIN-TERM");
instead of
AlphabetManager.alphabetForName("PROTEIN");
to parse the matrix file.

Another question though: now that I have the alignment, how do I retrieve
score, mismatch and gap information from it?

Time (ms): 62
Length: 25
Score: 180
Query: query, Length: 25
Target: target, Length: 25

Query:  1 KLFVGGIKEDTEEHHLRDYFEEYGK 25
          | ||||||||||||||||||| |||
Target: 1 KIFVGGIKEDTEEHHLRDYFEQYGK 25

Thanks!
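
As an illustration of the question above, match, mismatch and gap counts can be recomputed from the two aligned rows that getAlignmentString() prints. The following is only a minimal sketch in plain Java, not BioJava API: the class name AlignmentStats and its methods are hypothetical, and it assumes the two rows have equal length and that '-' marks a gap.

```java
/**
 * Hypothetical helper (not part of BioJava): summarises two aligned rows
 * of equal length, assuming '-' is the gap character.
 */
final class AlignmentStats {
    final int matches;
    final int mismatches;
    final int gaps;

    private AlignmentStats(int matches, int mismatches, int gaps) {
        this.matches = matches;
        this.mismatches = mismatches;
        this.gaps = gaps;
    }

    /** Walks both rows column by column and classifies each column. */
    static AlignmentStats of(String queryRow, String targetRow) {
        if (queryRow.length() != targetRow.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        int matches = 0, mismatches = 0, gaps = 0;
        for (int i = 0; i < queryRow.length(); i++) {
            char q = queryRow.charAt(i);
            char t = targetRow.charAt(i);
            if (q == '-' || t == '-') {
                gaps++;          // a gap in either row
            } else if (q == t) {
                matches++;       // identical residues
            } else {
                mismatches++;    // substitution
            }
        }
        return new AlignmentStats(matches, mismatches, gaps);
    }

    /** Fraction of alignment columns that are identical residues. */
    double identity() {
        int columns = matches + mismatches + gaps;
        return columns == 0 ? 0.0 : (double) matches / columns;
    }

    public static void main(String[] args) {
        // The two rows from the alignment output quoted in the message.
        AlignmentStats s = AlignmentStats.of(
                "KLFVGGIKEDTEEHHLRDYFEEYGK",
                "KIFVGGIKEDTEEHHLRDYFEQYGK");
        System.out.println(s.matches + " matches, " + s.mismatches
                + " mismatches, " + s.gaps + " gap columns");
    }
}
```

This only recovers column statistics; the alignment score itself depends on the substitution matrix and gap penalties and is not recomputed here.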
David

From wzhao6898 at gmail.com  Thu Jul  9 20:22:47 2009
From: wzhao6898 at gmail.com (David Zhao)
Date: Thu, 9 Jul 2009 20:22:47 +0000 (UTC)
Subject: [Biojava-l] How to parse pairwise alignment result string
Message-ID:

Hi there,

I've successfully aligned 2 peptide sequences using the Needleman-Wunsch
algorithm:

// aligner is the SequenceAlignment instance
aligner.pairwiseAlignment(
        query,  // first sequence
        target  // second one
);

Now, the only output I can get is the string returned by
aligner.getAlignmentString(), shown below:

Time (ms): 63
Length: 25
Score: 180
Query: query, Length: 25
Target: target, Length: 25

Query:  1 KLFVGGIKEDTEEHHLRDYFEEYGK 25
          | ||||||||||||||||||| |||
Target: 1 KIFVGGIKEDTEEHHLRDYFEQYGK 25

How can I create a gapped alignment object in biojava from this, so that
I can retrieve score, gap information, etc. from the object?

Thanks in advance!

David

From andreas at sdsc.edu  Fri Jul 10 00:16:19 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 9 Jul 2009 17:16:19 -0700
Subject: [Biojava-l] Retrieving AUTHOR information from PDB file
In-Reply-To:
References: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com>
Message-ID: <59a41c430907091716m30251d7as30ca73b8c620bb61@mail.gmail.com>

Such documentation does not exist yet, so please go ahead and add
something to the wiki. Since the data model works with both PDB files and
MMCIF files, ideally the documentation would cover both ;-)

A

On Thu, Jul 9, 2009 at 4:12 AM, Andrew Clegg wrote:
> By the way -- does anyone have any documentation on which PDB fields
> correspond to which methods on the Structure and PDBHeader objects?
>
> In some cases, this is obvious, but in others it's not so clear -- for
> example PDBHeader#getDescription(), since there isn't a DESCRIPTION
> field in a PDB file.
>
> If not, I can start putting one together for the wiki, but I don't
> want to duplicate work...
>
> Andrew.
>
> 2009/7/9 Andrew Clegg :
>> 2009/7/9 Andreas Prlic :
>>> Hi Andrew,
>>>
>>> The PdbFileParser at the present does not process the AUTHOR lines, yet, but
>>> should be easy to add.. If you need this urgently, you could quickly patch
>>> the PdbFileParser, otherwise I'll add it to SVN in the next couple of
>>> days...
>>
>> No, it's not urgent, I can wait til you have a chance to do it rather
>> than trying to figure it out myself :-)
>>
>> Could you let me know when you've added it in though please?
>>
>> Many thanks!
>>
>> Andrew.
>>
>
>
>
> --
> :: http://biotext.org.uk/ ::
>

From nathan.genome at gmail.com  Thu Jul 16 15:14:09 2009
From: nathan.genome at gmail.com (nathan genome)
Date: Thu, 16 Jul 2009 17:14:09 +0200
Subject: [Biojava-l] visualizing pair alignment to a genome
Message-ID:

Hi,

I am working on genomic variations. As a first step I would like to
visualize mappings from clone-ends to a reference sequence. I have data
in the following tab-delimited format:

CAX100A01FOR1 1 748 scaffold_116 507411 508161 CAX100A01REV1 1 702 scaffold_116 512322 511611

The 1st & 7th cols. represent clone-ends, the numbers are positions, and
the 4th col. indicates the reference. How do I visualize this pair
alignment in a window with biojava?

thanks
Natthan

From florian.mittag at uni-tuebingen.de  Thu Jul 16 15:38:25 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Thu, 16 Jul 2009 17:38:25 +0200
Subject: [Biojava-l] Load Genbank files takes ages
Message-ID: <200907161738.29913.florian.mittag@uni-tuebingen.de>

Hi all!

We are trying to load Genbank files into our bioseqdb database using
BioJava. I copy-pasted the code together from tutorials and previous
posts on this mailing list. My problems:

1) It eats huge amounts of memory, so I needed to increase the heap size
to 2GB.

2) Loading the first two files works great, but the third one ran for one
to two hours without completing.
Here is my code:

--- snip ---
// loop over all downloaded *.gbk files starting with the highest number
System.out.println("Updating chromosome " + chrNo[j] + " ...");

BufferedReader fileIn = new BufferedReader(new FileReader(localFile));

tx = session.beginTransaction();
GenbankFormat gf = new GenbankFormat();
SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
RichSequence seq = null;

gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
seq = listener.makeRichSequence();

if( seq != null ) {
    // check, if a sequence with this identifier is already in the DB
    Query q = session.createQuery(
        "select be from BioEntry as be where identifier=:identifier");
    q.setString("identifier",seq.getIdentifier());
    List entries = q.list();
    for( Object o : entries ) {
        // delete the old sequence in the DB
        BioEntry oldSeq = (BioEntry)o;
        session.delete("BioEntry", oldSeq);
    }
    tx.commit();

    tx = session.beginTransaction();
    session.save("Sequence", seq);

    System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
} else {
    System.out.println("Chromosome " + chrNo[j] + " was NOT updated.\n");
}

tx.commit();
--- snap ---

This is the generated output:

--- snip ---
Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
Updating chromosome 001807 ...
Chromosome 001807 was updated.
Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
Updating chromosome 000024 ...
Chromosome 000024 was updated.
Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
Updating chromosome 000023 ...
--- snap ---

The files for this are downloaded from Genbank and the file sizes are:

NC_001807.gbk    58.4 KB
NC_000024.gbk    70.8 MB
NC_000023.gbk   190.1 MB

So, I don't see, why loading a 70.8 MB file took less than 2 minutes and a
190.1 MB file isn't completed after 2 hours. But during this time, the CPU
load is almost 100% and there is no significant network or harddisk activity.

When I paused the program (I'm using Eclipse) and looked, where the whole
processing power is going to, I ended up with the following stacktrace
(sorry for the unreadable format):

CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolList) line: 1460
SimpleSymbolList(AbstractSymbolList).seqString() line: 102
BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequence) line: 115
BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
SimpleRichSequence(ThinRichSequence).seqString() line: 203
SimpleRichSequence.getStringSequence() line: 77
GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
Method.invoke(Object, Object...) line: 597
BasicPropertyAccessor$BasicGetter.get(Object) line: 145
PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object) line: 249
PojoEntityTuplizer.getPropertyValues(Object) line: 244
JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(Object, EntityMode) line: 3567
DefaultFlushEntityEventListener.getValues(Object, EntityEntry, EntityMode, boolean, SessionImplementor) line: 167
DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntities(FlushEvent) line: 196
DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEverythingToExecutions(FlushEvent) line: 76
DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
SessionImpl.autoFlushIfRequired(Set) line: 970
SessionImpl.list(String, QueryParameters) line: 1115
QueryImpl.list() line: 79
QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
Method.invoke(Object, Object...) line: 597
BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
RichObjectFactory.getObject(Class, Object[]) line: 107
GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization, RichSeqIOListener, Namespace) line: 450
UpdateDB_Main.updateChromosome() line: 542

Now we go to GenbankFormat.readRichSequence(). It hangs at about line 450,
the line where it loads a CrossRef object, so I added debug output:

--- snip ---
// parameter on old feature
if (key.equals("db_xref")) {
    Matcher m = dbxp.matcher(val);
    if (m.matches()) {
        String dbname = m.group(1);
        String raccession = m.group(2);
        if (dbname.equalsIgnoreCase("taxon")) {
            [...]
        } else {
            try {
                long starttime = System.currentTimeMillis();
                CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,
                        new Object[] {dbname, raccession, new Integer(0)});
                long duration = System.currentTimeMillis() - starttime;
                if( duration > 100 ) {
                    System.out.println("dbname: " + dbname + ", raccession: " + raccession);
                    System.out.println("  took " + duration + "ms");
                }
                RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
                rlistener.getCurrentFeature().addRankedCrossRef(rcr);
--- snap ---

Which leads to:

--- snip ---
dbname: GeneID, raccession: 677739
  took 3291ms
dbname: HGNC, raccession: 31847
  took 2427ms
dbname: GeneID, raccession: 55344
  took 2932ms
dbname: HGNC, raccession: 23148
  took 2339ms
dbname: GI, raccession: 94158612
  took 2418ms
dbname: GI, raccession: 8922995
  took 2920ms
[...]
--- snap ---

Which are all /db_xref properties of the NC_000023.gbk file. Searching
deeper, it looks like for every CrossRef object loaded, the whole BioEntry
object is built and the sequence parsed. But remember, this only happens
on chromosome 23, not on 24, which has /db_xref, too.

I already spent some time on this, but I can't figure out, what could be
the cause.
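
The threshold-based timing used in the debug output above can be factored into a small reusable helper, so that the same check can be wrapped around other suspect calls. The following is a minimal sketch only: SlowCallLogger and timed are hypothetical names, it is not part of BioJava or Hibernate, and it assumes Java 8 or later for the lambda syntax.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of BioJava or Hibernate): runs a call and
 * prints a message only when it takes longer than a given threshold.
 */
final class SlowCallLogger {

    /** Returns the result of call; logs label and duration if it was slow. */
    static <T> T timed(String label, long thresholdMs, Supplier<T> call) {
        long start = System.nanoTime();   // monotonic clock, unaffected by wall-clock changes
        try {
            return call.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
            if (elapsedMs > thresholdMs) {
                System.out.println(label + " took " + elapsedMs + "ms");
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for an expensive lookup; the label mirrors the debug output format.
        String result = timed("dbname: GeneID, raccession: 677739", 100,
                () -> "stand-in result");
        System.out.println(result);
    }
}
```

Wrapping a lookup in such a helper leaves its return value unchanged, so it can be added and removed without touching the surrounding logic.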
Thanks
  Florian

From markjschreiber at gmail.com Fri Jul 17 01:33:58 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 17 Jul 2009 09:33:58 +0800
Subject: [Biojava-l] Load Genbank files takes ages
In-Reply-To: <200907161738.29913.florian.mittag@uni-tuebingen.de>
References: <200907161738.29913.florian.mittag@uni-tuebingen.de>
Message-ID: <93b45ca50907161833v1d6fed1fyd0030bf37271889d@mail.gmail.com>

I wonder if there is some kind of memory leak or infinite loop? Have
you considered running a profiler?

Also, are you able to parse that sequence when you don't put it into
BioSQL. It could be the parser not the BioSQL binding.

- Mark

On Thu, Jul 16, 2009 at 11:38 PM, Florian Mittag wrote:
>
> Hi all!
>
> We try to load Genbank files into our bioseqdb database using BioJava. I
> copy-pasted the code together from tutorials and previous posts on this
> mailinglist. My problems:
>
> 1) It eats huge amounts of memory, so that I needed to increase the heap size
> to 2GB.
>
> 2) Loading the first two files works great, but the third one ran for one two
> hours without completion. Here is my code:
>
> --- snip ---
> // loop over all downloaded *.gbk files starting with the highest number
> System.out.println("Updating chromosome " + chrNo[j] + " ...");
>
> BufferedReader fileIn = new BufferedReader(new FileReader(localFile));
>
> tx = session.beginTransaction();
> GenbankFormat gf = new GenbankFormat();
> SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
> RichSequence seq = null;
>
> gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
> seq = listener.makeRichSequence();
>
> if( seq != null ) {
>         // check, if a sequence with this identifier is already in the DB
>         Query q = session.createQuery(
>                 "select be from BioEntry as be where identifier=:identifier");
>         q.setString("identifier",seq.getIdentifier());
>         List entries = q.list();
>         for( Object o : entries ) {
>                 // delete the old sequence in the DB
>                 BioEntry oldSeq = (BioEntry)o;
>                 session.delete("BioEntry", oldSeq);
>         }
>         tx.commit();
>
>         tx = session.beginTransaction();
>         session.save("Sequence", seq);
>
>         System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
> } else {
>         System.out.println("Chromosome " + chrNo[j] + " was NOT updated.\n");
> }
>
> tx.commit();
> --- snap ---
>
>
> This is the generated output:
> ---snip ---
> Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
> Updating chromosome 001807 ...
> Chromosome 001807 was updated.
> Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
> Updating chromosome 000024 ...
> Chromosome 000024 was updated.
> Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
> Updating chromosome 000023 ...
> --- snap ---
>
>
> The files for this are downloaded from Genbank and the file sizes are:
> NC_001807.gbk   58.4 KB
> NC_000024.gbk   70.8 MB
> NC_000023.gbk   190.1 MB
>
> So, I don't see, why loading a 70.8 MB file took less than 2 minutes and a
> 190.1 MB file isn't completed after 2 hours. But during this time, the CPU
> load is almost 100% and there is no significant network or harddisk activity.
>
> When I paused the program (I'm using Eclipse) and looked, where the whole
> processing power is going to, I ended up with the following stacktrace (sorry
> for the unreadable format):
>
> CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
> AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolList) line: 1460
> SimpleSymbolList(AbstractSymbolList).seqString() line: 102
> BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequence) line: 115
> BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
> SimpleRichSequence(ThinRichSequence).seqString() line: 203
> SimpleRichSequence.getStringSequence() line: 77
> GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
> DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> Method.invoke(Object, Object...) line: 597
> BasicPropertyAccessor$BasicGetter.get(Object) line: 145
> PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object) line: 249
> PojoEntityTuplizer.getPropertyValues(Object) line: 244
> JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(Object, EntityMode) line: 3567
> DefaultFlushEntityEventListener.getValues(Object, EntityEntry, EntityMode, boolean, SessionImplementor) line: 167
> DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
> DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntities(FlushEvent) line: 196
> DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEverythingToExecutions(FlushEvent) line: 76
> DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
> SessionImpl.autoFlushIfRequired(Set) line: 970
> SessionImpl.list(String, QueryParameters) line: 1115
> QueryImpl.list() line: 79
> QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
> GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
> DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> Method.invoke(Object, Object...) line: 597
> BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
> RichObjectFactory.getObject(Class, Object[]) line: 107
> GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization, RichSeqIOListener, Namespace) line: 450
> UpdateDB_Main.updateChromosome() line: 542
>
>
> Now we go to GenbankFormat.readRichSequence(). It hangs at about line 450, the
> line where it loads a CrossRef object, so I added debug output:
>
> --- snip ---
> // parameter on old feature
> if (key.equals("db_xref")) {
>         Matcher m = dbxp.matcher(val);
>         if (m.matches()) {
>                 String dbname = m.group(1);
>                 String raccession = m.group(2);
>                 if (dbname.equalsIgnoreCase("taxon")) {
>                         [...]
>                 } else {
>                         try {
>                                 long starttime = System.currentTimeMillis();
>                                 CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[] {dbname, raccession, new Integer(0)});
>                                 long duration = System.currentTimeMillis() - starttime;
>                                 if( duration > 100 ) {
>                                         System.out.println("dbname: " + dbname + ", raccession: " + raccession);
>                                         System.out.println("  took " + duration + "ms");
>                                 }
>                                 RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
>                                 rlistener.getCurrentFeature().addRankedCrossRef(rcr);
> --- snap ---
>
> Which leads to:
>
> --- snip ---
> dbname: GeneID, raccession: 677739
>   took 3291ms
> dbname: HGNC, raccession: 31847
>   took 2427ms
> dbname: GeneID, raccession: 55344
>   took 2932ms
> dbname: HGNC, raccession: 23148
>   took 2339ms
> dbname: GI, raccession: 94158612
>   took 2418ms
> dbname: GI, raccession: 8922995
>   took 2920ms
> [...]
> --- snap ---
>
> Which are all /db_xref properties of the NC_000023.gbk file. Searching deeper,
> it looks like for every CrossRef object loaded, the whole BioEntry object is
> built and the sequence parsed. But remember, this only happens on chromosome
> 23, not on 24, which has /db_xref, too.
>
> I already spent some time on this, but I can't figure out, what could be the
> cause.
>
>
> Thanks
>   Florian
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From florian.mittag at uni-tuebingen.de Fri Jul 17 12:03:54 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Fri, 17 Jul 2009 14:03:54 +0200
Subject: [Biojava-l] Load Genbank files takes ages
In-Reply-To: <93b45ca50907161833v1d6fed1fyd0030bf37271889d@mail.gmail.com>
References: <200907161738.29913.florian.mittag@uni-tuebingen.de>
 <93b45ca50907161833v1d6fed1fyd0030bf37271889d@mail.gmail.com>
Message-ID: <200907171403.55631.florian.mittag@uni-tuebingen.de>

Thanks for the quick answer!

On Friday 17 July 2009 03:33, Mark Schreiber wrote:
> I wonder if there is some kind of memory leak or infinite loop?

As far as I can tell, there is no infinite loop, just a loop that takes very,
very long because for every parsed feature the whole RichSequence object is
reprocessed (especially the conversion of the sequence to characters), which
takes about 3 seconds.

> Have you considered running a profiler?

Yes, I have considered this, but the profilers I know for Eclipse are a pain
in the a** and don't work, so I will have to use NetBeans or something to do
profiling.
I noticed another funny thing: When I let our program skip the first two
chromosomes (1807 and 24), then this is the output:

Jul 17, 2009 1:50:36 PM - FINE: Starting update of chromosome 000023
dbname: GeneID, raccession: 100132775
  took 273ms
dbname: CCDS, raccession: CCDS35344.1
  took 452ms
dbname: GeneID, raccession: 644403
  took 283ms
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:216)
        at java.lang.StringBuffer.toString(StringBuffer.java:585)
        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:526)
        at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:542)
        at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:473)
        at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:169)

Harddisk activity is pretty high (probably reading the sequence) and the
OutOfMemoryError occurs after about 2 minutes. It seems like loading the other
chromosomes before this one somehow changes the behavior.

> Also, are you able to parse that sequence when you don't put it into
> BioSQL. It could be the parser not the BioSQL binding.
>
> - Mark

That's a good idea, I will try this. I don't know if I will have time for this
today, but I should be able to give an update next week.

Florian

> On Thu, Jul 16, 2009 at 11:38 PM, Florian Mittag
>
> wrote:
> > Hi all!
> >
> > We try to load Genbank files into our bioseqdb database using BioJava. I
> > copy-pasted the code together from tutorials and previous posts on this
> > mailinglist. My problems:
> >
> > 1) It eats huge amounts of memory, so that I needed to increase the heap
> > size to 2GB.
> >
> > 2) Loading the first two files works great, but the third one ran for one
> > two hours without completion. Here is my code:
> >
> > --- snip ---
> > // loop over all downloaded *.gbk files starting with the highest number
> > System.out.println("Updating chromosome " + chrNo[j] + " ...");
> >
> > BufferedReader fileIn = new BufferedReader(new FileReader(localFile));
> >
> > tx = session.beginTransaction();
> > GenbankFormat gf = new GenbankFormat();
> > SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
> > RichSequence seq = null;
> >
> > gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
> > seq = listener.makeRichSequence();
> >
> > if( seq != null ) {
> >         // check, if a sequence with this identifier is already in the DB
> >         Query q = session.createQuery(
> >                 "select be from BioEntry as be where identifier=:identifier");
> >         q.setString("identifier",seq.getIdentifier());
> >         List entries = q.list();
> >         for( Object o : entries ) {
> >                 // delete the old sequence in the DB
> >                 BioEntry oldSeq = (BioEntry)o;
> >                 session.delete("BioEntry", oldSeq);
> >         }
> >         tx.commit();
> >
> >         tx = session.beginTransaction();
> >         session.save("Sequence", seq);
> >
> >         System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
> > } else {
> >         System.out.println("Chromosome " + chrNo[j] + " was NOT updated.\n");
> > }
> >
> > tx.commit();
> > --- snap ---
> >
> >
> > This is the generated output:
> > ---snip ---
> > Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
> > Updating chromosome 001807 ...
> > Chromosome 001807 was updated.
> > Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
> > Updating chromosome 000024 ...
> > Chromosome 000024 was updated.
> > Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
> > Updating chromosome 000023 ...
> > --- snap ---
> >
> >
> > The files for this are downloaded from Genbank and the file sizes are:
> > NC_001807.gbk   58.4 KB
> > NC_000024.gbk   70.8 MB
> > NC_000023.gbk   190.1 MB
> >
> > So, I don't see, why loading a 70.8 MB file took less than 2 minutes and
> > a 190.1 MB file isn't completed after 2 hours. But during this time, the
> > CPU load is almost 100% and there is no significant network or harddisk
> > activity.
> >
> > When I paused the program (I'm using Eclipse) and looked, where the whole
> > processing power is going to, I ended up with the following stacktrace
> > (sorry for the unreadable format):
> >
> > CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
> > AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolList) line: 1460
> > SimpleSymbolList(AbstractSymbolList).seqString() line: 102
> > BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequence) line: 115
> > BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
> > SimpleRichSequence(ThinRichSequence).seqString() line: 203
> > SimpleRichSequence.getStringSequence() line: 77
> > GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
> > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > Method.invoke(Object, Object...) line: 597
> > BasicPropertyAccessor$BasicGetter.get(Object) line: 145
> > PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object) line: 249
> > PojoEntityTuplizer.getPropertyValues(Object) line: 244
> > JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(Object, EntityMode) line: 3567
> > DefaultFlushEntityEventListener.getValues(Object, EntityEntry, EntityMode, boolean, SessionImplementor) line: 167
> > DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
> > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntities(FlushEvent) line: 196
> > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEverythingToExecutions(FlushEvent) line: 76
> > DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
> > SessionImpl.autoFlushIfRequired(Set) line: 970
> > SessionImpl.list(String, QueryParameters) line: 1115
> > QueryImpl.list() line: 79
> > QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
> > GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
> > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > Method.invoke(Object, Object...) line: 597
> > BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
> > RichObjectFactory.getObject(Class, Object[]) line: 107
> > GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization, RichSeqIOListener, Namespace) line: 450
> > UpdateDB_Main.updateChromosome() line: 542
> >
> >
> > Now we go to GenbankFormat.readRichSequence(). It hangs at about line
> > 450, the line where it loads a CrossRef object, so I added debug output:
> >
> > --- snip ---
> > // parameter on old feature
> > if (key.equals("db_xref")) {
> >         Matcher m = dbxp.matcher(val);
> >         if (m.matches()) {
> >                 String dbname = m.group(1);
> >                 String raccession = m.group(2);
> >                 if (dbname.equalsIgnoreCase("taxon")) {
> >                         [...]
> >                 } else {
> >                         try {
> >                                 long starttime = System.currentTimeMillis();
> >                                 CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[] {dbname, raccession, new Integer(0)});
> >                                 long duration = System.currentTimeMillis() - starttime;
> >                                 if( duration > 100 ) {
> >                                         System.out.println("dbname: " + dbname + ", raccession: " + raccession);
> >                                         System.out.println("  took " + duration + "ms");
> >                                 }
> >                                 RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
> >                                 rlistener.getCurrentFeature().addRankedCrossRef(rcr);
> > --- snap ---
> >
> > Which leads to:
> >
> > --- snip ---
> > dbname: GeneID, raccession: 677739
> >   took 3291ms
> > dbname: HGNC, raccession: 31847
> >   took 2427ms
> > dbname: GeneID, raccession: 55344
> >   took 2932ms
> > dbname: HGNC, raccession: 23148
> >   took 2339ms
> > dbname: GI, raccession: 94158612
> >   took 2418ms
> > dbname: GI, raccession: 8922995
> >   took 2920ms
> > [...]
> > --- snap ---
> >
> > Which are all /db_xref properties of the NC_000023.gbk file. Searching
> > deeper, it looks like for every CrossRef object loaded, the whole
> > BioEntry object is built and the sequence parsed. But remember, this only
> > happens on chromosome 23, not on 24, which has /db_xref, too.
> >
> > I already spent some time on this, but I can't figure out, what could be
> > the cause.
> >
> >
> > Thanks
> >   Florian
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985
Fax: +49 7071 / 29 5091

From ola.spjuth at farmbio.uu.se Sat Jul 18 14:22:03 2009
From: ola.spjuth at farmbio.uu.se (Ola Spjuth)
Date: Sat, 18 Jul 2009 16:22:03 +0200
Subject: [Biojava-l] Load Genbank files takes ages
In-Reply-To: <200907171403.55631.florian.mittag@uni-tuebingen.de>
References: <200907161738.29913.florian.mittag@uni-tuebingen.de>
 <93b45ca50907161833v1d6fed1fyd0030bf37271889d@mail.gmail.com>
 <200907171403.55631.florian.mittag@uni-tuebingen.de>
Message-ID:

On 17 jul 2009, at 14.03, Florian Mittag wrote:

> Thanks for the quick answer!
>
> On Friday 17 July 2009 03:33, Mark Schreiber wrote:
>> Have you considered running a profiler?
>
> Yes, I have considered this, but the profilers I know for Eclipse
> are a pain in the a** and don't work, so I will have to use NetBeans
> or something to do profiling.

I just wanted to comment on this since I had the same experience, and
just a few days ago I tried YourKit (http://yourkit.com/). It
integrated very nicely with Eclipse and worked extremely well. Best is
that they offer free licenses for open source projects.

Recommended!

/Ola

From koen.bruynseels at cropdesign.com Sat Jul 18 16:14:23 2009
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Sat, 18 Jul 2009 18:14:23 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID:

I will be out of the office starting 07/17/2009 and will not return until
07/23/2009.

I will respond to your message when I return.

From pzgyuanf at gmail.com Sun Jul 19 03:41:27 2009
From: pzgyuanf at gmail.com (pprun)
Date: Sun, 19 Jul 2009 11:41:27 +0800
Subject: [Biojava-l] Can I get the additional information, such as chromosome number from "source" feature from genbank file?
Message-ID:

Hi,

Given a genbank file with feature as this:

FEATURES             Location/Qualifiers
     source          1..412
                     /chromosome="12"
                     /map="12q22-qter"
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"


How can I get the chromosome="12" information? I need it to sort out the
sequence by chromosome.

Appreciate your help.
Pprun

From holland at eaglegenomics.com Sun Jul 19 10:16:11 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 19 Jul 2009 11:16:11 +0100
Subject: [Biojava-l] Can I get the additional information, such as chromosome number from "source" feature from genbank file?
In-Reply-To:
References:
Message-ID: <1247998571.28340.1.camel@buzzybee>

It's in the RichAnnotation object associated with the RichFeature inside
the parsed RichSequence object (if you're using the BioJavaX
GenbankFormat parser). The RichAnnotation is a key/value map - the keys
are term objects, which you can find by requesting the term for
"chromosome" from the default ontology. You can then search the map for
the matching key/value pair.

cheers,
Richard

On Sun, 2009-07-19 at 11:41 +0800, pprun wrote:
> Hi,
>
> Given a genbank file with feature as this:
>
> FEATURES             Location/Qualifiers
>      source          1..412
>                      /chromosome="12"
>                      /map="12q22-qter"
>                      /organism="Homo sapiens"
>                      /db_xref="taxon:9606"
>
>
> How can I get the chromosome="12" information? I need it to sort out the
> sequence by chromosome.
>
> Appreciate your help.
> Pprun
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
-- 
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From pzgyuanf at gmail.com Sun Jul 19 14:26:05 2009
From: pzgyuanf at gmail.com (pprun)
Date: Sun, 19 Jul 2009 22:26:05 +0800
Subject: [Biojava-l] Can I get the additional information, such as chromosome number from "source" feature from genbank file?
In-Reply-To: <1247998571.28340.1.camel@buzzybee>
References: <1247998571.28340.1.camel@buzzybee>
Message-ID: <4A632CFD.70200@gmail.com>

Thanks Richard for supplying this detail process.
Also nice to know you are still working back-stage of Biojava. :)

Pprun

Richard Holland wrote:
> It's in the RichAnnotation object associated with the RichFeature inside
> the parsed RichSequence object (if you're using the BioJavaX
> GenbankFormat parser). The RichAnnotation is a key/value map - the keys
> are term objects, which you can find by requesting the term for
> "chromosome" from the default ontology. You can then search the map for
> the matching key/value pair.
>
> cheers,
> Richard
>
> On Sun, 2009-07-19 at 11:41 +0800, pprun wrote:
>
>>Hi,
>>
>>Given a genbank file with feature as this:
>>
>>FEATURES             Location/Qualifiers
>>     source          1..412
>>                     /chromosome="12"
>>                     /map="12q22-qter"
>>                     /organism="Homo sapiens"
>>                     /db_xref="taxon:9606"
>>
>>
>>How can I get the chromosome="12" information? I need it to sort out the
>>sequence by chromosome.
>>
>>Appreciate your help.
>>Pprun
>>
>>_______________________________________________
>>Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>http://lists.open-bio.org/mailman/listinfo/biojava-l

From andreas at sdsc.edu Tue Jul 21 02:20:35 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 20 Jul 2009 19:20:35 -0700
Subject: [Biojava-l] Pairwise alignment of protein sequences
In-Reply-To:
References:
Message-ID: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com>

Hi David,

I patched the SequenceAlignent class in svn. It now displays more
scores in the produced alignment image. Also you can now request the
strings for the aligned sequences from the outside. Alignment score is
the return value from the pairwiseAlignment method....

Hope that helps...
Andreas


On Thu, Jul 9, 2009 at 11:13 AM, David Zhao wrote:
> David Zhao gmail.com> writes:
> I found the answer from a post here(http://portal.open-
> bio.org/pipermail/biojava-l/2007-February.txt#). In short, I need to use
> AlphabetManager.alphabetForName("PROTEIN-TERM"); instead of
> AlphabetManager.alphabetForName("PROTEIN"); to parse the matrix file.
> Another question though, now I have the alignment,
> how do I retrieve score and
> mismatch, gap information from it?
>  Time (ms):     62
>  Length:        25
>  Score:        180
>  Query:        query,  Length: 25
>  Target:       target, Length: 25
>
>
> Query:    1 KLFVGGIKEDTEEHHLRDYFEEYGK 25
>             | ||||||||||||||||||| |||
> Target:   1 KIFVGGIKEDTEEHHLRDYFEQYGK 25
>
> Thanks!
>
> David
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From andreas at sdsc.edu Tue Jul 21 04:26:40 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 20 Jul 2009 21:26:40 -0700
Subject: [Biojava-l] feature request Sequence Alignment
Message-ID: <59a41c430907202126y66716314n5d82b5a47890c9a2@mail.gmail.com>

Hi Andreas,

I was working with the Sequence Alignment package today. In particular
I would be interested to have an alignment display that looks nice in
HTML. Playing around with the code it seems the alignment image
generation is closely tied to the actual alignment implementation. Do
you think it would be possible to change this a bit and provide a way
so there could be multiple ways to print out (display) an alignment?

Ideally the core of an alignment would be just a data-container (a
bean?) and the alignment calculation would operate on this bean. Then
after the alignment has been calculated other objects could be used to
provide a print out based on the data in the container-bean.

Would that make sense and do you think this would be difficult to implement?

Thanks,
Andreas (the other)

From andreas.draeger at uni-tuebingen.de Tue Jul 21 05:35:22 2009
From: andreas.draeger at uni-tuebingen.de (Andreas Draeger)
Date: Tue, 21 Jul 2009 07:35:22 +0200
Subject: [Biojava-l] feature request Sequence Alignment
In-Reply-To: <59a41c430907202126y66716314n5d82b5a47890c9a2@mail.gmail.com>
References: <59a41c430907202126y66716314n5d82b5a47890c9a2@mail.gmail.com>
Message-ID: <4A65539A.4010805@uni-tuebingen.de>

Hi Andreas,

What you suggest makes sense. I put it on my todo list. I think some
derivatives of the existing Alignment interface should be used as
objects that can be displayed in HTML. I'll play around with it a bit.

Cheers
Andreas (the other)

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091

From andreas.draeger at uni-tuebingen.de Tue Jul 21 08:25:27 2009
From: andreas.draeger at uni-tuebingen.de (Andreas Draeger)
Date: Tue, 21 Jul 2009 10:25:27 +0200
Subject: [Biojava-l] How to parse pairwise alignment result string
In-Reply-To:
References:
Message-ID: <4A657B77.5030904@uni-tuebingen.de>

Hi David!

Yes, the NeedlemanWunsch class and also the SmithWaterman class should
not produce these Strings but rather dedicated Alignment objects that
can be further processed more easily. I think they already do produce
some alignment object but this seems to be sub-optimal. I am going to
improve this but this will take some time.

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091

From pzgyuanf at gmail.com Wed Jul 22 14:03:04 2009
From: pzgyuanf at gmail.com (pprun)
Date: Wed, 22 Jul 2009 22:03:04 +0800
Subject: [Biojava-l] Request For Feature: let SimpleNamespace or Namespace be Serializable
Message-ID:

Hi,

You know all the rich sequence format parsers, such as readGenbankDNA,
has a Namespace parameter. Currently it prevents the related code to be
used in RMI framework.

What do you think about it?

Thanks,
Pprun

From pzgyuanf at gmail.com Wed Jul 22 14:25:59 2009
From: pzgyuanf at gmail.com (pprun)
Date: Wed, 22 Jul 2009 22:25:59 +0800
Subject: [Biojava-l] Pairwise alignment of protein sequences
In-Reply-To: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com>
References: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com>
Message-ID: <4A672177.5040505@gmail.com>

Andreas,

How about also add the 'quality', 'Percent Identity' and 'Percent
Similarity' values into these alignment result as the GAP does?
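
For reference, percent identity can be derived from the two aligned strings
alone. The following is only a minimal sketch in plain Java, not the BioJava
implementation: the class name PercentIdentity, the '-' gap character and the
choice to divide by the number of alignment columns are all assumptions made
for illustration (GAP and other tools may use a different denominator).

```java
/**
 * Illustrative helper, not part of BioJava: computes percent identity
 * from two already-aligned strings of equal length.
 */
final class PercentIdentity {

    /** Assumed gap character in the aligned strings. */
    private static final char GAP = '-';

    private PercentIdentity() { }

    /**
     * Percentage of alignment columns in which both strings carry the same
     * non-gap character. The denominator is the number of columns.
     */
    static double percentIdentity(String alignedQuery, String alignedTarget) {
        if (alignedQuery.length() != alignedTarget.length()) {
            throw new IllegalArgumentException("aligned strings must have equal length");
        }
        if (alignedQuery.length() == 0) {
            return 0.0;
        }
        int identical = 0;
        for (int i = 0; i < alignedQuery.length(); i++) {
            char q = alignedQuery.charAt(i);
            char t = alignedTarget.charAt(i);
            // A column counts only if both residues are equal and not a gap.
            if (q != GAP && q == t) {
                identical++;
            }
        }
        return 100.0 * identical / alignedQuery.length();
    }

    public static void main(String[] args) {
        // The 25-residue alignment quoted in this thread.
        String query  = "KLFVGGIKEDTEEHHLRDYFEEYGK";
        String target = "KIFVGGIKEDTEEHHLRDYFEQYGK";
        System.out.println(percentIdentity(query, target));
    }
}
```

For the alignment quoted in this thread, 23 of the 25 columns are identical
(positions 2 and 22 differ), so the sketch reports 92 percent.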
Thanks,
Pprun


Andreas Prlic wrote:
> Hi David,
>
> I patched the SequenceAlignent class in svn. It now displays more
> scores in the produced alignment image. Also you can now request the
> strings for the aligned sequences from the outside. Alignment score is
> the return value from the pairwiseAlignment method....
>
> Hope that helps...
> Andreas
>
>
> On Thu, Jul 9, 2009 at 11:13 AM, David Zhao wrote:
>
>>David Zhao gmail.com> writes:
>>I found the answer from a post here(http://portal.open-
>>bio.org/pipermail/biojava-l/2007-February.txt#). In short, I need to use
>>AlphabetManager.alphabetForName("PROTEIN-TERM"); instead of
>>AlphabetManager.alphabetForName("PROTEIN"); to parse the matrix file.
>>Another question though, now I have the alignment,
>>how do I retrieve score and
>>mismatch, gap information from it?
>> Time (ms):     62
>> Length:        25
>> Score:        180
>> Query:        query,  Length: 25
>> Target:       target, Length: 25
>>
>>
>>Query:    1 KLFVGGIKEDTEEHHLRDYFEEYGK 25
>>            | ||||||||||||||||||| |||
>>Target:   1 KIFVGGIKEDTEEHHLRDYFEQYGK 25
>>
>>Thanks!
>>
>>David
>>
>>
>>_______________________________________________
>>Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From pzgyuanf at gmail.com Wed Jul 22 14:34:49 2009
From: pzgyuanf at gmail.com (pprun)
Date: Wed, 22 Jul 2009 22:34:49 +0800
Subject: [Biojava-l] Compile error: unmappable character for encoding UTF-8
Message-ID:

Hi there,

For a long time (years) I have been getting this compile error when
trying to compile the source code:

GUITools.java:14: unmappable character for encoding UTF-8
 * @author Kalle N?slund
StructureException.java:29: unmappable character for encoding UTF-8
 * @author Andreas Prlic, Thomas Down, Benjamin Schuster-B?ckler

I trust you are not using UTF-8 in your development environments. I'm
also aware that other global open source projects are adopting a
convention to solve this problem: using the '\uxxxx' escape, so that the
sources contain only US-ASCII characters.

Sorry! N?slund and Schuster-B?ckler.

Pprun

From andreas at sdsc.edu Wed Jul 22 15:16:05 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 22 Jul 2009 08:16:05 -0700
Subject: [Biojava-l] Pairwise alignment of protein sequences
In-Reply-To: <4A672177.5040505@gmail.com>
References: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com>
 <4A672177.5040505@gmail.com>
Message-ID: <59a41c430907220816y33336d8bx76d4f999101476df@mail.gmail.com>

Hi PPrun,

not sure about how to calculate quality, but the other scores are there now.

Andreas

On Wed, Jul 22, 2009 at 7:25 AM, pprun wrote:
> Andreas,
>
> How about also add the 'quality', 'Percent Identity' and 'Percent
> Similarity' values into these alignment result as the GAP does?
>
> Thanks,
> Pprun
>
>
> Andreas Prlic wrote:
>
>> Hi David,
>>
>> I patched the SequenceAlignent class in svn. It now displays more
>> scores in the produced alignment image. Also you can now request the
>> strings for the aligned sequences from the outside. Alignment score is
>> the return value from the pairwiseAlignment method....
>>
>> Hope that helps...
>> Andreas
>>
>>
>> On Thu, Jul 9, 2009 at 11:13 AM, David Zhao wrote:
>>
>>> David Zhao gmail.com> writes:
>>> I found the answer from a post here(http://portal.open-
>>> bio.org/pipermail/biojava-l/2007-February.txt#). In short, I need to use
>>> AlphabetManager.alphabetForName("PROTEIN-TERM"); instead of
>>> AlphabetManager.alphabetForName("PROTEIN"); to parse the matrix file.
>>> Another question though, now I have the alignment,
>>> how do I retrieve score and
>>> mismatch, gap information from it?
>>>  Time (ms):     62
>>>  Length:        25
>>>  Score:        180
>>>  Query:        query,  Length: 25
>>>  Target:       target, Length: 25
>>>
>>>
>>> Query:    1 KLFVGGIKEDTEEHHLRDYFEEYGK 25
>>>             | ||||||||||||||||||| |||
>>> Target:   1 KIFVGGIKEDTEEHHLRDYFEQYGK 25
>>>
>>> Thanks!
>>>
>>> David
>>>
>>>
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>

From holland at eaglegenomics.com Wed Jul 22 15:20:43 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 22 Jul 2009 16:20:43 +0100
Subject: [Biojava-l] Request For Feature: let SimpleNamespace or Namespace be Serializable
In-Reply-To:
References:
Message-ID: <1248276043.28124.1.camel@buzzybee>

It's there because all sequences have to belong to a namespace (to
prevent duplicate identifiers from different sources from clashing).
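
As long as Namespace is not Serializable, one possible workaround is to send
only the namespace name across the RMI boundary and look the Namespace up
again on the receiving side. The sketch below is plain Java and deliberately
leaves BioJava out: NamespaceRef is a hypothetical holder class invented for
this example, and the lookup mentioned in the comment is an assumption about
how the receiving side would resolve the name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical serializable stand-in for a Namespace: carries only its name. */
final class NamespaceRef implements Serializable {

    private static final long serialVersionUID = 1L;

    private final String name;

    NamespaceRef(String name) {
        if (name == null) {
            throw new IllegalArgumentException("name must not be null");
        }
        this.name = name;
    }

    String getName() {
        return name;
    }

    /** Serializes and deserializes the reference, roughly what an RMI call does. */
    static NamespaceRef roundTrip(NamespaceRef ref) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(ref);
            out.close();
            ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            try {
                return (NamespaceRef) in.readObject();
            } finally {
                in.close();
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        NamespaceRef received = roundTrip(new NamespaceRef("genbank"));
        // On the receiving side the name would be turned back into a Namespace,
        // e.g. (assumption) via RichObjectFactory.getObject(SimpleNamespace.class,
        // new Object[] { received.getName() }).
        System.out.println(received.getName());
    }
}
```

This keeps the remote interface free of BioJava types; whether that is
acceptable depends on how much else of the parsed sequence has to travel
over RMI.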
On Wed, 2009-07-22 at 22:03 +0800, pprun wrote: > Hi, > > You know all the rich sequence format parsers, such as readGenbankDNA, > has a Namespace parameter. Currently it prevents the related code to be > used in RMI framework. > > What do you think about it? > > Thanks, > Pprun > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From pzgyuanf at gmail.com Wed Jul 22 15:36:49 2009 From: pzgyuanf at gmail.com (pprun) Date: Wed, 22 Jul 2009 23:36:49 +0800 Subject: [Biojava-l] Pairwise alignment of protein sequences In-Reply-To: <59a41c430907220816y33336d8bx76d4f999101476df@mail.gmail.com> References: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com> <4A672177.5040505@gmail.com> <59a41c430907220816y33336d8bx76d4f999101476df@mail.gmail.com> Message-ID: <4A673211.20503@gmail.com> Great! Just saw the new add-in after using the latest code in Trunk. Cheers Pprun Andreas Prlic wrote: > Hi PPrun, > > not sure about how to calculate quality, but the other scores are there now. > > Andreas > > On Wed, Jul 22, 2009 at 7:25 AM, pprun wrote: > >>Andreas, >> >>How about also add the 'quality', 'Percent Identity' and 'Percent >>Similarity' values into these alignment result as the GAP does? >> >>Thanks, >>Pprun >> >> >>Andreas Prlic wrote: >> >> >>>Hi David, >>> >>>I patched the SequenceAlignent class in svn. It now displays more >>>scores in the produced alignment image. Also you can now request the >>>strings for the aligned sequences from the outside. Alignment score is >>>the return value from the pairwiseAlignment method.... >>> >>>Hope that helps... 
>>>Andreas >>> >>> >>>On Thu, Jul 9, 2009 at 11:13 AM, David Zhao wrote: >>> >>> >>>>David Zhao gmail.com> writes: >>>>I found the answer from a post here(http://portal.open- >>>>bio.org/pipermail/biojava-l/2007-February.txt#). In short, I need to use >>>>AlphabetManager.alphabetForName("PROTEIN-TERM"); instead of >>>>AlphabetManager.alphabetForName("PROTEIN"); to parse the matrix file. >>>>Another question though, now I have the alignment, >>>>how do I retrieve score and >>>>mismatch, gap information from it? >>>>Time (ms): 62 >>>>Length: 25 >>>>Score: 180 >>>>Query: query, Length: 25 >>>>Target: target, Length: 25 >>>> >>>> >>>>Query: 1 KLFVGGIKEDTEEHHLRDYFEEYGK 25 >>>> | ||||||||||||||||||| ||| >>>>Target: 1 KIFVGGIKEDTEEHHLRDYFEQYGK 25 >>>> >>>>Thanks! >>>> >>>>David >>>> >>>> >>>>_______________________________________________ >>>>Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> >>> >>> >>>_______________________________________________ >>>Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>http://lists.open-bio.org/mailman/listinfo/biojava-l >>> >> >> > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l >
From pzgyuanf at gmail.com Wed Jul 22 15:39:15 2009 From: pzgyuanf at gmail.com (pprun) Date: Wed, 22 Jul 2009 23:39:15 +0800 Subject: [Biojava-l] Request For Feature: let SimpleNamespace or Namespace be Serializable In-Reply-To: <1248276043.28124.1.camel@buzzybee> References: <1248276043.28124.1.camel@buzzybee> Message-ID: <4A6732A3.7040203@gmail.com> I don't want to remove this parameter from the API, I proposed the Namespace implements Serializable interface. Then the code can be used in RMI framework. Pprun Richard Holland wrote: > It's there because all sequences have to belong to a namespace (to > prevent duplicate identifiers from different sources from clashing). > > On Wed, 2009-07-22 at 22:03 +0800, pprun wrote: > >>Hi, >> >>You know all the rich sequence format parsers, such as readGenbankDNA, >>has a Namespace parameter. Currently it prevents the related code to be >>used in RMI framework. >> >>What do you think about it?
>> >>Thanks, >>Pprun >> >>_______________________________________________ >>Biojava-l mailing list - Biojava-l at lists.open-bio.org >>http://lists.open-bio.org/mailman/listinfo/biojava-l
From holland at eaglegenomics.com Wed Jul 22 15:53:01 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 22 Jul 2009 15:53:01 +0000 Subject: [Biojava-l] Request For Feature: let SimpleNamespace or Namespace be Serializable In-Reply-To: <4A6732A3.7040203@gmail.com> References: <1248276043.28124.1.camel@buzzybee> <4A6732A3.7040203@gmail.com> Message-ID: <1248277981.28124.18.camel@buzzybee> OK sounds good. On Wed, 2009-07-22 at 23:39 +0800, pprun wrote: > I don't want to remove this parameter from the API, > I proposed the Namespace implements Serializable interface. > Then the code can be used in RMI framework.
> > Pprun > > > Richard Holland wrote: > > > It's there because all sequences have to belong to a namespace (to > > prevent duplicate identifiers from different sources from clashing). > > > > On Wed, 2009-07-22 at 22:03 +0800, pprun wrote: > > > >>Hi, > >> > >>You know all the rich sequence format parsers, such as readGenbankDNA, > >>has a Namespace parameter. Currently it prevents the related code to be > >>used in RMI framework. > >> > >>What do you think about it? > >> > >>Thanks, > >>Pprun > >> > >>_______________________________________________ > >>Biojava-l mailing list - Biojava-l at lists.open-bio.org > >>http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From andreas at sdsc.edu Wed Jul 22 17:25:12 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 22 Jul 2009 10:25:12 -0700 Subject: [Biojava-l] Compile error: unmappable character for encoding UTF-8 In-Reply-To: References: Message-ID: <59a41c430907221025vc6163c0mf39e0fb4276a3d2@mail.gmail.com> Hi, I have not encountered this problem before, I suppose this is because my standard encoding is ISO-8859-1. Any suggestions for how to set the default encoding for the files? Andreas On Wed, Jul 22, 2009 at 7:34 AM, pprun wrote: > Hi there, > > It has been a long time(years), I got this compile error when I trying to > compile source code: > > GUITools.java:14: unmappable character for encoding UTF-8 > ?* @author Kalle N?slund > > StructureException.java:29: unmappable character for encoding UTF-8 > ?* @author Andreas Prlic, Thomas Down, Benjamin Schuster-B?ckler > > > I trust you are not UTF-8 for your develpment environments, > I'm also awared that other global opern source projects are adopting a > convention to solve this problem: by using the '\uxxxx' escape to US-ASCII > characters. > > > Sorry! N?slund and Schuster-B?ckler. 
> > Pprun > > _______________________________________________ > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From holland at eaglegenomics.com Wed Jul 22 17:32:46 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 22 Jul 2009 18:32:46 +0100 Subject: [Biojava-l] Compile error: unmappable character for encoding UTF-8 In-Reply-To: <59a41c430907221025vc6163c0mf39e0fb4276a3d2@mail.gmail.com> References: <59a41c430907221025vc6163c0mf39e0fb4276a3d2@mail.gmail.com> Message-ID: <1248283966.28124.41.camel@buzzybee> It's usually an operating system thing. The encoding used relates entirely to the way the user has chosen to save/read the file on disk after they've transferred it from our repository (which is encoded correctly). In this case, I expect the user's OS default is UTF-8 and unless they specify otherwise, all files get saved in that encoding. Having said that, the \uxxxx suggestion is not a bad idea. On Wed, 2009-07-22 at 10:25 -0700, Andreas Prlic wrote: > Hi, > > I have not encountered this problem before, I suppose this is because > my standard encoding is ISO-8859-1. Any suggestions for how to set the > default encoding for the files? > > Andreas > > > On Wed, Jul 22, 2009 at 7:34 AM, pprun wrote: > > Hi there, > > > > It has been a long time(years), I got this compile error when I trying to > > compile source code: > > > > GUITools.java:14: unmappable character for encoding UTF-8 > > * @author Kalle N?slund > > > > StructureException.java:29: unmappable character for encoding UTF-8 > > * @author Andreas Prlic, Thomas Down, Benjamin Schuster-B?ckler > > > > > > I trust you are not UTF-8 for your develpment environments, > > I'm also awared that other global opern source projects are adopting a > > convention to solve this problem: by using the '\uxxxx' escape to US-ASCII > > characters. > > > > > > Sorry! N?slund and Schuster-B?ckler. 
> > > > Pprun > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From andreas.draeger at uni-tuebingen.de Wed Jul 22 21:25:25 2009 From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=) Date: Wed, 22 Jul 2009 23:25:25 +0200 Subject: [Biojava-l] Pairwise alignment of protein sequences In-Reply-To: <59a41c430907220816y33336d8bx76d4f999101476df@mail.gmail.com> References: <59a41c430907201920m3b696aa9yda9c9d3a7f3d4cfa@mail.gmail.com> <4A672177.5040505@gmail.com> <59a41c430907220816y33336d8bx76d4f999101476df@mail.gmail.com> Message-ID: <4A6783C5.7040409@uni-tuebingen.de> Hi guys, > not sure about how to calculate quality, but the other scores are there now. > >> How about also add the 'quality', 'Percent Identity' and 'Percent >> Similarity' values into these alignment result as the GAP does >> Yes, calculating "quality" or "similarity" is a bit unclear. The score is actually the measurement for similarity and therefore also kind of quality. What can be done is to add a feature for the percent identity. However, we have to distinguish between two different things: An alphabet may contain matching symbols, i.e., symbols that are considered equivalent, and identical symbols. This fact makes it a bit harder to calculate the identity because there is the question if we should consider matching symbols identical. Cheers, Andreas -- Dipl.-Bioinform. 
Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany
Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From andreas.draeger at uni-tuebingen.de Wed Jul 22 21:30:12 2009
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Wed, 22 Jul 2009 23:30:12 +0200
Subject: [Biojava-l] Compile error: unmappable character for encoding UTF-8
In-Reply-To: <1248283966.28124.41.camel@buzzybee>
References: <59a41c430907221025vc6163c0mf39e0fb4276a3d2@mail.gmail.com> <1248283966.28124.41.camel@buzzybee>
Message-ID: <4A6784E4.3010105@uni-tuebingen.de>

Richard Holland wrote:
> It's usually an operating system thing. The encoding used relates
> entirely to the way the user has chosen to save/read the file on disk
> after they've transferred it from our repository (which is encoded
> correctly). In this case, I expect the user's OS default is UTF-8 and
> unless they specify otherwise, all files get saved in that encoding.
>
> Having said that, the \uxxxx suggestion is not a bad idea.
>
>
>>> StructureException.java:29: unmappable character for encoding UTF-8
>>> * @author Andreas Prlic, Thomas Down, Benjamin Schuster-B?ckler
>>>

Hi guys,

Actually, in the comment fields appropriate HTML codes should be applied to encode special characters. In this case I guess it should read "Benjamin Schuster-B&uuml;ckler" to obtain the German u-umlaut. To avoid such problems, people should not simply insert any special character but look it up in designated HTML code tables and use the correct code.

Cheers,
Andreas

--
Dipl.-Bioinform.
Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany
Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From jp at javaclass.co.uk Thu Jul 23 11:15:47 2009
From: jp at javaclass.co.uk (JP)
Date: Thu, 23 Jul 2009 12:15:47 +0100
Subject: [Biojava-l] Ontology OBO is_a TERMs
Message-ID: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com>

Hi there at Biojava,

I have an ontology file (from www.geneontology.org, gene_ontology.1_2.obo). A typical entry for a term is:

[Term]
id: GO:0000025
name: maltose catabolic process
namespace: biological_process
def: "The chemical reactions and pathways resulting in the breakdown of the disaccharide maltose (4-O-alpha-D-glucopyranosyl-D-glucopyranose)." [GOC:jl, ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"]
subset: gosubset_prok
synonym: "malt sugar catabolic process" EXACT []
synonym: "malt sugar catabolism" EXACT []
synonym: "maltose breakdown" EXACT []
synonym: "maltose degradation" EXACT []
synonym: "maltose hydrolysis" NARROW []
xref: MetaCyc:MALTOSECAT-PWY
is_a: GO:0000023 ! maltose metabolic process
is_a: GO:0046352 ! disaccharide catabolic process

I am reading this with the code suggested in:
http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse

I would like to get the is_a entries (as Term). Is this possible? I tried to find this everywhere (annotations?) but could not find it (Google searches included).

Many thanks,
JP

From peter.midford at gmail.com Thu Jul 23 15:19:10 2009
From: peter.midford at gmail.com (Peter Midford)
Date: Thu, 23 Jul 2009 11:19:10 -0400
Subject: [Biojava-l] Ontology OBO is_a TERMs
In-Reply-To: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com>
References: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com>
Message-ID:

JP,
Looking at the code for OboFileHandler.java, fresh from svn, it looks like they're presently being dropped on the floor.
Perhaps someone should either implement obo restrictions or links or build some triples here, as they seem to be used for the rest of the ontology code. } else if (key.equals(IS_A) || key.equals(RELATIONSHIP) || key.equals(DISJOINT_FROM) || key.equals(INTERSECTION_OF) || key.equals(SUBSET)) { //TODO: deal with relationships } else if (key.equals(COMMENT)){ Peter On Jul 23, 2009, at 7:15, JP wrote: > Hi there at Biojava, > > I have an ontology file (from www.geneontology.org, gene_ontology. > 1_2.obo). > A typical entry for a term is: > > [Term] > id: GO:0000025 > name: maltose catabolic process > namespace: biological_process > def: "The chemical reactions and pathways resulting in the breakdown > of the > disaccharide maltose (4-O-alpha-D-glucopyranosyl-D- > glucopyranose)." [GOC:jl, > ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular > Biology"] > subset: gosubset_prok > synonym: "malt sugar catabolic process" EXACT [] > synonym: "malt sugar catabolism" EXACT [] > synonym: "maltose breakdown" EXACT [] > synonym: "maltose degradation" EXACT [] > synonym: "maltose hydrolysis" NARROW [] > xref: MetaCyc:MALTOSECAT-PWY > is_a: GO:0000023 ! maltose metabolic process > is_a: GO:0046352 ! disaccharide catabolic process > > I am reading this with the code suggested in: > http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse > I would like to get the is_a entries (as Term) - is this possible ? > I tried > to find this everywhere (annotations?) but find it (google searches > included). > > Many Thanks > JP > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l Peter E. 
Midford Mesquite Developer Peter.Midford at gmail.com From jp at javaclass.co.uk Thu Jul 23 15:29:02 2009 From: jp at javaclass.co.uk (JP) Date: Thu, 23 Jul 2009 16:29:02 +0100 Subject: [Biojava-l] Ontology OBO is_a TERMs In-Reply-To: References: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com> Message-ID: <4adc29060907230829t60ec715h31b7b8b22c041e94@mail.gmail.com> I never quite got this Peter, what is a triple ? Could these simply be considered as annotations ? Or are you thinking in the lines of building hierarchies out of these (I take it this is the most common task). These relationships are *fundamental* for any work of ontology. 2009/7/23 Peter Midford > JP, Looking at the code for OboFileHandler.java, fresh from svn, it > looks like they're presently being dropped on the floor. Perhaps someone > should either implement obo restrictions or links or build some triples > here, as they seem to be used for the rest of the ontology code. > > > > } else if (key.equals(IS_A) || > key.equals(RELATIONSHIP) || > key.equals(DISJOINT_FROM) || > key.equals(INTERSECTION_OF) || > key.equals(SUBSET)) { > //TODO: deal with relationships > > > > } else if (key.equals(COMMENT)){ > > > Peter > > On Jul 23, 2009, at 7:15, JP wrote: > > Hi there at Biojava, > > I have an ontology file (from www.geneontology.org, > gene_ontology.1_2.obo). > A typical entry for a term is: > > [Term] > id: GO:0000025 > name: maltose catabolic process > namespace: biological_process > def: "The chemical reactions and pathways resulting in the breakdown of the > disaccharide maltose (4-O-alpha-D-glucopyranosyl-D-glucopyranose)." 
> [GOC:jl, > ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"] > subset: gosubset_prok > synonym: "malt sugar catabolic process" EXACT [] > synonym: "malt sugar catabolism" EXACT [] > synonym: "maltose breakdown" EXACT [] > synonym: "maltose degradation" EXACT [] > synonym: "maltose hydrolysis" NARROW [] > xref: MetaCyc:MALTOSECAT-PWY > is_a: GO:0000023 ! maltose metabolic process > is_a: GO:0046352 ! disaccharide catabolic process > > I am reading this with the code suggested in: > http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse > I would like to get the is_a entries (as Term) - is this possible ? I tried > to find this everywhere (annotations?) but find it (google searches > included). > > Many Thanks > JP > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > Peter E. Midford > Mesquite Developer > Peter.Midford at gmail.com > > > > > From peter.midford at gmail.com Thu Jul 23 15:39:56 2009 From: peter.midford at gmail.com (Peter Midford) Date: Thu, 23 Jul 2009 11:39:56 -0400 Subject: [Biojava-l] Ontology OBO is_a TERMs In-Reply-To: <4adc29060907230829t60ec715h31b7b8b22c041e94@mail.gmail.com> References: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com> <4adc29060907230829t60ec715h31b7b8b22c041e94@mail.gmail.com> Message-ID: <0903E83C-81BA-427C-9176-0554E9642094@gmail.com> JP, No, these are not just annotations to terms, and the code jumbles together several things that should be separated. To properly handle the IS_A key, you will have to build the hierarchy, which you can do OBO style using links or restrictions (which I believe are a subclass of links in OBO, rather than adding an intermediate class to the ontology) or OWL style using triples (Subject, Predicate, Object) where is_a would be your predicate. I assume the other key values that look like set operations are for building restrictions. 
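To make the triple idea above concrete, here is a minimal sketch in plain Java. It does not use any BioJava class: the names IsATripleSketch, Triple and parseIsA are hypothetical and chosen only for illustration, and the stanza lines are taken from the GO:0000025 entry quoted in this thread. It assumes one [Term] stanza is already available as a list of lines and simply turns each is_a line into a (Subject, Predicate, Object) triple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of BioJava: collects is_a links as triples.
class IsATripleSketch {

    // A plain (subject, predicate, object) triple of term identifiers.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return "(" + subject + ", " + predicate + ", " + object + ")";
        }
    }

    // Reads the lines of one [Term] stanza and returns one triple per is_a line.
    static List<Triple> parseIsA(List<String> stanzaLines) {
        String id = null;
        List<String> parents = new ArrayList<String>();
        for (String line : stanzaLines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("id:")) {
                id = trimmed.substring("id:".length()).trim();
            } else if (trimmed.startsWith("is_a:")) {
                String value = trimmed.substring("is_a:".length());
                int bang = value.indexOf('!'); // drop the trailing "! name" comment
                if (bang >= 0) {
                    value = value.substring(0, bang);
                }
                parents.add(value.trim());
            }
        }
        List<Triple> triples = new ArrayList<Triple>();
        if (id != null) {
            for (String parent : parents) {
                triples.add(new Triple(id, "is_a", parent));
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        List<String> stanza = Arrays.asList(
                "id: GO:0000025",
                "name: maltose catabolic process",
                "is_a: GO:0000023 ! maltose metabolic process",
                "is_a: GO:0046352 ! disaccharide catabolic process");
        for (Triple t : parseIsA(stanza)) {
            System.out.println(t);
        }
    }
}
```

Walking such triples from child to parent would give the hierarchy; how they would map onto the existing ontology classes in BioJava is left open here.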
Peter On Jul 23, 2009, at 11:29, JP wrote: > I never quite got this Peter, what is a triple ? > Could these simply be considered as annotations ? Or are you > thinking in the lines of building hierarchies out of these (I take > it this is the most common task). > > These relationships are *fundamental* for any work of ontology. > > 2009/7/23 Peter Midford > JP, > Looking at the code for OboFileHandler.java, fresh from svn, > it looks like they're presently being dropped on the floor. Perhaps > someone should either implement obo restrictions or links or build > some triples here, as they seem to be used for the rest of the > ontology code. > > > > } else if (key.equals(IS_A) || > key.equals(RELATIONSHIP) || > key.equals(DISJOINT_FROM) || > key.equals(INTERSECTION_OF) || > key.equals(SUBSET)) { > //TODO: deal with relationships > > > } else if (key.equals(COMMENT)){ > > > Peter > > On Jul 23, 2009, at 7:15, JP wrote: > >> Hi there at Biojava, >> >> I have an ontology file (from www.geneontology.org, gene_ontology. >> 1_2.obo). >> A typical entry for a term is: >> >> [Term] >> id: GO:0000025 >> name: maltose catabolic process >> namespace: biological_process >> def: "The chemical reactions and pathways resulting in the >> breakdown of the >> disaccharide maltose (4-O-alpha-D-glucopyranosyl-D- >> glucopyranose)." [GOC:jl, >> ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular >> Biology"] >> subset: gosubset_prok >> synonym: "malt sugar catabolic process" EXACT [] >> synonym: "malt sugar catabolism" EXACT [] >> synonym: "maltose breakdown" EXACT [] >> synonym: "maltose degradation" EXACT [] >> synonym: "maltose hydrolysis" NARROW [] >> xref: MetaCyc:MALTOSECAT-PWY >> is_a: GO:0000023 ! maltose metabolic process >> is_a: GO:0046352 ! disaccharide catabolic process >> >> I am reading this with the code suggested in: >> http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse >> I would like to get the is_a entries (as Term) - is this possible ? 
>> I tried >> to find this everywhere (annotations?) but find it (google searches >> included). >> >> Many Thanks >> JP >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l > > Peter E. Midford > Mesquite Developer > Peter.Midford at gmail.com > > > > > Peter E. Midford Mesquite Developer Peter.Midford at gmail.com From andreas at sdsc.edu Thu Jul 23 17:37:27 2009 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 23 Jul 2009 10:37:27 -0700 Subject: [Biojava-l] Ontology OBO is_a TERMs In-Reply-To: <0903E83C-81BA-427C-9176-0554E9642094@gmail.com> References: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com> <4adc29060907230829t60ec715h31b7b8b22c041e94@mail.gmail.com> <0903E83C-81BA-427C-9176-0554E9642094@gmail.com> Message-ID: <59a41c430907231037n26d3f760v1642c69ba660a987@mail.gmail.com> I am the one who added the OboFileHandler (based on some original code from obo-edit). I was not sure how best to build up the datastructure representing the relationships in a memory efficient way at that time, so I left it out . Does anybody already have a solution from another project for that, that we could use here? Both links and triples sound like reasonable approaches to me. I think the original ideas in the Ontology framework were to support triples. I can have a look how difficult it would be to build up the hierarchy using that. (after I added some feature requests for the structure modules...) Andreas On Thu, Jul 23, 2009 at 8:39 AM, Peter Midford wrote: > JP, > ? ? ?No, these are not just annotations to terms, and the code jumbles > together several things that should be separated. 
?To properly handle the > IS_A key, you will have to build the hierarchy, which you can do OBO style > using links or restrictions (which I believe are a subclass of links in OBO, > rather than adding an intermediate class to the ontology) or OWL style using > triples (Subject, Predicate, Object) where is_a would be your predicate. ?I > assume the other key values that look like set operations are for building > restrictions. > > Peter > > > On Jul 23, 2009, at 11:29, JP wrote: > >> I never quite got this Peter, what is a triple ? >> Could these simply be considered as annotations ? ?Or are you thinking in >> the lines of building hierarchies out of these (I take it this is the most >> common task). >> >> These relationships are *fundamental* for any work of ontology. >> >> 2009/7/23 Peter Midford >> JP, >> ? ? ? Looking at the code for OboFileHandler.java, fresh from svn, it >> looks like they're presently being dropped on the floor. ?Perhaps someone >> should either implement obo restrictions or links or build some triples >> here, as they seem to be used for the rest of the ontology code. >> >> >> >> ? ? ? ? ? ? ? ? ? ? ? ?} else if (key.equals(IS_A) || >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?key.equals(RELATIONSHIP) || >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?key.equals(DISJOINT_FROM) || >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?key.equals(INTERSECTION_OF) || >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?key.equals(SUBSET)) { >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?//TODO: deal with relationships >> >> >> ? ? ? ? ? ? ? ? ? ? ? ?} else if (key.equals(COMMENT)){ >> >> >> Peter >> >> On Jul 23, 2009, at 7:15, JP wrote: >> >>> Hi there at Biojava, >>> >>> I have an ontology file (from www.geneontology.org, >>> gene_ontology.1_2.obo). 
>>> A typical entry for a term is: >>> >>> [Term] >>> id: GO:0000025 >>> name: maltose catabolic process >>> namespace: biological_process >>> def: "The chemical reactions and pathways resulting in the breakdown of >>> the >>> disaccharide maltose (4-O-alpha-D-glucopyranosyl-D-glucopyranose)." >>> [GOC:jl, >>> ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular >>> Biology"] >>> subset: gosubset_prok >>> synonym: "malt sugar catabolic process" EXACT [] >>> synonym: "malt sugar catabolism" EXACT [] >>> synonym: "maltose breakdown" EXACT [] >>> synonym: "maltose degradation" EXACT [] >>> synonym: "maltose hydrolysis" NARROW [] >>> xref: MetaCyc:MALTOSECAT-PWY >>> is_a: GO:0000023 ! maltose metabolic process >>> is_a: GO:0046352 ! disaccharide catabolic process >>> >>> I am reading this with the code suggested in: >>> http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse >>> I would like to get the is_a entries (as Term) - is this possible ? I >>> tried >>> to find this everywhere (annotations?) but find it (google searches >>> included). >>> >>> Many Thanks >>> JP >>> _______________________________________________ >>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> Peter E. Midford >> Mesquite Developer >> Peter.Midford at gmail.com >> >> >> >> >> > > Peter E. 
Midford > Mesquite Developer > Peter.Midford at gmail.com > > > > > _______________________________________________ > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From florian.mittag at uni-tuebingen.de Fri Jul 24 09:58:30 2009 From: florian.mittag at uni-tuebingen.de (Florian Mittag) Date: Fri, 24 Jul 2009 11:58:30 +0200 Subject: [Biojava-l] Ontology OBO is_a TERMs In-Reply-To: <0903E83C-81BA-427C-9176-0554E9642094@gmail.com> References: <4adc29060907230415t555baf8ala9ca20286c8e1b1e@mail.gmail.com> <4adc29060907230829t60ec715h31b7b8b22c041e94@mail.gmail.com> <0903E83C-81BA-427C-9176-0554E9642094@gmail.com> Message-ID: <200907241158.30719.florian.mittag@uni-tuebingen.de> On Thursday, 23. July 2009 17:39, Peter Midford wrote: > [...] To properly handle the IS_A key, you will have to build the hierarchy, > which you can do OBO style using links or restrictions (which I believe are > a subclass of links in OBO, rather than adding an intermediate class to > the ontology) or OWL style using triples (Subject, Predicate, Object) > where is_a would be your predicate. [...] I think you mean RDF style triples, no need to make it more complicated than necessary ;-) Although there are some restriction you need OWL for expressing them. Regards, Florian > On Jul 23, 2009, at 11:29, JP wrote: > > I never quite got this Peter, what is a triple ? > > Could these simply be considered as annotations ? Or are you > > thinking in the lines of building hierarchies out of these (I take > > it this is the most common task). > > > > These relationships are *fundamental* for any work of ontology. > > > > 2009/7/23 Peter Midford > > JP, > > Looking at the code for OboFileHandler.java, fresh from svn, > > it looks like they're presently being dropped on the floor. 
Perhaps > > someone should either implement obo restrictions or links or build > > some triples here, as they seem to be used for the rest of the > > ontology code. > > > > > > > > } else if (key.equals(IS_A) || > > key.equals(RELATIONSHIP) || > > key.equals(DISJOINT_FROM) || > > key.equals(INTERSECTION_OF) || > > key.equals(SUBSET)) { > > //TODO: deal with relationships > > > > > > } else if (key.equals(COMMENT)){ > > > > > > Peter > > > > On Jul 23, 2009, at 7:15, JP wrote: > >> Hi there at Biojava, > >> > >> I have an ontology file (from www.geneontology.org, gene_ontology. > >> 1_2.obo). > >> A typical entry for a term is: > >> > >> [Term] > >> id: GO:0000025 > >> name: maltose catabolic process > >> namespace: biological_process > >> def: "The chemical reactions and pathways resulting in the > >> breakdown of the > >> disaccharide maltose (4-O-alpha-D-glucopyranosyl-D- > >> glucopyranose)." [GOC:jl, > >> ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular > >> Biology"] > >> subset: gosubset_prok > >> synonym: "malt sugar catabolic process" EXACT [] > >> synonym: "malt sugar catabolism" EXACT [] > >> synonym: "maltose breakdown" EXACT [] > >> synonym: "maltose degradation" EXACT [] > >> synonym: "maltose hydrolysis" NARROW [] > >> xref: MetaCyc:MALTOSECAT-PWY > >> is_a: GO:0000023 ! maltose metabolic process > >> is_a: GO:0046352 ! disaccharide catabolic process > >> > >> I am reading this with the code suggested in: > >> http://biojava.open-bio.org/wiki/BioJava:CookBook:OBO:parse > >> I would like to get the is_a entries (as Term) - is this possible ? > >> I tried > >> to find this everywhere (annotations?) but find it (google searches > >> included). > >> > >> Many Thanks > >> JP > >> _______________________________________________ > >> Biojava-l mailing list - Biojava-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > Peter E. Midford > > Mesquite Developer > > Peter.Midford at gmail.com > > Peter E. 
Midford
> Mesquite Developer
> Peter.Midford at gmail.com

--
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091

From koen.bruynseels at cropdesign.com Fri Jul 24 16:49:34 2009
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Fri, 24 Jul 2009 18:49:34 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID:

I will be out of the office starting 07/24/2009 and will not return until
08/09/2009.

I will respond to your message when I return.

From florian.mittag at uni-tuebingen.de Fri Jul 24 17:05:22 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Fri, 24 Jul 2009 19:05:22 +0200
Subject: [Biojava-l] Load Genbank files takes ages
In-Reply-To: <200907171403.55631.florian.mittag@uni-tuebingen.de>
References: <200907161738.29913.florian.mittag@uni-tuebingen.de>
	<93b45ca50907161833v1d6fed1fyd0030bf37271889d@mail.gmail.com>
	<200907171403.55631.florian.mittag@uni-tuebingen.de>
Message-ID: <200907241905.22390.florian.mittag@uni-tuebingen.de>

Hi all,

this topic gets a little bit complicated, so I will try to summarize the
status quo:

If I parse the .gbk files without storing the resulting RichSequence objects
in the BioSQL database, the program crashes with an OutOfMemoryError at
chromosome 23, but it gets there quickly, so the speed of the parsing itself
is not the problem.

If I parse the .gbk files and store them in the DB, the program enters an
almost-infinite loop in which it rebuilds objects over and over again (without
reading any files). Since this process is likely to consume more memory than
parsing without storing, I expect it to crash after a long time with the same
OutOfMemoryError.

The profiler didn't reveal anything new or helpful.
It confirmed my observation that the method to construct sequence objects is
called over and over again, but it didn't reveal why. The memory profiling
showed nothing either, but since the problem only occurs when two other
chromosomes were parsed and stored in the DB before that, I assume it is a
problem with Hibernate and its caching behavior.

Because of the memory problems, I'll postpone the investigation of the
almost-infinite loop until I have resolved the memory problem (for which I
will open a new thread).

Unless anybody has another idea ;-)

Florian

On Friday, 17. July 2009 14:03, Florian Mittag wrote:
> On Friday 17 July 2009 03:33, Mark Schreiber wrote:
> > Have you considered running a profiler?
>
> Yes, I have considered this, but the profilers I know for Eclipse are a
> pain in the a** and don't work, so I will have to use NetBeans or something
> to do profiling.
>
> I noticed another funny thing:
> When I let our program skip the first two chromosomes (1807 and 24),
> then this is the output:
>
> Jul 17, 2009 1:50:36 PM - FINE: Starting update of chromosome 000023
> dbname: GeneID, raccession: 100132775
>   took 273ms
> dbname: CCDS, raccession: CCDS35344.1
>   took 452ms
> dbname: GeneID, raccession: 644403
>   took 283ms
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOfRange(Arrays.java:3209)
>         at java.lang.String.<init>(String.java:216)
>         at java.lang.StringBuffer.toString(StringBuffer.java:585)
>         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:526)
>         at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:542)
>         at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:473)
>         at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:169)
>
> Harddisk activity is pretty high (probably reading the sequence) and the
> OutOfMemoryError occurs after about 2 minutes.
> It seems like loading the other chromosomes before this one somehow
> changes the behavior.
>
> > Also, are you able to parse that sequence when you don't put it into
> > BioSQL. It could be the parser not the BioSQL binding.
> >
> > - Mark
>
> That's a good idea, I will try this. I don't know if I will have time for
> this today, but I should be able to give an update next week.
>
>
> Florian
>
> > On Thu, Jul 16, 2009 at 11:38 PM, Florian Mittag wrote:
> > > Hi all!
> > >
> > > We try to load Genbank files into our bioseqdb database using BioJava.
> > > I copy-pasted the code together from tutorials and previous posts on
> > > this mailing list. My problems:
> > >
> > > 1) It eats huge amounts of memory, so that I needed to increase the
> > > heap size to 2GB.
> > >
> > > 2) Loading the first two files works great, but the third one ran for
> > > one to two hours without completion. Here is my code:
> > >
> > > --- snip ---
> > > // loop over all downloaded *.gbk files starting with the highest number
> > > System.out.println("Updating chromosome " + chrNo[j] + " ...");
> > >
> > > BufferedReader fileIn = new BufferedReader(new FileReader(localFile));
> > >
> > > tx = session.beginTransaction();
> > > GenbankFormat gf = new GenbankFormat();
> > > SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
> > > RichSequence seq = null;
> > >
> > > gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
> > > seq = listener.makeRichSequence();
> > >
> > > if( seq != null ) {
> > >         // check, if a sequence with this identifier is already in the DB
> > >         Query q = session.createQuery(
> > >                 "select be from BioEntry as be where identifier=:identifier");
> > >         q.setString("identifier",seq.getIdentifier());
> > >         List entries = q.list();
> > >         for( Object o : entries ) {
> > >                 // delete the old sequence in the DB
> > >                 BioEntry oldSeq = (BioEntry)o;
> > >                 session.delete("BioEntry", oldSeq);
> > >         }
> > >         tx.commit();
> > >
> > >         tx = session.beginTransaction();
> > >         session.save("Sequence", seq);
> > >
> > >         System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
> > > } else {
> > >         System.out.println("Chromosome " + chrNo[j] + " was NOT updated.\n");
> > > }
> > >
> > > tx.commit();
> > > --- snap ---
> > >
> > >
> > > This is the generated output:
> > > --- snip ---
> > > Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
> > > Updating chromosome 001807 ...
> > > Chromosome 001807 was updated.
> > > Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
> > > Updating chromosome 000024 ...
> > > Chromosome 000024 was updated.
> > > Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
> > > Updating chromosome 000023 ...
> > > --- snap ---
> > >
> > >
> > > The files for this are downloaded from Genbank and the file sizes are:
> > > NC_001807.gbk   58.4 KB
> > > NC_000024.gbk   70.8 MB
> > > NC_000023.gbk   190.1 MB
> > >
> > > So, I don't see why loading a 70.8 MB file took less than 2 minutes
> > > and a 190.1 MB file isn't completed after 2 hours. During this
> > > time, the CPU load is almost 100% and there is no significant network
> > > or harddisk activity.
> > >
> > > When I paused the program (I'm using Eclipse) and looked at where the
> > > processing power was going, I ended up with the following
> > > stacktrace (sorry for the unreadable format):
> > >
> > > CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
> > > AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolList) line: 1460
> > > SimpleSymbolList(AbstractSymbolList).seqString() line: 102
> > > BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequence) line: 115
> > > BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
> > > SimpleRichSequence(ThinRichSequence).seqString() line: 203
> > > SimpleRichSequence.getStringSequence() line: 77
> > > GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
> > > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > > Method.invoke(Object, Object...) line: 597
> > > BasicPropertyAccessor$BasicGetter.get(Object) line: 145
> > > PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object) line: 249
> > > PojoEntityTuplizer.getPropertyValues(Object) line: 244
> > > JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(Object, EntityMode) line: 3567
> > > DefaultFlushEntityEventListener.getValues(Object, EntityEntry, EntityMode, boolean, SessionImplementor) line: 167
> > > DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
> > > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntities(FlushEvent) line: 196
> > > DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEverythingToExecutions(FlushEvent) line: 76
> > > DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
> > > SessionImpl.autoFlushIfRequired(Set) line: 970
> > > SessionImpl.list(String, QueryParameters) line: 1115
> > > QueryImpl.list() line: 79
> > > QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
> > > GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
> > > DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
> > > Method.invoke(Object, Object...) line: 597
> > > BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
> > > RichObjectFactory.getObject(Class, Object[]) line: 107
> > > GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization, RichSeqIOListener, Namespace) line: 450
> > > UpdateDB_Main.updateChromosome() line: 542
> > >
> > >
> > > Now we go to GenbankFormat.readRichSequence(). It hangs at about line
> > > 450, the line where it loads a CrossRef object, so I added debug
> > > output:
> > >
> > > --- snip ---
> > > // parameter on old feature
> > > if (key.equals("db_xref")) {
> > >         Matcher m = dbxp.matcher(val);
> > >         if (m.matches()) {
> > >                 String dbname = m.group(1);
> > >                 String raccession = m.group(2);
> > >                 if (dbname.equalsIgnoreCase("taxon")) {
> > >                         [...]
> > >                 } else {
> > >                         try {
> > >                                 long starttime = System.currentTimeMillis();
> > >                                 CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[] {dbname, raccession, new Integer(0)});
> > >                                 long duration = System.currentTimeMillis() - starttime;
> > >                                 if( duration > 100 ) {
> > >                                         System.out.println("dbname: " + dbname + ", raccession: " + raccession);
> > >                                         System.out.println("  took " + duration + "ms");
> > >                                 }
> > >                                 RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
> > >                                 rlistener.getCurrentFeature().addRankedCrossRef(rcr);
> > > --- snap ---
> > >
> > > Which leads to:
> > >
> > > --- snip ---
> > > dbname: GeneID, raccession: 677739
> > >   took 3291ms
> > > dbname: HGNC, raccession: 31847
> > >   took 2427ms
> > > dbname: GeneID, raccession: 55344
> > >   took 2932ms
> > > dbname: HGNC, raccession: 23148
> > >   took 2339ms
> > > dbname: GI, raccession: 94158612
> > >   took 2418ms
> > > dbname: GI, raccession: 8922995
> > >   took 2920ms
> > > [...]
> > > --- snap ---
> > >
> > > These are all /db_xref properties of the NC_000023.gbk file. Searching
> > > deeper, it looks like for every CrossRef object loaded, the whole
> > > BioEntry object is built and the sequence parsed. But remember, this
> > > only happens on chromosome 23, not on 24, which has /db_xref, too.
> > >
> > > I already spent some time on this, but I can't figure out what could
> > > be the cause.
> > >
> > >
> > > Thanks
> > >   Florian

From florian.mittag at uni-tuebingen.de Fri Jul 24 17:29:08 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Fri, 24 Jul 2009 19:29:08 +0200
Subject: [Biojava-l] How to parse large Genbank files?
Message-ID: <200907241929.08768.florian.mittag@uni-tuebingen.de>

Hi!

I think this is a problem worthy of its own thread, so I'll start one:

I want to store all human chromosomes in a BioSQL database after loading the
information from .gbk files.
The files I get from NCBI with the following URIs, where the id ranges from
nc_000001 to nc_000024 plus nc_001807:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text

I then try to parse the files as described in
http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
but it won't work. While there are no problems parsing 1807 and 24, chromosome
23 leads to an OutOfMemoryError although I gave it 2GB of heap space.

Here is a stack trace (the line numbers might differ, because I already tried
to improve GenbankFormat.java in memory efficiency):

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
        at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
        at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
        at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
        at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)

The line in GenbankFormat.java is:

rlistener.addSymbols(
        symParser.getAlphabet(),
        (Symbol[])(sl.toList().toArray(new Symbol[0])),
        0, sl.length());

Sometimes it fails at the sl.toList().toArray() part, sometimes it fails later
inside the addSymbols method, but it always fails.

How can this be? I mean, the file is only 190MB in size, so 2GB of memory
should be more than enough.
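
One possible answer to "How can this be?", as a rough back-of-envelope sketch
rather than a measurement: Java stores text as 2-byte chars, and an array of
Symbol objects holds one reference per base, so each in-memory copy of the
sequence is much larger than its share of the file. Everything below is an
assumption made for illustration: a sequence of about 150 million bases, 4-byte
or 8-byte object references depending on the JVM, and four representations
reachable at once (StringBuffer, String, a list of symbol references, and a
Symbol[]). The class and method names are hypothetical and not part of BioJava.

```java
/**
 * Back-of-envelope heap estimate for holding several copies of a sequence.
 * All inputs are assumptions for illustration, not measured values.
 */
final class SequenceMemoryEstimate {

    /** Bytes needed for a char-based buffer (Java chars are 2 bytes each). */
    static long charBufferBytes(long bases) {
        return bases * 2L;
    }

    /** Bytes needed for an array holding one object reference per base. */
    static long referenceArrayBytes(long bases, int bytesPerReference) {
        return bases * bytesPerReference;
    }

    /**
     * Peak estimate if two char-based copies (StringBuffer and String) and
     * two reference-based copies (symbol list and Symbol[]) are all
     * reachable at the same time. Object headers and buffer over-allocation
     * are ignored, so the real figure would be somewhat higher.
     */
    static long peakBytes(long bases, int bytesPerReference) {
        long charCopies = 2 * charBufferBytes(bases);
        long referenceCopies = 2 * referenceArrayBytes(bases, bytesPerReference);
        return charCopies + referenceCopies;
    }

    public static void main(String[] args) {
        long bases = 150_000_000L; // assumed sequence length
        for (int refSize : new int[] {4, 8}) {
            long mb = peakBytes(bases, refSize) / (1024L * 1024L);
            System.out.println(refSize + "-byte references: about " + mb + " MB");
        }
    }
}
```

Under these assumptions the estimate lands between roughly 1.7 and 3 GB, which
would be consistent with a 2GB heap running out even though the file itself is
only 190MB.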
Browsing through the source code, I discovered what I think of as very
inefficient handling of sequences:

1) the sequence string is read from file into a StringBuffer
2) it is converted to a string (with whitespaces removed)
3) a SimpleSymbolList is created out of the string
4) the SymbolList is converted to a List of Symbols
5) the List is converted to an array of Symbols
6) the array is passed to addSymbols
7) there it is added to a ChunkedSymbolListFactory
8) if at some point the sequence is requested, a SymbolList is created and
   then converted to a string.

You see, there is a lot of copying and converting, but in the end I have the
same string I started with. Or rather, I would have it if the process ever
reached the end; it crashes before it completes.

Am I doing something wrong, or is there great potential for improving the
parsing of Genbank files?

Regards,
  Florian

From markjschreiber at gmail.com Sat Jul 25 02:20:14 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 25 Jul 2009 10:20:14 +0800
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To: <200907241929.08768.florian.mittag@uni-tuebingen.de>
References: <200907241929.08768.florian.mittag@uni-tuebingen.de>
Message-ID: <93b45ca50907241920r60c28931p1b43bf6b6a101b46@mail.gmail.com>

Hi-

I don't think anyone has done much, if anything, to optimize these parsers.
The process you outline sounds extremely inefficient. It is also likely to
lead to memory leaks due to the number of copy operations.

As always with Java, don't try to optimize without a profiler, which will
tell you which methods are taking a long time and which objects take the
most memory.

- Mark

From andreas at sdsc.edu Mon Jul 27 03:56:50 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 26 Jul 2009 20:56:50 -0700
Subject: [Biojava-l] Retrieving AUTHOR information from PDB file
In-Reply-To:
References: <59a41c430907082121x23fceb7hdc7d51c861b84c9c@mail.gmail.com>
Message-ID: <59a41c430907262056o1491175an22c4a82ef0d35d11@mail.gmail.com>

Hi Andrew,

the PDBHeader class now contains a field for the authors listed in the
AUTHOR records. This works for PDB and mmCIF files (where the corresponding
field is audit_author). Available from SVN trunk.

Andreas

On Thu, Jul 9, 2009 at 1:38 AM, Andrew Clegg wrote:
> 2009/7/9 Andreas Prlic :
>> Hi Andrew,
>>
>> The PdbFileParser at the present does not process the AUTHOR lines, yet, but
>> should be easy to add.. If you need this urgently, you could quickly patch
>> the PdbFileParser, otherwise I'll add it to SVN in the next couple of
>> days...
>
> No, it's not urgent, I can wait til you have a chance to do it rather
> than trying to figure it out myself :-)
>
> Could you let me know when you've added it in though please?
>
> Many thanks!
>
> Andrew.
>

From florian.mittag at uni-tuebingen.de Mon Jul 27 12:16:33 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Mon, 27 Jul 2009 14:16:33 +0200
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To: <93b45ca50907241920r60c28931p1b43bf6b6a101b46@mail.gmail.com>
References: <200907241929.08768.florian.mittag@uni-tuebingen.de>
	<93b45ca50907241920r60c28931p1b43bf6b6a101b46@mail.gmail.com>
Message-ID: <200907271416.33485.florian.mittag@uni-tuebingen.de>

Hi Mark!

On Saturday, 25. July 2009 04:20, Mark Schreiber wrote:
> I don't think anyone has done much, if anything, to optimize these parsers.
> The process you outline sounds extremely inefficient. It is also likely to
> lead to memory leaks due to the number of copy operations.

I wouldn't necessarily say that it leads to memory leaks, but it definitely
leads to high memory consumption (2GB is not enough for a 200MB file).
Also, my outline of the process is based on only 2 hours of reading the code,
so I actually expected to be corrected on this.
Unfortunately, it seems I did get the right idea and it IS extremely
inefficient.

I mean, I understand that this is a high level of abstraction that might come
in handy in many situations, but it certainly is more of an obstacle in my
specific case.

> As always with Java, don't try to optimize without a profiler, which will
> tell you which methods are taking a long time and which objects take the
> most memory.

I think we should continue this discussion on the biojava-dev list or in a
private conversation, as it will probably get very detailed and technical.

My question to this list again:
Is there a way to achieve my goal of parsing a 200MB Genbank file with the
current biojava version without code changes?

- Florian

--
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091

From markjschreiber at gmail.com Tue Jul 28 03:05:55 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 28 Jul 2009 11:05:55 +0800
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To: <200907271416.33485.florian.mittag@uni-tuebingen.de>
References: <200907241929.08768.florian.mittag@uni-tuebingen.de>
	<93b45ca50907241920r60c28931p1b43bf6b6a101b46@mail.gmail.com>
	<200907271416.33485.florian.mittag@uni-tuebingen.de>
Message-ID: <93b45ca50907272005r7c98c02ycb14e5b000d1aff0@mail.gmail.com>

Hi -

While you maybe can't do it without code changes you can probably do
it within the existing framework.
If you look at the readGenbank() code in RichSequence.IOTools you will find that the BioJava file parsing consists of many pluggable components which are all defined by interfaces. Anything that implements one of those interfaces can be plugged into the parsing frame work. So if you want you can change the Format object to one of your custom design (which implements Format), you can also change the event listeners and the SequenceBuilders. In your case the SequenceBuilder might be something to look at, it sounds like you don't need to create all the extra Sequence objects for every feature so you could modify that part. Also, in the Format objects there are often methods called elideXXX() which let you tell the Format object to skip over bits that you don't want. Finally, I suspect the problem with memory use is that the String, char[], SymbolList, Sequence copying is both inefficient and worse still is probably not releasing resources in a timely fashion. Eg once the parser framework converts a char[] to a SymbolList is probably no longer needs that char[] reference and might be able to null it. Then when memory gets low the GC can clean out all the cruft. If I have a chance I will run a profiler to see what is sucking up the memory (and what can be released) and also see if all that copying is making a significant impact on CPU cycles (if not it's probably more effort than it's worth to change). The memory thing definitely needs to change though. - Mark On Mon, Jul 27, 2009 at 8:16 PM, Florian Mittag wrote: > Hi Mark! > > On Saturday, 25. July 2009 04:20, Mark Schreiber wrote: >> I don't think anyone has done much or anything to optimize these parsers. >> The process you outline sounds extremely inefficient. It is also likely to >> lead to memory leaks due to the number of copy operations. > > I wouldn't necessarily say that it leads to memory leaks, but it definitively > leads to a high memory consumption (2GB are not enough for a 200MB file). 
> Also, my outline of the process is based on only 2 hours of viewing the code, > so actually I expected to be corrected on this. > Unfortunately, it seems like I did get the right idea and it IS extremely > inefficient. > > I mean, I understand that this is a high level of abstraction that might come > in handy in many situations, but it certainly is more of an obstacle in my > specific case. > > >> As always with java, don't try and optimize without a profiler which will >> tell you which methods are taking a long time and which objects take the >> most memory. > > I think we should continue this discussion on the biojava-dev list or in a > private conversation, as it will probably get very detailed and technical. > > > My question to this list again: > Is there a way to achieve my goal of parsing a 200MB Genbank file with the > current biojava version without code changes? > > > - Florian > > > >> On 25 Jul 2009, 1:33 AM, "Florian Mittag" >> wrote: >> >> Hi! >> >> I think this is a problem worth of its own thread, so I'll start one: >> >> I want to store all human chromosomes in a BioSQL database after I loaded >> the >> information from .gbk files. The files I get from NCBI with the following >> URIs, where the id ranges from nc_000001 to nc_000024 plus nc_001804: >> >> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_0 >>00023&rettype=gbwithparts&retmode=text >> >> I then try to parse the files as described in >> http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_fi >>les but it wont work. While there are no problems parsing 1804 and 24, >> chromosome >> 23 leads to a OutOfMemory exception although I gave it 2GB of heap space. >> >> Here is a stack trace (the line numbers might differ, because I already >> tried >> to improve GenbankFormat.java in memory efficiency): >> >> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space >> ? ? ? 
?at >> org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolLis >>tFactory.java:222) at >> org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequ >>enceBuilder.java:256) at >> org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:5 >>35) at >> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader. >>java:110) at >> org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main. >>java:537) at >> org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:46 >>8) at >> org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164) >> >> The line in GenbankFormat.java is: >> >> rlistener.addSymbols( >> ? ? ? ?symParser.getAlphabet(), >> ? ? ? ?(Symbol[])(sl.toList().toArray(new Symbol[0])), >> ? ? ? ?0, sl.length()); >> >> Sometimes it fails at the sl.toList().toArray()-part, sometimes it fails >> later >> inside the addSymbols method, but it always fails. >> >> How can this be? I mean, the file is only 190MB in size, so 2GB of memory >> should be more than enough. Browsing through the source code, I discovered >> what I think of as very inefficient handling of sequences: >> >> 1) the sequence string is read from file into a StringBuffer >> 2) it is converted to a string (with whitespaces removed) >> 3) a SimpleSymbolList is created out of the string >> 4) the SymbolList is converted to a List of Symbols >> 5) the List is converted to an array of Symbols >> 6) the array is passed to addSymbols >> 7) there it is added to a ChunkedSymbolListFactory >> 8) if at some point the sequence is requested, a SymbolList is created and >> then converted to a string. >> >> You see, there is a lot of copying and converting, but in the end I have >> the same string I started with. Well, I had the string, if it ever reached >> the end, because it will crash before completing this process. 
>>
>>
>> Am I doing something wrong or is there a great potential of improving
>> parsing of Genbank files?
>>
>>
>> Regards,
>>   Florian
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Dipl. Inf. Florian Mittag
> Universität Tuebingen
> WSI-RA, Sand 1
> 72076 Tuebingen, Germany
> Phone: +49 7071 / 29 78985  Fax: +49 7071 / 29 5091
>

From sheoran143 at gmail.com Tue Jul 28 03:50:34 2009
From: sheoran143 at gmail.com (Deepak sheoran)
Date: Mon, 27 Jul 2009 22:50:34 -0500
Subject: [Biojava-l] Issues with taxon table not having taxon_id and ncbi_id equal under biosql schema
Message-ID: <4A6E758A.2030200@gmail.com>

Hi,

I am new to BioJava and have built a few applications with it. I am trying
to build a database that keeps itself in sync with NCBI and lags behind by
at most a day. I have almost succeeded, but one problem remains.

When I run my GenBank loader program to upload a GenBank file into a BioSQL
database with an updated taxon table, BioJava inserts a taxon_id that
differs from the ncbi_taxon_id (i.e. taxon_id != ncbi_taxon_id). This
happens when a record in the file has an ncbi_taxon_id that is not in the
taxon table (because that taxon has been replaced by another one in an NCBI
update): BioJava then inserts a record into the taxon table whose taxon_id
is some random sequence value and whose ncbi_taxon_id is the dbxref field
from the file. My problem is that the two are not equal.

Is there any way to force Hibernate or RichSequence to insert the record so
that taxon_id and ncbi_taxon_id are equal in the table?

Thanks,
Deepak Sheoran
North Dakota State University (Student)

From florian.mittag at uni-tuebingen.de Tue Jul 28 12:14:54 2009
From: florian.mittag at uni-tuebingen.de (Florian Mittag)
Date: Tue, 28 Jul 2009 14:14:54 +0200
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To: <93b45ca50907272005r7c98c02ycb14e5b000d1aff0@mail.gmail.com>
References: <200907241929.08768.florian.mittag@uni-tuebingen.de>
    <200907271416.33485.florian.mittag@uni-tuebingen.de>
    <93b45ca50907272005r7c98c02ycb14e5b000d1aff0@mail.gmail.com>
Message-ID: <200907281414.55156.florian.mittag@uni-tuebingen.de>

Hi!

On Tuesday, 28. July 2009 05:05, you wrote:
> While you maybe can't do it without code changes you can probably do
> it within the existing framework. If you look at the readGenbank()
> code in RichSequence.IOTools you will find that the BioJava file
> parsing consists of many pluggable components which are all defined by
> interfaces. Anything that implements one of those interfaces can be
> plugged into the parsing framework. So if you want you can change
> the Format object to one of your custom design (which implements
> Format), you can also change the event listeners and the
> SequenceBuilders. In your case the SequenceBuilder might be something
> to look at, it sounds like you don't need to create all the extra
> Sequence objects for every feature so you could modify that part.

Yeah, I see what you mean. I wanted to start with something simple because I
didn't want to code everything myself, but it seems like I won't get around
it if I want to optimize it.

> Also, in the Format objects there are often methods called elideXXX()
> which let you tell the Format object to skip over bits that you don't
> want.

I think I want everything, since I want to store everything in the BioSQL db
afterwards. I don't think I can skip anything.

> Finally, I suspect the problem with memory use is that the String,
> char[], SymbolList, Sequence copying is both inefficient and worse
> still is probably not releasing resources in a timely fashion. Eg once
> the parser framework converts a char[] to a SymbolList it probably no
> longer needs that char[] reference and might be able to null it. Then
> when memory gets low the GC can clean out all the cruft.
>
> If I have a chance I will run a profiler to see what is sucking up the
> memory (and what can be released) and also see if all that copying is
> making a significant impact on CPU cycles (if not it's probably more
> effort than it's worth to change). The memory thing definitely needs
> to change though.

It turned out our workgroup has a floating JProfiler license, so I did some
tests and got some clues on where to optimize further. The NetBeans profiler
reported that most of the memory was consumed by char[], but it only showed
about 130MB of usage, contrary to the 2GB of heap being full. So our idea was
that maybe the memory management overhead was another sink where all the
memory vanished. JProfiler then returned more plausible results with nearly
800MB used by char[].

I tweaked both the way I call the parser and the GenbankFormat itself, and
now all files except chromosome 1 (300MB) will parse successfully. To reduce
the memory for SymbolLists, I did:

    PackedSymbolListFactory pslf = new PackedSymbolListFactory();
    SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder(pslf);
    GenbankFormat gf = new GenbankFormat();
    gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
    RichSequence seq = listener.makeRichSequence();

The PackedSymbolListFactory seemed to help save some memory, but it still
wasn't enough. I then modified the readSection() method of GenbankFormat.
What it usually does is put each single line of nucleotide sequence into a
String[], which it then puts into the ArrayList returned by the method. Since
there are 60 nucleotides (so 60 bytes + whitespaces) per line, this was a big
array. I modified it to build one large String containing only the nucleotide
characters, instead of returning the array and then having the
readRichSequence() method build this large String.
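
A minimal sketch of the readSection() idea described above, assuming the sequence block looks like GenBank ORIGIN lines (a position number followed by blocks of letters) and ends at a "//" line. OriginSectionReader and readResidues are hypothetical names used for illustration only; this is not BioJava code and not the actual patch:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper for illustration; not part of BioJava. */
final class OriginSectionReader {

    private OriginSectionReader() {}

    /**
     * Reads sequence lines until a "//" terminator (or end of input) and
     * returns only the letters, dropping position numbers and whitespace.
     * No per-line String[] or List is kept around.
     */
    static String readResidues(BufferedReader in, int expectedLength) {
        // Presizing avoids repeated growth of the internal buffer.
        StringBuilder residues = new StringBuilder(Math.max(expectedLength, 16));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("//")) {
                    break; // end of the record
                }
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (Character.isLetter(c)) {
                        residues.append(c);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return residues.toString();
    }

    public static void main(String[] args) {
        String origin =
                "        1 gatcctccat atacaacggt\n"
              + "       21 atctccacct\n"
              + "//\n";
        String seq = readResidues(new BufferedReader(new StringReader(origin)), 30);
        System.out.println(seq.length() + " " + seq);
    }
}
```

If the expected length is known up front (the GenBank LOCUS line states it), passing it in keeps the buffer from being reallocated while reading. Note that toString() still makes one final copy of the characters.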

This all still isn't enough, the program exits at sl.toArray(), so I agree
with Richard here to keep the sequence as a String (maybe use the Symbol(List)
mechanisms to check for invalid characters) and only convert it to Symbol
objects if really necessary.

Btw: Should we move this to Biojava-dev?
And where do I sign up for BioJava3 development? ;-)

- Florian

--
Dipl. Inf. Florian Mittag
Universität Tuebingen
WSI-RA, Sand 1
72076 Tuebingen, Germany
Phone: +49 7071 / 29 78985    Fax: +49 7071 / 29 5091

From holland at eaglegenomics.com Tue Jul 28 12:52:00 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Tue, 28 Jul 2009 13:52:00 +0100
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To: <200907281414.55156.florian.mittag@uni-tuebingen.de>
References: <200907241929.08768.florian.mittag@uni-tuebingen.de>
    <200907271416.33485.florian.mittag@uni-tuebingen.de>
    <93b45ca50907272005r7c98c02ycb14e5b000d1aff0@mail.gmail.com>
    <200907281414.55156.florian.mittag@uni-tuebingen.de>
Message-ID:

> Btw: Should we move this to Biojava-dev?

probably, yes! :)

> And where do I sign up for BioJava3 development? ;-)

Andreas Prlic has the keys to the project these days. BJ3 does already have
some new code in place for handling sequences as strings but it's in an
out-of-the-way bit of the repository and is not part of the main roadmap for
the project at present. The current focus is on modularising the existing
bits, so that individual components can be refactored to behave better at a
future date.

If you want to explore my ideas for a replacement Sequence model, the code
and docs are here (sequence handling is in the 'core' module with
DNA-specifics in the 'dna' module):

http://biojava.org/wiki/BioJava3:HowTo
http://www.biojava.org/wiki/BioJava3_project

(Methods such as file parsers would request Strings (or ideally CharSequence
- more flexible, and String extends it) as parameters whenever they don't
care about content - if they care about content but don't care in advance
about size or random access then they should request Iterator which can be
used to wrap a String and parse on demand, and if they need full
functionality then they should request List which the default implementation
of uses ArrayLists but there's no reason a String-backed one couldn't be
written as well).

cheers,
Richard

From andreas at sdsc.edu Tue Jul 28 17:28:30 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 28 Jul 2009 10:28:30 -0700
Subject: [Biojava-l] How to parse large Genbank files?
In-Reply-To:
References: <200907241929.08768.florian.mittag@uni-tuebingen.de>
    <200907271416.33485.florian.mittag@uni-tuebingen.de>
    <93b45ca50907272005r7c98c02ycb14e5b000d1aff0@mail.gmail.com>
    <200907281414.55156.florian.mittag@uni-tuebingen.de>
Message-ID: <59a41c430907281028j5bc42c26p86fe64dd0dad14bb@mail.gmail.com>

>> And where do I sign up for BioJava3 development? ;-)

I think we still need a module lead for the biojava-biosql module...
you are welcome to volunteer!

> Andreas Prlic has the keys to the project these days. BJ3 does already have
> some new code in place for handling sequences as strings but it's in an
> out-of-the-way bit of the repository and is not part of the main roadmap for
> the project at present. The current focus is on modularising the existing
> bits, so that individual components can be refactored to behave better at a
> future date.

Just to add to this: I think the BJ3 and sequence related work would fit
nicely as a new biojava-sequence module...

Andreas

From bernd.jagla at pasteur.fr Wed Jul 29 13:16:04 2009
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Wed, 29 Jul 2009 15:16:04 +0200
Subject: [Biojava-l] blast parsing question
Message-ID: <3DD69071F4A8490D9D1D7EEB172FEFC0@zillumina>

Hi,

I am new to BioJava. I want to test what is going on here in order to
potentially integrate it with KNIME.

My first project is parsing BLAST output for large files. The example in the
codebook is very good and I had no problems integrating everything in Eclipse
and getting it to work.

Now here is my problem:

I am interested in parsing the summary table at the beginning of the BLAST
output, and I haven't found a way to get at this information.

I am blasting short sequences (20nt - 300nt) against genomic databases
(mouse/human/refseq/miRBase). I want to know if a given sequence (out of a
set of sequences) aligns to a specific genome with high identity. I then want
to separate the input source FASTA file into a set that aligns to the genome
and one that doesn't (and potentially a third list of dubious sequences where
there is no clear answer). For this I only need the length of the query
sequence, the score, and the first few characters of the header line. At
least that's the way I am currently doing it. I have set the BLAST parameters
to only give me the first alignment, but the first 50 or so in the summary.

Any help or comments are appreciated.

Thanks,

Bernd

Bernd Jagla
Bioinformatician
Institut Pasteur
Plate-forme puces a ADN
Genopole / Institut Pasteur
28 rue du Docteur Roux
75724 Paris Cedex 15
France
bernd.jagla at pasteur.fr
tel: +33 (0) 140 61 35 13

From andreas at sdsc.edu Wed Jul 29 19:00:25 2009
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 29 Jul 2009 12:00:25 -0700
Subject: [Biojava-l] FASTQ in BioJava, BioRuby (and Biopython, BioPerl & EMBOSS)
In-Reply-To: <320fb6e00907290310k16b78e72iae34f01de680ca76@mail.gmail.com>
References: <320fb6e00907290310k16b78e72iae34f01de680ca76@mail.gmail.com>
Message-ID: <59a41c430907291200o3ace6faj4b3455b8f3237cad@mail.gmail.com>

Hi Peter,

I would be happy to have support for the FASTQ file format in BioJava. We
have had an increased number of requests for parsing the output of sequencing
machines in the last weeks, but nobody has stepped up to be module lead for
this as of yet. I am currently not working with sequencers myself, so I can't
really provide support for this. I am happy to help anybody who wants to be
module lead for this to get this going.

Andreas

On Wed, Jul 29, 2009 at 3:10 AM, Peter Cock wrote:
> Dear Andreas (& Richard) and Goto-san,
>
> Are BioJava or BioRuby interested in supporting the FASTQ file format
> used in next generation sequencing for storing sequencing reads with
> associated quality scores?
>
> I have been working on FASTQ support in Biopython, and coordinating
> with Peter Rice at EMBOSS and Chris Fields at BioPerl to ensure we
> are consistent on our interpretation of these files, interconversion, and
> naming. We'd like to get BioJava and BioRuby involved too.
>
> Could you (or whoever at BioJava/BioRuby would be doing
> FASTQ code) please sign up to the cross-project OBF mailing list?
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>
> Thank you,
>
> Peter
> --
> Dr Peter Cock, Biopython project
>
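
For readers unfamiliar with the format under discussion, here is a minimal sketch of reading FASTQ records. It assumes the simple unwrapped four-line layout ("@id", bases, "+", qualities) and Sanger-style quality encoding (ASCII code minus 33); other FASTQ variants use a different offset, which this sketch does not handle. FastqSketch and Read are hypothetical names, not a BioJava, Biopython, or BioPerl API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; these names are not part of BioJava. */
final class FastqSketch {

    /** One read: identifier, bases, and the raw quality string. */
    static final class Read {
        final String id;
        final String bases;
        final String quality;

        Read(String id, String bases, String quality) {
            this.id = id;
            this.bases = bases;
            this.quality = quality;
        }

        /** Decodes qualities assuming Sanger-style encoding (offset 33). */
        int[] phredScores() {
            int[] scores = new int[quality.length()];
            for (int i = 0; i < scores.length; i++) {
                scores[i] = quality.charAt(i) - 33;
            }
            return scores;
        }
    }

    private FastqSketch() {}

    /** Parses unwrapped four-line records; rejects anything else. */
    static List<Read> parse(BufferedReader in) {
        List<Read> reads = new ArrayList<Read>();
        try {
            String header;
            while ((header = in.readLine()) != null) {
                if (header.isEmpty()) {
                    continue; // tolerate blank lines between records
                }
                String bases = in.readLine();
                String separator = in.readLine();
                String quality = in.readLine();
                if (!header.startsWith("@") || bases == null || separator == null
                        || quality == null || !separator.startsWith("+")
                        || quality.length() != bases.length()) {
                    throw new IllegalArgumentException(
                            "Malformed FASTQ record near: " + header);
                }
                reads.add(new Read(header.substring(1), bases, quality));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reads;
    }

    public static void main(String[] args) {
        String fastq = "@read1\nACGT\n+\nII5!\n";
        for (Read read : parse(new BufferedReader(new StringReader(fastq)))) {
            System.out.println(read.id + " " + read.bases + " "
                    + Arrays.toString(read.phredScores()));
        }
    }
}
```

Keeping the raw quality string and decoding on demand leaves the choice of offset to the caller, which is where the different FASTQ variants diverge.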