From yingqin at rti.org Mon Dec 1 12:05:12 2003
From: yingqin at rti.org (Qin, Ying)
Date: Mon Dec 1 18:04:02 2003
Subject: [Biojava-l] GenBank parsing problem
Message-ID:

Skipped content of type multipart/alternative

From mark.schreiber at agresearch.co.nz Mon Dec 1 20:18:50 2003
From: mark.schreiber at agresearch.co.nz (Schreiber, Mark)
Date: Mon Dec 1 20:25:14 2003
Subject: [Biojava-l] Announce: BioJava1.3.1 released
Message-ID:

Hi All -

The biojava 1.3.1 release has finally been posted to the biojava website and is obtainable from http://biojava.org/download/

This represents an interim release that is partway between biojava1.3 and biojava-live (from which the biojava1.4 release will be built).

Your xmas present has arrived early this year :)

This release features:

* lots of bug fixes from the main branch
* Faster GUI rendering code
* Some new GUI code for circular sequence rendering
* Lots of bug fixes to improve SeqIO from flatfiles, esp the protein based sequences
* Improved Serialization support (esp for Protein based sequences)
* LSID naming of core symbols to allow unambiguous deserialization
* Bug fixes to HMM/DP code that prevent continuous loops when training profile HMMs
* Improved CircularLocation API
* Other stuff I have forgotten

One nice thing is that there have been very few changes or breaks to the API. The only thing that a few people may notice is that SimpleSequence and ViewSequence are now in org.biojava.bio.seq.impl. The recommended way of accessing these from now on is to use SequenceTools. This is in line with what happens in biojava-live.

Another change that almost no-one will notice unless they provide custom alphabets via XML files is that Symbols named in these files must now use a full LSID as their name (see AlphabetManager.xml for examples).

If you make these changes everything else will work fine.

Known Issues:

* At the time of writing the output of Sequences in SwissProt and GenPept format is weak
* Biojava 1.3.1 only supports the Capetown version of the OBDA protocols. If you need more up to date support use biojava-live
* There's bound to be other stuff so make sure you test your programs!

Details:

Java Docs are at http://biojava.org/download/docs/ in both tar.gz and zip format
Binaries are at http://biojava.org/download/binaries/ in jar format
Source is at http://biojava.org/download/source/ in both tar.gz and zip format

Future Releases:

At this stage the future of the biojava 1.3.x branch is unclear. It will probably pick up a few more bugfixes but it is increasingly hard to incorporate everything from the main branch, especially parts that have dependencies on new functionality such as the ontology package. My personal preference is to vastly increase the coverage of the unit tests in biojava-live and aim to have regular stable releases from the live branch (in line with the XP approach to programming). We'll see what happens. Anyhow there will probably be a 1.3.2 release at least before a 1.4 arrives.

Finally:

A big thanks to everyone who contributed code and told us when it didn't work. A lot of the bug fixes came about due to an increasing number of people using parts of the API that don't often get used in anger. This is encouraging as it means people actually seem to be interested and find biojava useful and are prepared to help identify and often help to fix bugs.

Enjoy!

- Mark

=======================================================================
Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material.
Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately.
=======================================================================

From jan.wuerthner at uni-duesseldorf.de Tue Dec 2 03:52:03 2003
From: jan.wuerthner at uni-duesseldorf.de (Jan Würthner)
Date: Tue Dec 2 03:58:37 2003
Subject: [Biojava-l] SeqSimilaritySearchSubHit - Strand information
Message-ID: <200312020952.03142.jan.wuerthner@uni-duesseldorf.de>

Hi folks,

I'm constructing SeqSimilaritySearchSubHit instances from xml formatted NCBI BLAST results, and I'm getting steadily confused with the query's and subject's from and to information on one hand and the query's and subject's strand on the other hand.

The NCBI returns for example:

<Hsp_query-from>576</Hsp_query-from>
<Hsp_query-to>229</Hsp_query-to>
<Hsp_query-frame>1</Hsp_query-frame>

<Hsp_hit-from>12374053</Hsp_hit-from>
<Hsp_hit-to>12374401</Hsp_hit-to>
<Hsp_hit-frame>-1</Hsp_hit-frame>

I'd think that the possibility to assign the from- and to-values in different orders (like descending in this query) already includes the information about the direction (POSITIVE/NEGATIVE). Why is there an additional "frame" value, and why is the query's frame value set to +1, and the subject's (=hit's) value set to -1? I assumed it to be assigned vice versa.

My question is: How shall I set the SeqSimilaritySearchSubHit instance's query/subject values from these data?

Having this answered will be of much help!

Thank you
Jan

--
Jan Würthner
Institute for Medical Microbiology
Building 22.21
Heinrich-Heine-University
Universitätsstraße 1
40225 Duesseldorf

Tel. +49 (0) 211 81 12461
URL: www.medmikro.uni-duesseldorf.de

From: Stephen_Bobick at rosettabio.com (Bobick, Stephen)
Subject: BLAST DTD (was RE: [Biojava-l] SeqSimilaritySearchSubHit -Strand information)
Message-ID: <200312022148.hB2Lmrg0014894@portal.open-bio.org>

Greetings,

I'm afraid I will not be answering the poster here, but the message caught my curiosity and prompted me to take a peek at the BLAST DTD, and subsequently post this commentary. My question is how was the BLAST DTD designed and under what standards? I find the choice of element names to be unfortunate. In comparing to standard XML naming and DTD design I would expect something like:

Rather than the following:

<Hsp_query-from>576</Hsp_query-from>
<Hsp_query-to>229</Hsp_query-to>
<Hsp_query-frame>1</Hsp_query-frame>

The two primary differences are in capitalization, and the choice of attributes rather than separate elements for each datum in this excerpt. As a consequence, the "expected" form is more succinct. From the DTD I see the latter naming and element/attribute choice is repeated many times.

I will add an admission that I have not worked with BLAST results in several years, as my focus has been on data management software (LIMS) and, more recently, analysis software. Still, as a professional in the greater bioinformatics community, who works daily with XML, I do like to see an incorporation of good practices from the "pure" software development community.

Comments?

Stephen Bobick

_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l

From mark.schreiber at agresearch.co.nz Tue Dec 2 17:09:49 2003
From: mark.schreiber at agresearch.co.nz (Schreiber, Mark)
Date: Tue Dec 2 17:16:11 2003
Subject: BLAST DTD (was RE: [Biojava-l] SeqSimilaritySearchSubHit -Strand information)
Message-ID:

Hi -

I suspect you hit the nail on the head when you asked how it was designed and under what standards. My guess at the answers would be, it wasn't and none, respectively.

It's pretty funny if you put the DTD into a tool that automatically makes JAXB style bindings. You end up with millions of objects each of which contain a single piece of data. It would have been better to do it the way you suggested.

For a long time the DTD didn't actually validate what was being produced either so I guess we should be glad it actually works now.

Anyhow, that's enough ranting from me.

- Mark

From mes5k at cs.virginia.edu Tue Dec 2 17:37:02 2003
From: mes5k at cs.virginia.edu (Michael E. Smoot)
Date: Tue Dec 2 17:43:25 2003
Subject: BLAST DTD (was RE: [Biojava-l] SeqSimilaritySearchSubHit - Strand information)
In-Reply-To: <200312022148.hB2Lmrg0014894@portal.open-bio.org>
References: <200312022148.hB2Lmrg0014894@portal.open-bio.org>
Message-ID:

This page explains how the DTDs were created:

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/ncbixml.txt

The short version is that the DTDs are transliterations of their ASN.1 data models.

Mike

From Stephen_Bobick at rosettabio.com Tue Dec 2 17:40:33 2003
From: Stephen_Bobick at rosettabio.com (Bobick, Stephen)
Date: Tue Dec 2 17:45:31 2003
Subject: BLAST DTD (was RE: [Biojava-l] SeqSimilaritySearchSubHit -Strand information)
Message-ID: <200312022245.hB2MjTg0015675@portal.open-bio.org>

Mark writes:

>I suspect you hit the nail on the head when you asked how it was designed and under what
>standards. My guess at the answers would be, it wasn't and none, respectively.

That's too bad. Merely submitting an RFC to a wider audience (like this one for example) could have caught these issues quickly, and resulted in a better-designed DTD.

>It's pretty funny if you put the DTD into a tool that automatically makes JAXB style bindings.
>You end up with millions of objects each of which contain a single piece of data. It would have
>been better to do it the way you suggested.

I would even go further at this point and suggest encoding the validation rules in an XML schema...

>For a long time the DTD didn't actually validate what was being produced either so I guess we
>should be glad it actually works now.

Since the parser validates XML... do you mean that the DTD did not agree with the emitted XML period? Ouch.

Stephen Bobick

From Stephen_Bobick at rosettabio.com Tue Dec 2 17:57:38 2003
From: Stephen_Bobick at rosettabio.com (Bobick, Stephen)
Date: Tue Dec 2 18:02:47 2003
Subject: BLAST DTD (was RE: [Biojava-l] SeqSimilaritySearchSubHit - Strand information)
Message-ID: <200312022302.hB2N2ig0015805@portal.open-bio.org>

Interesting read.
There are two sections worthy of comment:

>NCBI is not proposing a new data model, but is simply transliterating
>the data model we have used for the last decade into a different language for the
>convenience of our users. ASN.1 has a number of specific data types such as INTEGER
>or REAL numbers while XML has only strings, so our DTD automatically adds some
>ENTITY definitions at the top which maps these numbers to strings. This mapping only
>allows humans that read the DTD to see where numbers are expected; an XML validator
>will not care what is there.

Use of an XML Schema would allow the enforcement of data types.

>Summary:
>While the effect of Roles, Scope, and Alternate Forms results in extensive
>tags in the XML, it does accurately reflect the structure and use of the data. It allows
>XML programs to capture as little or as much of the full data structure as they wish.

I guess I fail to see the point of all this. How would a structure resulting from the suggestions that I propose be "lossy" in any way?

Stephen Bobick

From sohrab at bioinformatics.ubc.ca Tue Dec 2 19:20:17 2003
From: sohrab at bioinformatics.ubc.ca (Sohrab Shah)
Date: Tue Dec 2 19:26:38 2003
Subject: [Biojava-l] referencing BioJava
Message-ID: <3FCD2C41.208@bioinformatics.ubc.ca>

Hi.

Apologies if this has been posted to the list before, but what is the preferred way to cite BioJava in a manuscript?

Thanks,
Sohrab

--
| Sohrab Shah                    sohrab@bioinformatics.ubc.ca |
| UBC Bioinformatics Centre      Tel: 604.875-3869            |
| University of British Columbia Fax: 604.875.3840            |
| Vancouver, BC Canada V5Z 4H4   bioinformatics.ubc.ca        |

From mark.schreiber at agresearch.co.nz Tue Dec 2 19:26:31 2003
From: mark.schreiber at agresearch.co.nz (Schreiber, Mark)
Date: Tue Dec 2 19:32:57 2003
Subject: [Biojava-l] referencing BioJava
Message-ID:

Hi -

As there is currently no publication, the best way would be to give the web address and the version you are using.

- Mark

From ambesi at tigem.it Tue Dec 2 11:49:27 2003
From: ambesi at tigem.it (Alberto Ambesi)
Date: Tue Dec 2 21:56:00 2003
Subject: [Biojava-l] issue in class Distribution
Message-ID: <7E123FAF-24E7-11D8-B482-000A958EE60A@tigem.it>

hi, I found this bug when using Distributions iteratively many times. The problem is that when creating Distribution objects iteratively, the computation time for each iteration increases with time.

this is a piece of code that demonstrates the issue:

public class DistributionTest {
    public static void main(String[] args) throws Exception{
        long timePoint = System.currentTimeMillis();
        for (int i=0; i<2500; i++) {
            Map map = new HashMap();
            map.put("seq0", DNATools.createDNA("aggag"));
            map.put("seq1", DNATools.createDNA("aggaa"));
            map.put("seq2", DNATools.createDNA("aggag"));
            map.put("seq3", DNATools.createDNA("aagag"));
            Alignment align = new SimpleAlignment(map);
            Distribution[] dists = distOverAlignment2(align, false, 0.01);
            long previousPoint = timePoint;
            timePoint = System.currentTimeMillis();
            System.out.println(timePoint - previousPoint);
        }
    }
}

if I plot the output, I see that the computation time of each cycle increases with time. This makes this class unusable for my purpose.

Thank you for addressing this issue.

Alberto Ambesi

From david.huen at ntlworld.com Wed Dec 3 03:55:39 2003
From: david.huen at ntlworld.com (David Huen)
Date: Wed Dec 3 04:01:56 2003
Subject: [Biojava-l] issue in class Distribution
In-Reply-To: <7E123FAF-24E7-11D8-B482-000A958EE60A@tigem.it>
References: <7E123FAF-24E7-11D8-B482-000A958EE60A@tigem.it>
Message-ID: <200312030855.39762.david.huen@ntlworld.com>

On Tuesday 02 Dec 2003 4:49 pm, Alberto Ambesi wrote:
> hi, I found this bug when using Distributions iteratively many times.
> The problem is that when creating Distribution objects iteratively
> computation time for each iteration increases with time.
>
> this is a piece of code that demonstrates the issue:
>
> public class DistributionTest {
>     public static void main(String[] args) throws Exception{
>         long timePoint = System.currentTimeMillis();
>         for (int i=0; i<2500; i++) {
>             Map map = new HashMap();
>             map.put("seq0", DNATools.createDNA("aggag"));
>             map.put("seq1", DNATools.createDNA("aggaa"));
>             map.put("seq2", DNATools.createDNA("aggag"));
>             map.put("seq3", DNATools.createDNA("aagag"));
>             Alignment align = new SimpleAlignment(map);
>             Distribution[] dists = distOverAlignment2(align, false, 0.01);
>             long previousPoint = timePoint;
>             timePoint = System.currentTimeMillis();
>             System.out.println(timePoint - previousPoint);
>         }
>     }
> }
>

Could you provide the source to distOverAlignment2 so it becomes clear what it is doing?

Thanks,
David Huen

From ambesi at tigem.it Wed Dec 3 05:00:30 2003
From: ambesi at tigem.it (Alberto Ambesi)
Date: Wed Dec 3 05:06:49 2003
Subject: [Biojava-l] issue in class Distribution
In-Reply-To: <200312030855.39762.david.huen@ntlworld.com>
References: <7E123FAF-24E7-11D8-B482-000A958EE60A@tigem.it> <200312030855.39762.david.huen@ntlworld.com>
Message-ID: <8744351F-2577-11D8-979F-000A959B8596@tigem.it>

I apologize, the line:

Distribution[] dists = distOverAlignment2(align, false, 0.01);

should actually be:

Distribution[] dists = DistributionTools.distOverAlignment(align, false, 0.01);

Thank you.

Alberto Ambesi

From jan.wuerthner at uni-duesseldorf.de Wed Dec 3 08:36:38 2003
From: jan.wuerthner at uni-duesseldorf.de (Jan Würthner)
Date: Wed Dec 3 08:42:41 2003
Subject: BLAST DTD (was RE: [Biojava-l] SeqSimilaritySearchSubHit -Strand information)
In-Reply-To: <200312022302.hB2N2ig0015805@portal.open-bio.org>
References: <200312022302.hB2N2ig0015805@portal.open-bio.org>
Message-ID: <200312031436.38897.jan.wuerthner@uni-duesseldorf.de>

Hi folks,

I have now received the answer from blast-help@ncbi.nlm.nih.gov (see below). For my purposes, I conclude that I don't need the "frame" value, especially since I use "blastn" as a program. It seems safe to construct the (SeqSimilaritySearchSubHit's) query- and subject-strand values from the way the from- and to-values are ordered (ascending or descending).

Jan

answer from blast-help:
-------8<----------------------------------------------
In our blast result, "Frame" refers to the translation orientation and frame since there are 6 possible ones with three from each strand. Their assigned values are +1, +2, +3, -1, -2, and -3. This is only relevant if query/db translation is involved (blastx, tblastn, tblastx).

Since blast only reports local alignments, one may see multiple Frames with the same value mentioned, which may or may not cover the same area of the query or subject.

One may be able to derive this using additional calculation from the from and to field along with the sequence length. However, BLAST calculates this out and presents it in a more straightforward manner. It is up to the user whether to use it or not.
-------------------------------->8---------------------

--
Jan Würthner
Institute for Medical Microbiology
Building 22.21
Heinrich-Heine-University
Universitätsstraße 1
40225 Duesseldorf

Tel. +49 (0) 211 81 12461
URL: www.medmikro.uni-duesseldorf.de

From mark.schreiber at agresearch.co.nz Wed Dec 3 16:03:59 2003
From: mark.schreiber at agresearch.co.nz (Schreiber, Mark)
Date: Wed Dec 3 16:10:34 2003
Subject: [Biojava-l] issue in class Distribution
Message-ID:

Hi -

Below is the method that is being called (copied from DistributionTools). I cannot see anything obvious such as a memory leak but there could be one, can anybody else spot anything? Maybe if someone has a fancy profiling utility they could try it and let us know where the slowdown is occurring.

- Mark

public static final Distribution[] distOverAlignment(Alignment a,
                                                     boolean countGaps,
                                                     double nullWeight)
throws IllegalAlphabetException {

  List seqs = a.getLabels();

  FiniteAlphabet alpha = (FiniteAlphabet)((SymbolList)a.symbolListForLabel(seqs.get(0))).getAlphabet();
  for(int i = 1; i < seqs.size();i++){
    FiniteAlphabet test = (FiniteAlphabet)((SymbolList)a.symbolListForLabel(seqs.get(i))).getAlphabet();
    if(test != alpha){
      throw new IllegalAlphabetException("Cannot Calculate distOverAlignment() for alignments with"+
      "mixed alphabets");
    }
  }

  Distribution[] pos = new Distribution[a.length()];
  DistributionTrainerContext dtc = new SimpleDistributionTrainerContext();
  dtc.setNullModelWeight(nullWeight);

  try{
    for(int i = 0; i < a.length(); i++){// For each position
      pos[i] = DistributionFactory.DEFAULT.createDistribution(alpha);
      dtc.registerDistribution(pos[i]);

      for(Iterator j = seqs.iterator(); j.hasNext();){// of each sequence
        Object seqLabel = j.next();
        Symbol s = a.symbolAt(seqLabel,i + 1);

        /*If this is working over a flexible alignment there is a possibility
        that s could be null if this Sequence is not really present in this
        region of the Alignment. In this case it will be skipped*/
        if(s == null)
          continue;

        Symbol gap = alpha.getGapSymbol();
        if(countGaps == false && s.equals(gap)){
          //do nothing, not counting gaps
        }else{
          dtc.addCount(pos[i],s,1.0);// count the symbol
        }
      }
    }
    dtc.train();
  }catch(Exception e){
    e.printStackTrace(System.err);
  }
  return pos;
}

From fpepin at cs.mcgill.ca Wed Dec 3 22:55:52 2003
From: fpepin at cs.mcgill.ca (Francois Pepin)
Date: Wed Dec 3 23:01:54 2003
Subject: [Biojava-l] problem with AnnotationBuilder?
Message-ID: <009401c3ba1a$84025fd0$6401a8c0@hermes>

Hi everyone,

I think that there might be a problem with AnnotationBuilder (parsing off Kegg for the curious).

With the following parser:

LineSplitParser tvp = new LineSplitParser();
tvp.setEndOfRecord("///");
tvp.setSplitOffset(12);
tvp.setContinueOnEmptyTag(true);
tvp.setTrimTag(true);
tvp.setTrimValue(false);
tvp.setMergeSameTag(true);

The simplest of AnnotationBuilder:

AnnotationBuilder tvl=new AnnotationBuilder(AnnotationType.ANY);

and the following text (among others):

NAME        aldehyde dehydrogenase (NAD)
            CoA-independent aldehyde dehydrogenase
            m-methylbenzaldehyde dehydrogenase
            NAD-aldehyde dehydrogenase
            NAD-dependent 4-hydroxynonenal dehydrogenase
            NAD-dependent aldehyde dehydrogenase
            NAD-linked aldehyde dehydrogenase
            propionaldehyde dehydrogenase

I end up having the following value when printing the Annotation:

NAME=propionaldehyde dehydrogenase.

Echo() shows that everything is being read properly:
1 NAME {
2 aldehyde dehydrogenase (NAD)
2 CoA-independent aldehyde dehydrogenase
2 m-methylbenzaldehyde dehydrogenase
2 NAD-aldehyde dehydrogenase
2 NAD-dependent 4-hydroxynonenal dehydrogenase
2 NAD-dependent aldehyde dehydrogenase
2 NAD-linked aldehyde dehydrogenase
2 propionaldehyde dehydrogenase
1 }

I added the following code (the lines marked *) to AnnotationBuilder (from line 133) to check whether the values really are being overwritten:

public void value(TagValueContext ctxt, Object value) {
    try {
      Frame top = peek(annotationStack);

*     if (top.annotation.containsProperty(top.tag))
*       System.out.println("replacing"+
*         top.annotation.getProperty(top.tag)+ " by "+value);

      top.type.setProperty(top.annotation, top.tag, value);
    } catch (ChangeVetoException cve) {
      throw new AssertionFailure(cve);
    }
  }

This gives us the very interesting output:
replacing aldehyde dehydrogenase (NAD) by CoA-independent aldehyde dehydrogenase
replacing CoA-independent aldehyde dehydrogenase by m-methylbenzaldehyde dehydrogenase
replacing m-methylbenzaldehyde dehydrogenase by NAD-aldehyde dehydrogenase
replacing NAD-aldehyde dehydrogenase by NAD-dependent 4-hydroxynonenal dehydrogenase
replacing NAD-dependent 4-hydroxynonenal dehydrogenase by NAD-dependent aldehyde dehydrogenase
replacing NAD-dependent aldehyde dehydrogenase by NAD-linked aldehyde dehydrogenase
replacing NAD-linked aldehyde dehydrogenase by propionaldehyde dehydrogenase

Basically every line gets overwritten, so only the last one remains at the end.

Any ideas how this could be fixed, or did I do something stupid somewhere?

Thanks,

Francois

From yingqin at rti.org Wed Dec 3 23:01:46 2003
From: yingqin at rti.org (Qin, Ying)
Date: Wed Dec 3 23:08:00 2003
Subject: [Biojava-l] Help! GenBank parsing problem
Message-ID:

Hi all,

I am trying to turn a GenBank file into a Sequence object using the SequenceIterator interface in the org.biojava.bio.seq package. However, one line in the file cannot be parsed.
The problem is that the information after that unparsable line in the GenBank file is missing from the returned Sequence object. Is there any way to avoid this? Your help is highly appreciated.

Thanks,
Ying

From matthew_pocock at yahoo.co.uk Mon Dec 8 07:58:04 2003
From: matthew_pocock at yahoo.co.uk (Matthew Pocock)
Date: Mon Dec 8 08:17:35 2003
Subject: [Biojava-l] problem with AnnotationBuilder?
In-Reply-To: <009401c3ba1a$84025fd0$6401a8c0@hermes>
References: <009401c3ba1a$84025fd0$6401a8c0@hermes>
Message-ID: <3FD4755C.6080501@yahoo.co.uk>

Hi Francois,

The documentation here is not good. The default behavior of the annotation builder is to simply replace the properties as it goes. If you want list-building behavior, then you can configure this. For example, create a new AnnotationType where you explicitly set the type of NAME to be an unbounded list of strings.

AnnotationType annT = new AnnotationType.Impl();
PropertyConstraint c_string = new PropertyConstraint.ByClass(String.class);
annT.setConstraints("NAME", c_string, CardinalityConstraint.ONE_OR_MORE);

Now, when top.type.setProperty() is invoked, it will end up routing each value to an object that knows that the values in the "NAME" slot are members of a list, so each value will be appended. For examples, look inside org.biojava.bio.program.formats, and if you are brave look at the definition of AnnotationType.setProperty() and the things it may defer to.

Matthew

From ecky.l at gmx.de Mon Dec 8 20:58:43 2003
From: ecky.l at gmx.de (Eckhard Lehmann)
Date: Mon Dec 8 21:04:48 2003
Subject: [Biojava-l] indexdb.Record implementation
Message-ID: <17231.1070935123@www56.gmx.net>

Hi,

I need to create a quick & dirty database from a Genbank file, which contains many entries. It seems that I can do this with org.biojava.bio.program.indexdb.IndexTools.indexGenbank(...). But how can I get the entries by ID, once the index is created?
It seems that the following works:

BioStore bst = new BioStore(new java.io.File("/path/to/indexdir"), false);
Record rec = bst.get("id_of_genbank_entry");

But Record is an interface, and so it lacks the implementation I would like to have (one that reads out the desired Genbank entry and, e.g., returns it as a Sequence object).

Are there implementations somewhere in biojava-1.30 for processing these standard Records? Or is there another way to do it, without having to extract the record from the file by parsing the byte-oriented RAF that one can get from rec.getFile()?

Thanks in advance for any help,
Eckhard ;)

From taoxu at bioinformatics.ubc.ca Mon Dec 8 21:01:36 2003
From: taoxu at bioinformatics.ubc.ca (Tao Xu)
Date: Mon Dec 8 21:08:09 2003
Subject: [Biojava-l] How to create a SymbolList with a String that contains illegal Char
Message-ID: <4f8a855961.559614f8a8@cmmt.ubc.ca>

Hi there,

Does anyone know how to create a SymbolList from a String that contains illegal symbols?

I encountered an IllegalSymbolException when I tried to retrieve sequences from a sequence database. The sequence that gave me trouble was a RefSeq sequence, accession number NT_039621, Mus musculus chromosome 15 genomic contig. I first used DNATools.createDNA(String dna) and got an IllegalSymbolException indicating that there was at least one 'u' in the sequence. I then used NucleotideTools.createNucleotide(String nucleotide); this time the 'u' did not cause any problem, but I still got an IllegalSymbolException indicating that there was an 'l' in the sequence.

I am afraid there must be lots of illegal symbols in GenBank's sequences, so I am wondering if there is a way to create an error-tolerant SymbolList object.
If not, I am afraid I will have to create an Alphabet object containing Symbols that cover every char in Java, use this Alphabet to create a CharacterTokenization with the CharacterTokenization(Alphabet alpha, boolean caseSensitive) constructor, and then pass the resulting CharacterTokenization to SimpleSymbolList(SymbolTokenization st, String seqString) to get a SimpleSymbolList object. I guess there must be a better way in Biojava to do this. Your help is highly appreciated.

If I have to create an Alphabet that covers every char in Java, how can I do it? I originally thought that merging the NUCLEOTIDE and PROTEIN Alphabets into a new Alphabet would cover all the Symbols in GenBank sequences, but I noticed there is no method to merge two Alphabets in AlphabetManager. Is there a way to merge two Alphabets? If not, it is probably worth implementing one. It would be useful not only for handling the illegal symbols that exist in the databases, but also for other applications, such as using non-standard symbols to generate a blastable MSBlast database.

Thanks a lot for your help.

Regards,

Tao

From verhoeff2 at gis.a-star.edu.sg Mon Dec 8 21:25:55 2003
From: verhoeff2 at gis.a-star.edu.sg (VERHOEF Frans)
Date: Mon Dec 8 21:34:38 2003
Subject: [Biojava-l] How to create a SymbolList with a String that containsillegal Char
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D560B06A4@BIONIC.biopolis.one-north.com>

Hi Tao,

Am I right that you want to read in GenBank data? You might want to take a look at this page of BioJava in Anger:

http://www.biojava.org/docs/bj_in_anger/ReadingGES.htm

This page describes how to read in sequence data from GenBank. I hope this helps.

Regards
Frans

From david.huen at ntlworld.com Tue Dec 9 02:59:55 2003
From: david.huen at ntlworld.com (David Huen)
Date: Tue Dec 9 03:06:02 2003
Subject: [Biojava-l] How to create a SymbolList with a String that contains illegal Char
In-Reply-To: <4f8a855961.559614f8a8@cmmt.ubc.ca>
References: <4f8a855961.559614f8a8@cmmt.ubc.ca>
Message-ID: <200312090759.55892.david.huen@ntlworld.com>

I think the problem you are encountering arises because the sequence you are reading is an RNA sequence. So the "u" and "i" are uracil and inosine respectively, and are therefore correctly illegal for a DNA sequence.

You will probably have much greater happiness by using:

RNATools.createRNA(String rna)

Regards,
David Huen

From markjschreiber at hotmail.com Tue Dec 9 05:25:14 2003
From: markjschreiber at hotmail.com (mark schreiber)
Date: Tue Dec 9 05:31:20 2003
Subject: [Biojava-l] How to create a SymbolList with a String thatcontains illegal Char
In-Reply-To: <200312090759.55892.david.huen@ntlworld.com>
Message-ID:

Is 'i' actually a legal symbol from the RNA alphabet, in terms of biojava? If not, how should we define it? Would it be best modelled as an atomic symbol or some kind of ambiguity? Stretching back to my biochem undergrad days, I think it should be atomic.
That will mean the RNA Alphabet's size is 5.

I've just checked the AlphabetManager.xml and inosine isn't in there. If there are no objections I will add it as an AtomicSymbol tomorrow with a mapping to the character 'i'. The question is: should it be added as a member of the RNA alphabet, as a member of the nucleotide alphabet, or both?

- Mark

From kdj at sanger.ac.uk Tue Dec 9 05:34:05 2003
From: kdj at sanger.ac.uk (Keith James)
Date: Tue Dec 9 05:40:11 2003
Subject: [Biojava-l] indexdb.Record implementation
In-Reply-To: <17231.1070935123@www56.gmx.net>
References: <17231.1070935123@www56.gmx.net>
Message-ID:

>>>>> "Eckhard" == Eckhard Lehmann writes:

    Eckhard> Hi, I need to create a quick & dirty database from a
    Eckhard> Genbank File, which contains many entries. It seems that
    Eckhard> I can do this with
    Eckhard> org.biojava.bio.program.indexdb.IndexTools.indexGenbank(...).

Yes, that's correct.

    Eckhard> But how can I get the Entries by ID, once the index is
    Eckhard> created?
    Eckhard> It seems that the following works:

    Eckhard> BioStore bst = new BioStore(new
    Eckhard> java.io.File("/path/to/indexdir"), false); Record rec =
    Eckhard> bst.get("id_of_genbank_enty");

BioStore is part of the OBDA indexing framework. It should not be necessary to create one yourself. From the BioStore docs:

"BioStores represent directory and file structures which index flat files according to the OBDA specification. The preferred method of constructing new instances is to use BioStoreFactory."

For more information on this see http://obda.open-bio.org/ and, in the biojava release, see

docs/howto/BIODATABASE-ACCESS-HOWTO.txt

and

docs/howto/FLAT-DATABASES-HOWTO.txt

    Eckhard> But Record is an interface and therefore without the
    Eckhard> implementation I would like to have (the implementation
    Eckhard> to read out the desired Genbank entry and e.g. have it as
    Eckhard> a Sequence object) .

I can see what you are getting at... the Record interface only describes byte offsets and length - it does not have any responsibility for understanding the file format.

    Eckhard> Are there somewhere implementations in biojava-1.30 for
    Eckhard> processing these Standard Records - resp. is there
    Eckhard> another way to do it without the need to extract the
    Eckhard> record from the file by parsing the byte-oriented RAF
    Eckhard> that one can get by rec.getFile()?

One way is to set up a .bioinformatics config file (see the OBDA docs referenced above) and use the applications

org.biojava.app.BioFlatIndex

and

org.biojava.app.BioGetSeq

For a quick/dirty solution, you can go straight for a flat database without using the OBDA database organisation services. Examples are in the unit tests (see package org.biojava.bio.program.indexdb in the tests tree), e.g.

"location" is a String filename of the directory which will contain the index files:

    public void testIndexGenbankDNA() throws Exception
    {
        File [] files = getDBFiles(new String [] { "part1.gb", "part2.gb" });

        IndexTools.indexGenbank("test", new File(location), files, SeqIOConstants.DNA);

        SequenceDBLite db = new FlatSequenceDB(location, "genbank");

        Sequence seq1 = db.getSequence("A16SRRNA");
        assertEquals(1497, seq1.length());

        Sequence seq2 = db.getSequence("A16STM112");
        assertEquals(1346, seq2.length());

        Sequence seq3 = db.getSequence("A16STM146");
        assertEquals(1352, seq3.length());

        Sequence seq4 = db.getSequence("AY080928");
        assertEquals(557, seq4.length());

        Sequence seq5 = db.getSequence("AY080929");
        assertEquals(556, seq5.length());

        Sequence seq6 = db.getSequence("AY080930");
        assertEquals(557, seq6.length());
    }

hth

Keith

--

- Keith James Microarray Facility, Team 65 -
- The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK -

From matthew_pocock at yahoo.co.uk Tue Dec 9 10:52:16 2003
From: matthew_pocock at yahoo.co.uk (Matthew Pocock)
Date: Tue Dec 9 10:58:26 2003
Subject: [Biojava-l] How to create a SymbolList with a String thatcontains illegal Char
In-Reply-To:
References:
Message-ID: <3FD5EFB0.6050307@yahoo.co.uk>

mark schreiber wrote:

>Is 'i' actually a legal symbol from the RNA alphabet, in terms of biojava?
>If not how should we define it? Would it be best modelled as an atomic
>symbol or some kind of ambiguity? Stretching back to my biochem undergrad
>days I think it should be atomic. That will mean the RNA Alphabets size is
>5.
>
Atomic. Our alphabets don't manage modifications well (e.g. methylated DNA). Another thing to think about for v2.

>
>I've just checked the AlphabetManager.xml and inosine isn't in there. If
>there are no objections I will add it as an AtomicSymbol tomorrow with a
>mapping to the character 'i'. The question is should it be added as a member
>of the RNA alphabet or as a member of the nucleotide alphabet or both?
>
Both, I guess. Definitely should be in RNA.

Matthew

>
>- Mark
>
>

From martin at decode.ateneo.edu Tue Dec 9 12:11:55 2003
From: martin at decode.ateneo.edu (martin@decode.ateneo.edu)
Date: Tue Dec 9 12:17:04 2003
Subject: [Biojava-l] sanger das
Message-ID: <200312091717.hB9HGxFC009151@portal.open-bio.org>

hi,

is http://servlet.sanger.ac.uk:8080/das/ down?

thanks,
martin

From ecky.l at gmx.de Tue Dec 9 14:57:54 2003
From: ecky.l at gmx.de (Eckhard Lehmann)
Date: Tue Dec 9 15:03:58 2003
Subject: [Biojava-l] indexdb.Record implementation
References:
Message-ID: <11359.1070999874@www61.gmx.net>

Hi Keith,

The OBDA specification might be interesting in the longer term, so it is probably a good idea for me to read its documentation when I have time.

> SequenceDBLite db = new FlatSequenceDB(location, "genbank");

That's what I was looking for. For my current needs the quick & dirty solution is perfectly okay.

Thank you very much,
Eckhard ;)

From markjschreiber at hotmail.com Wed Dec 10 04:20:18 2003
From: markjschreiber at hotmail.com (mark schreiber)
Date: Wed Dec 10 04:26:22 2003
Subject: [Biojava-l] How to create a SymbolList with a String thatcontains illegal Char
In-Reply-To: <3FD5EFB0.6050307@yahoo.co.uk>
Message-ID:

OK, so the vote is for Atomic. At this stage I'm not going to add it to any translation tables, as my biochem is not up to figuring out what effect inosine has on a codon.

- Mark

From divol at lirmm.fr Mon Dec 15 11:59:43 2003
From: divol at lirmm.fr (divol)
Date: Mon Dec 15 19:49:51 2003
Subject: [Biojava-l] About com.kizna.html
Message-ID: <14564C86-2F20-11D8-BBF2-003065551790@lirmm.fr>

Hi all,

Others will probably run into this problem in LocusLinkParser.java, and I did not find any pointers to it (OK, I did not check all the archives).

When I tried to compile BioJava on my Mac, I ran into the problem with the missing com.kizna.* classes. The library is available for download from:

http://htmlparser.sourceforge.net/

You should take version 1.1, as 1.2 has a new package hierarchy (as do 1.3 and later). That said, it is easy to switch to version 1.2: replace "com.kizna.html" with "org.htmlparser". Switching to 1.3 is less obvious (frankly, I did not look at the problem closely).

Could this information be put into a FAQ? ;)

If it is already written down somewhere, sorry!

Jacques Divol

From benjamins at Biomax.de Wed Dec 17 09:23:04 2003
From: benjamins at Biomax.de (Benjamin Schuster-Boeckler)
Date: Wed Dec 17 09:28:57 2003
Subject: [Biojava-l] OutOfMemoryError with huge sequence flatfile
Message-ID: <3FE066C8.8080209@Biomax.de>

I'm trying to create a DAS service for a complete genome using the dazzle servlet. The underlying sequences are stored in plain flatfiles of about 200M each. Unfortunately, dazzle requires a Sequence object for every contig that corresponds to an entry point. If I try to read one of the files into a String or directly with DNATools.readFastaDNA, I get an OutOfMemoryError, or, if I push up the max. heap size, a java.nio.BufferOverflowException in String.

How could I represent such large Sequences without having to store them in memory?

Any comment is greatly appreciated,
best regards,
Benjamin Schuster-Böckler

From Alexandre.Irrthum at icr.ac.uk Wed Dec 17 13:38:51 2003
From: Alexandre.Irrthum at icr.ac.uk (Alexandre Irrthum)
Date: Wed Dec 17 13:45:07 2003
Subject: [Biojava-l] PubMed Query
Message-ID:

Hello,

I'm new to biojava, and looking for a class that allows querying of PubMed (presumably through the Entrez Programming Utilities), returning BibRef objects. I didn't find any such class in the API javadocs (although there is a BibRefQuery interface), but I guess this must exist.

Subsidiary question: does the BioQuery project bear any relation to BioJava? Are the APIs somewhat "compatible"?

Thanks a lot for your help.

alex

From Yudong.Sun at newcastle.ac.uk Tue Dec 23 12:37:48 2003
From: Yudong.Sun at newcastle.ac.uk (Y D Sun)
Date: Tue Dec 23 12:45:21 2003
Subject: [Biojava-l] Blast Version
Message-ID: <3E0207605FF4864EBE508C1C634D698C37BB73@bond.ncl.ac.uk>

Hi,

What blast version does BlastLikeSAXParser support? I encounter the following error when running the BLAST Result Parser sample code from Biojava In Anger to parse blast 2.2.5 output:

org.xml.sax.SAXException: Program ncbi-blastp Version 2.2.5 is not supported by the biojava blast-like parsing framework
        at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:241)
        at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)

Thanks.

George