From ambi1999 at gmail.com Tue Jun 7 15:58:19 2011
From: ambi1999 at gmail.com (Ambikesh Jayal)
Date: Tue, 7 Jun 2011 20:58:19 +0100
Subject: [Biojava-l] Biojava implementation of CE algorithm
In-Reply-To:
References:
Message-ID:

Hi All,

There seems to be a discrepancy, for some protein sequences, between the results of the BioJava implementation of the CE algorithm and the implementation on the CE website http://cl.sdsc.edu/ce/ce_align.html

One example is the pair [2aza.A] and [1paz]. Other examples are 1cew.I and 1mol.A, and 1cid and 2rhe.

Is there some reason for this discrepancy?

Results using the BioJava implementation of the CE algorithm

************* [2aza.A] AND [1paz] ************
CE
afpChain.getTotalRmsdOpt() 2.5267815014062553
afpChain.getOptLength() 82

Results using the CE website http://cl.sdsc.edu/ce/ce_align.html

************* [2aza.A] AND [1paz] ************
Rmsd = 2.9Å
Aligned/gap positions = 84/49

Kind Regards,
Ambi.

From jayunit100 at gmail.com Fri Jun 10 14:30:43 2011
From: jayunit100 at gmail.com (Jay Vyas)
Date: Fri, 10 Jun 2011 14:30:43 -0400
Subject: [Biojava-l] StructurePairAligner
Message-ID:

Hi Guys: I am trying to adopt the StructurePairAligner.java program which Andreas wrote, which is available online. I noticed that "startingAlignment", "calculatedFragmentPairs" and "jointFragments" are not in the AlignmentProgressListener class. Is there an updated version of the pair aligner GUI?

    private void notifyStartingAlignment(String name1, Atom[] ca1, String name2, Atom[] ca2){
        for (AlignmentProgressListener li : listeners){
            li.startingAlignment(name1, ca1, name2, ca2);
        }
    }

    private void notifyFragmentListeners(List fragments){
        for (AlignmentProgressListener li : listeners){
            li.calculatedFragmentPairs(fragments);
        }
    }

    private void notifyJointFragments(JointFragments[] fragments){
        for (AlignmentProgressListener li : listeners){
            li.jointFragments(fragments);
        }
    }

From andreas at sdsc.edu Fri Jun 10 15:34:28 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 10 Jun 2011 12:34:28 -0700
Subject: [Biojava-l] StructurePairAligner
In-Reply-To:
References:
Message-ID:

Hi Jay,

What is the goal of what you want to do with this?

The StructurePairAligner is historically the oldest of the various structure alignment algorithms that come with BioJava. The events that you are pointing out allow you to trace what is going on in this implementation. However, because they are quite specific to this algorithm, they did not get used when we later worked on the CE and FATCAT implementations. If you want to work on derived algorithms, I recommend implementing the StructureAlignment interface and using CeMain.java (and others) as a template...

I am not sure what you mean by pair aligner GUI. There are some examples in the demo package in biojava3-structure-gui, e.g. DemoAlignmentGui.

Hope that helps,
Andreas

From jayunit100 at gmail.com Fri Jun 10 16:53:33 2011
From: jayunit100 at gmail.com (Jay Vyas)
Date: Fri, 10 Jun 2011 16:53:33 -0400
Subject: [Biojava-l] StructurePairAligner
In-Reply-To:
References:
Message-ID:

Just alignment and visualization of two chains... nothing fancy. I guess I will look at the interface.

From khalil.elmazouari at gmail.com Sun Jun 12 16:01:59 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Sun, 12 Jun 2011 22:01:59 +0200
Subject: [Biojava-l] add Source Organism to genbank
Message-ID:

Hi,

I am trying to set the organism on a RichSequence via:

    richSequence.setTaxon(new SimpleNCBITaxon(taxid)); // NCBI tax id

The genbank output:

    SOURCE
      ORGANISM
                .
    FEATURES             Location/Qualifiers
         source          1..394
                         /mol_type="genomic DNA"
                         /strand="+"
                         /organism=""
                         /db_xref="taxon:10090"

How do I add the organism display name to the SOURCE field and to the organism annotation?

Thanks

khalil

From holland at eaglegenomics.com Sun Jun 12 16:19:45 2011
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 12 Jun 2011 21:19:45 +0100
Subject: [Biojava-l] add Source Organism to genbank
In-Reply-To:
References:
Message-ID: <7794AE4E-26C3-4B19-A721-E0DD3DBED0C1@eaglegenomics.com>

It only knows the display name if you've told it what it is.
Therefore you either have to load up the NCBI taxonomy in memory or via a BioSQL database.

cheers,
Richard

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From holland at eaglegenomics.com Mon Jun 13 02:11:38 2011
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 13 Jun 2011 07:11:38 +0100
Subject: [Biojava-l] add Source Organism to genbank
In-Reply-To: <43BF6C21-7E01-4F23-9EFA-45A6213F7667@gmail.com>
References: <7794AE4E-26C3-4B19-A721-E0DD3DBED0C1@eaglegenomics.com> <43BF6C21-7E01-4F23-9EFA-45A6213F7667@gmail.com>
Message-ID: <4017484D-8887-4BCE-B244-F828DE6289C3@eaglegenomics.com>

If you investigate the SimpleNCBITaxon class, you'll see that it has an addName() method. If you call it on the instance you pass to richSequence.setTaxon() before doing the Genbank export, the name will appear correctly. (The name classes are defined as constants in the parent NCBITaxon interface; COMMON and SCIENTIFIC are the two most commonly used.)

cheers,
Richard

On 12 Jun 2011, at 21:30, Khalil El Mazouari wrote:

> Could you please show me how to set "Mus musculus" organism to RichSequence?
>
> many thanks.
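
A minimal sketch of the idea Richard describes, written against a stand-in class so that it runs without BioJava on the classpath. TaxonNameSketch, displayName() and organismQualifier() are hypothetical names invented for illustration, and the two name-class strings are assumed labels. Only SimpleNCBITaxon, addName(), setTaxon() and the SCIENTIFIC / COMMON constants come from the thread; the BioJava calls shown in the comment are an untested guess at how they fit together.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a taxon object; this is NOT the BioJava class.
// With BioJava on the classpath, the calls described above would presumably be:
//   SimpleNCBITaxon taxon = new SimpleNCBITaxon(10090);
//   taxon.addName(NCBITaxon.SCIENTIFIC, "Mus musculus");
//   richSequence.setTaxon(taxon);
public class TaxonNameSketch {
    public static final String SCIENTIFIC = "scientific name"; // assumed label
    public static final String COMMON = "common name";         // assumed label

    private final int ncbiTaxId;
    // Names grouped by name class, in insertion order
    private final Map<String, Set<String>> namesByClass = new LinkedHashMap<>();

    public TaxonNameSketch(int ncbiTaxId) {
        this.ncbiTaxId = ncbiTaxId;
    }

    // Registers a name under the given name class
    public void addName(String nameClass, String name) {
        namesByClass.computeIfAbsent(nameClass, k -> new LinkedHashSet<>()).add(name);
    }

    // First scientific name, or "" when none has been registered
    public String displayName() {
        Set<String> scientific = namesByClass.get(SCIENTIFIC);
        return (scientific == null || scientific.isEmpty()) ? "" : scientific.iterator().next();
    }

    // Mimics the two source qualifiers from the GenBank output in the question
    public String organismQualifier() {
        return "/organism=\"" + displayName() + "\" /db_xref=\"taxon:" + ncbiTaxId + "\"";
    }

    public static void main(String[] args) {
        TaxonNameSketch taxon = new TaxonNameSketch(10090);
        System.out.println(taxon.organismQualifier()); // organism is empty: no name registered yet
        taxon.addName(SCIENTIFIC, "Mus musculus");
        System.out.println(taxon.organismQualifier()); // organism is now filled in
    }
}
```

The point of the sketch is only that a bare tax id carries no name: the writer can print an organism only after a name has been attached to the taxon instance.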

From asma.rabe at gmail.com Wed Jun 15 04:38:05 2011
From: asma.rabe at gmail.com (Asma rabe)
Date: Wed, 15 Jun 2011 17:38:05 +0900
Subject: [Biojava-l] Protein protein interactions
Message-ID:

Hi all,

I would like to know whether there is any module in BioJava for processing protein-protein interactions.

Best Regards,
Asma

From amr_alhossary at hotmail.com Wed Jun 15 05:03:09 2011
From: amr_alhossary at hotmail.com (Amr AL-Hossary)
Date: Wed, 15 Jun 2011 11:03:09 +0200
Subject: [Biojava-l] Protein protein interactions
In-Reply-To:
References:
Message-ID:

Please specify exactly what you need for protein-protein interactions, Asmaa.
If it is not present, this could be a good starting point, and it could be built on request.

Amr

From p.v.troshin at dundee.ac.uk Wed Jun 15 05:36:49 2011
From: p.v.troshin at dundee.ac.uk (Peter Troshin)
Date: Wed, 15 Jun 2011 10:36:49 +0100
Subject: [Biojava-l] Protein protein interactions
In-Reply-To:
References:
Message-ID: <4DF87D31.5050801@dundee.ac.uk>

Hi Asma,

I do not know of such a module in BioJava, but if you are interested in human protein-protein interactions, check out this web site: http://www.compbio.dundee.ac.uk/www-pips. There are a few datasets available for download, so you can build on them if you need programmatic access.

I hope that helps.

Peter

Dr Peter Troshin
Bioinformatics Software Developer
Phone: +44 (0)1382 388589
Fax: +44 (0)1382 385764
The Barton Group
College of Life Sciences
Medical Sciences Institute
University of Dundee
Dundee
DD1 5EH
UK

From khalil.elmazouari at gmail.com Fri Jun 17 05:16:11 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Fri, 17 Jun 2011 11:16:11 +0200
Subject: [Biojava-l] Genbank feature parsing performance
Message-ID: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com>

Hi,

I am developing an app where features are extracted from a large genbank file, and processed: multiple alignment, annotation....

The feature extraction is a real bottleneck in my app. It consumes 87% of total execution time.

Feature extraction is done via:

    FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
    FeatureHolder fh = richSequence.filter(ff);
    Feature feat = fh.features().next();
    ...

Any suggestion on how to improve the performance of features extraction is welcome.

Thanks,

khalil

From martin.jones at ed.ac.uk Fri Jun 17 06:12:05 2011
From: martin.jones at ed.ac.uk (Martin Jones)
Date: Fri, 17 Jun 2011 11:12:05 +0100
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com>
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com>
Message-ID:

Hi,

I have had the same issue when parsing large sets of genbank files.
In my case, the workaround was to first treat the whole genbank record as a string, and do a quick regex match to check if it contained something of interest (in my case I was searching for specific taxids):

    // first do a quick pattern-match to extract the taxid so we can
    // exit early without the overhead of parsing the whole file
    private final Pattern taxidPattern = Pattern.compile("db_xref=\\\"taxon:(\\d+)");
    Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
    if (taxidMatcher.find()) {
        def taxid = taxidMatcher[0][1].toInteger()
        if (!taxidList.contains(taxid)) {
            return
        }
        // here do the slow part of actually parsing all the features
    }

This is in Groovy so there are a few syntactical differences. If you are only interested in a subset of the GenBank records, then this approach might be of use.

M

From khalil.elmazouari at gmail.com Fri Jun 17 06:33:28 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Fri, 17 Jun 2011 12:33:28 +0200
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To:
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com>
Message-ID: <7EB36AD4-DE00-433A-8DB2-3D3B14BCE0C6@gmail.com>

Thanks Martin,

I already tried the regex.
The performance increase was < 10%.

My situation differs in two respects:
1. The info to extract from the genbank file is always present.
2. There are multiple features to extract from each record.

I agree with you. Extracting a single field from a genbank file is much faster with a simple regex than with FeatureFilter.

Regards,

khalil
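
For readers who want the pre-filter from Martin's message in plain Java rather than Groovy, a minimal sketch might look like the following. The class name TaxidPrefilter, the method names and the sample record text are made up for illustration; only the regular expression is taken from the thread. A hash-based set is used for the taxid lookup as one reasonable choice, not because the thread prescribes it.

```java
import java.util.HashSet;
import java.util.OptionalInt;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaxidPrefilter {

    // Same pattern as the Groovy snippet: captures the digits after db_xref="taxon:
    private static final Pattern TAXID_PATTERN = Pattern.compile("db_xref=\"taxon:(\\d+)");

    // Returns the first taxid found in the raw record text, if there is one
    public static OptionalInt extractTaxid(String rawRecord) {
        Matcher m = TAXID_PATTERN.matcher(rawRecord);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    // True when the record carries a taxid of interest and is worth a full parse
    public static boolean worthParsing(String rawRecord, Set<Integer> wantedTaxids) {
        OptionalInt taxid = extractTaxid(rawRecord);
        return taxid.isPresent() && wantedTaxids.contains(taxid.getAsInt());
    }

    public static void main(String[] args) {
        Set<Integer> wanted = new HashSet<>();
        wanted.add(10090);
        // Invented fragment of a GenBank record, for illustration only
        String record = "     source          1..394\n"
                + "                     /db_xref=\"taxon:10090\"\n";
        if (worthParsing(record, wanted)) {
            // the slow, full parse of all features would go here
            System.out.println("parse this record");
        }
    }
}
```

As the discussion above points out, this kind of shortcut only pays off when most records can be skipped; it does little when every record must be parsed anyway.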

From khalil.elmazouari at gmail.com Fri Jun 17 07:05:38 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Fri, 17 Jun 2011 13:05:38 +0200
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To:
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com> <7EB36AD4-DE00-433A-8DB2-3D3B14BCE0C6@gmail.com>
Message-ID: <59F7E9F5-023B-471D-833A-CD8FD3F85AF6@gmail.com>

Good suggestion ;)

However, I am not familiar with Groovy. I'll look for something similar in Java.

Regards,

khalil

On 17 Jun 2011, at 12:36, Martin Jones wrote:

> Yes, this approach won't be much use if you are interested in the
> contents of every genbank record.
>
> Have you thought about parsing the gb files in parallel? In my
> experience, parsing genbank files scales quite nicely when done in
> multiple threads. I have used the GPars library for this type of job
> and it is very nice to use:
>
> http://gpars.codehaus.org/Parallelizer
>
> M

From phidias51 at gmail.com Fri Jun 17 10:36:12 2011
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 17 Jun 2011 07:36:12 -0700
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To: <59F7E9F5-023B-471D-833A-CD8FD3F85AF6@gmail.com>
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com> <7EB36AD4-DE00-433A-8DB2-3D3B14BCE0C6@gmail.com> <59F7E9F5-023B-471D-833A-CD8FD3F85AF6@gmail.com>
Message-ID:

Martin, Khalil

In the code sample you check to see if the taxon is in a list. I suspect that operation is slower than you intend. You might try using a TreeSet and see if the lookup performance improves.

As for genbank parsing performance itself, I'm curious whether you've tried parsing the genbank XML files and noticed any performance difference.

If you're looking for something similar to GPars in Java, you might try ThreadPoolExecutor, which manages a thread pool and queues Runnable tasks for it.

Hope this helps,

Mark

PS if you have Groovy code that you'd like to share, feel free to add any examples to the BioGroovy wiki.
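
A minimal sketch of the thread-pool suggestion, using only the JDK. Executors.newFixedThreadPool is used here as a convenient way to obtain a pooled executor. The class name ParallelRecordSketch, the method names and the processRecord placeholder are hypothetical; in a real application processRecord would hold the BioJava parsing and feature extraction, which is left out so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRecordSketch {

    // Placeholder for the expensive per-record work (full parse plus feature extraction).
    // Here it only counts lines that look like qualifiers (start with '/').
    public static int processRecord(String rawRecord) {
        int count = 0;
        for (String line : rawRecord.split("\n")) {
            if (line.trim().startsWith("/")) {
                count++;
            }
        }
        return count;
    }

    // Processes every record on a fixed-size pool; results come back in input order
    public static List<Integer> processAll(List<String> rawRecords, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String record : rawRecords) {
                tasks.add(() -> processRecord(record));
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll blocks until all tasks finish and preserves task order
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing records", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a record failed to process", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Invented record fragments, for illustration only
        List<String> records = Arrays.asList(
                "source 1..394\n/organism=\"\"\n/db_xref=\"taxon:10090\"",
                "source 1..120");
        System.out.println(processAll(records, 2));
    }
}
```

Whether this helps in practice depends on how the records are split up before parsing and on whether the parser can safely be used from several threads at once, neither of which the thread settles.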

From khalil.elmazouari at gmail.com Fri Jun 17 12:21:43 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Fri, 17 Jun 2011 18:21:43 +0200
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To:
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com> <7EB36AD4-DE00-433A-8DB2-3D3B14BCE0C6@gmail.com>
Message-ID: <123851F2-5554-4C94-8931-081D94138D77@gmail.com>

Hi,

The execution time for parsing Genbank, EMBL and EMBL-XML is approximately the same.

However, writing sequences in EMBL format was 87% slower than in Genbank format.

Regards,

khalil

From phidias51 at gmail.com Fri Jun 17 12:58:08 2011
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 17 Jun 2011 09:58:08 -0700
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To: <123851F2-5554-4C94-8931-081D94138D77@gmail.com>
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com> <7EB36AD4-DE00-433A-8DB2-3D3B14BCE0C6@gmail.com> <123851F2-5554-4C94-8931-081D94138D77@gmail.com>
Message-ID:

Hi Khalil,

Did you try the genbank xml format?

Mark

From dasarnow at gmail.com Sun Jun 19 21:30:49 2011
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Sun, 19 Jun 2011 18:30:49 -0700
Subject: [Biojava-l] Biojava implementation of CE algorithm
In-Reply-To:
References:
Message-ID:

Ambi,

From Biojava's CE on 2aza.A and 1paz.A, I get:

RMSD = 2.8960955657826997
Z-score = 3.7
aligned = 85

and from the original C version:

Structure Alignment Calculator, version 1.02, last modified: Jun 15, 2001.
CE Algorithm, version 1.00, 1998.
Chain 1: pdb/pdb1paz.ent:A (Size=123)
Chain 2: pdb/pdb2aza.ent:A (Size=129)
Alignment length = 85 Rmsd = 2.90A Z-Score = 3.7 Gaps = 49(57.6%) CPU = 0s Sequence identities = 11.8%

The web version of CE is the same except for 1 fewer equivalent residues (CPU/FPU differences?).

Can you post your Biojava code?

Best,
-da

From andreas at sdsc.edu Sun Jun 19 22:13:34 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 19 Jun 2011 19:13:34 -0700
Subject: [Biojava-l] Biojava implementation of CE algorithm
In-Reply-To:
References:
Message-ID:

> and from the original C version:
> Structure Alignment Calculator, version 1.02, last modified: Jun 15, 2001.

> The web version of CE is the same except for 1 fewer equivalent
> residues (CPU/FPU differences?)

There are different versions of CE out there that have been developed over time. I believe the BioJava code is based on the version from 2003 or 2004 (CE version 2.3).

Andreas

From dasarnow at gmail.com Sun Jun 19 23:23:23 2011
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Sun, 19 Jun 2011 20:23:23 -0700
Subject: [Biojava-l] Biojava implementation of CE algorithm
In-Reply-To:
References:
Message-ID:

The two distributions available on the CE homepage are the "Linux" version I used above, apparently from 2001, and a tarball from 2004 containing a SPARC32/Solaris binary.

But I think they are all (including Biojava) giving approximately the same results?

-da

From andreas at sdsc.edu Sun Jun 19 23:48:16 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 19 Jun 2011 20:48:16 -0700
Subject: [Biojava-l] Biojava implementation of CE algorithm
In-Reply-To:
References:
Message-ID:

> But I think they are all (including Biojava) giving approximately the
> same results?

Yes, I would expect so.

A

From jayunit100 at gmail.com Mon Jun 20 12:31:32 2011
From: jayunit100 at gmail.com (Jay Vyas)
Date: Mon, 20 Jun 2011 12:31:32 -0400
Subject: [Biojava-l] binning structures by quality....
Message-ID:

Hi Guys: I'm trying to bin some structures, about 30 of them. I was wondering if anyone knows the upper "limit", in RMSD units, for a structure to have a correct backbone.

For example, a structure bundle with an RMSD of 9 would clearly have an undefined backbone, whereas a structure bundle with an RMSD of 1 would definitely be precise enough to convey backbone information.

I wanted a more precise bound. I'm thinking, by eye, that anything above 4 angstroms is too imprecise to convey a backbone. But I figured maybe there was a formal treatment of such RMSD "categories" somewhere.

Any thoughts would be appreciated.

--
Jay Vyas
MMSB/UCHC

From andreas.prlic at gmail.com Tue Jun 21 13:09:54 2011
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Tue, 21 Jun 2011 10:09:54 -0700
Subject: [Biojava-l] Find hydrogen bond angle
In-Reply-To: <5F46D59F8ABDE34BB68E87EADB758CA8369A2AA7D3@CMS07.campus.gla.ac.uk>
References: <5F46D59F8ABDE34BB68E87EADB758CA8369A2AA7D3@CMS07.campus.gla.ac.uk>
Message-ID:

Hi Avid,

The Calc class uses the Atom object to represent vectors. You will need to identify the vectors between which you want to measure the angle and represent them as "atoms" as well. For example, to get the vector NH you can do:

    AminoAcid aa1 ...
Atom N = aa1.getN();
Atom H = aa1.getH();
Atom nh = Calc.subtract(N,H);

Andreas

On Tue, Jun 21, 2011 at 4:49 AM, MOHAMMAD (AVID) AFZAL <1000947A at student.gla.ac.uk> wrote:
> Dear Dr.PRLIC ,
> I want to calculate NHO and COH angle in a pdb file, using bioJava. However the angle method only takes two atoms as parameter and in your weblog you only mentioned how to calculate hydrogen bond energy by using distance method. I wonder how can I find the angle of hydrogen bond by using bioJava. Thank you kindly in advance for your time and concern.
> Yours sincerely,
> Avid Afzal.

From darnells at dnastar.com Wed Jun 22 13:21:51 2011
From: darnells at dnastar.com (Steve Darnell)
Date: Wed, 22 Jun 2011 12:21:51 -0500
Subject: [Biojava-l] binning structures by quality....
In-Reply-To:
References:
Message-ID:

Hi Jay,

Perhaps this snippet from Proteopedia will help in your search:

http://www.proteopedia.org/wiki/index.php/Structural_alignment_tools

Evaluating Structural Alignments

The structural differences between two optimally aligned models are usually measured as the Root Mean Square Deviation (RMSD) between the aligned alpha-carbon positions (excluding deviations from the non-aligned positions). To provide a frame of reference for RMSD values, note that up to 0.5 Å
RMSD of alpha carbons occurs in independent determinations of the same protein[3]. Crystallographic models of proteins with about 50% sequence identity differ by about 1 Å RMSD[3][4]. Deviations can be much larger for models determined by NMR[4].

[3] Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986 Apr;5(4):823-6. PMID:3709526

[4] Schwede T, Diemand A, Guex N, Peitsch MC. Protein structure computing in the genomic era. Res Microbiol. 2000 Mar;151(2):107-12. PMID:10865955

--

Of course, large RMSD values can occur due to misoriented regions (loops, termini, etc.) even though the core structure aligns well. See Zhang and Skolnick for one example.

Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005 Apr 22;33(7):2302-9. PMID:15849316

Maybe others have a more direct answer to your question.

Regards,
Steve

--
Steve Darnell
DNASTAR, Inc.
Madison, WI USA

-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Jay Vyas
Sent: Monday, June 20, 2011 11:32 AM
To: biojava-l at lists.open-bio.org
Subject: [Biojava-l] binning structures by quality....

Hi guys:

I'm trying to bin some structures, about 30 of them. I was wondering if anyone knows the upper "limit" for a structure to have a correct backbone, in RMSD units.

For example, a structure bundle with an RMSD of 9 would clearly have an undefined backbone, whereas a structure bundle with an RMSD of 1 would definitely be precise enough to convey backbone information.

I wanted a more precise bound. I'm thinking, by eye, that anything above 4 angstroms is too imprecise to convey a backbone. But I figured maybe there was a formal treatment of such RMSD "categories" somewhere.

Any thoughts would be appreciated.
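
For readers following the RMSD figures quoted in this thread, here is a minimal sketch of the quantity itself: the root mean square of the distances between paired atom positions. It is plain Java rather than the BioJava API, the class and method names (RmsdSketch, rmsd) are made up for illustration, and it assumes the two coordinate sets are already superimposed and paired by index. It does not answer the question of where a sensible cutoff lies.

```java
final class RmsdSketch {

    /**
     * RMSD between two coordinate sets of equal size, paired by index.
     * Each row is one atom position {x, y, z} in angstroms.
     * No superposition is performed here (assumption: already done).
     */
    static double rmsd(double[][] a, double[][] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException(
                    "need two non-empty coordinate sets of equal size");
        }
        double sumSq = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < 3; k++) {
                double d = a[i][k] - b[i][k];
                sumSq += d * d; // squared distance, accumulated per axis
            }
        }
        return Math.sqrt(sumSq / a.length);
    }

    public static void main(String[] args) {
        // Three hypothetical alpha-carbon positions, 3.8 angstroms apart.
        double[][] model1 = { {0, 0, 0}, {3.8, 0, 0}, {7.6, 0, 0} };
        // The same atoms, each shifted by 1 angstrom along z.
        double[][] model2 = { {0, 0, 1}, {3.8, 0, 1}, {7.6, 0, 1} };

        System.out.println(rmsd(model1, model1)); // prints 0.0
        System.out.println(rmsd(model1, model2)); // prints 1.0
    }
}
```

Because every deviation is squared before averaging, a few badly placed loop or terminal residues can dominate the value, which is the caveat Steve raises above.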
--
Jay Vyas
MMSB/UCHC
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From jayunit100 at gmail.com Wed Jun 22 18:24:29 2011
From: jayunit100 at gmail.com (Jay Vyas)
Date: Wed, 22 Jun 2011 18:24:29 -0400
Subject: [Biojava-l] CEAlign > 0 for exact same structures ?
Message-ID:

Hi guys.. I'm finding that two structures that are the same give me non-zero
RMSD alignments.... There could be a bug in my code, but this is an
initial notice that I'm pretty sure about.... Any thoughts on the
performance of CEAlign in ideal cases?

        Atom[] ca1 = GPdbUtils.getAtoms(s1, type);
        Atom[] ca2 = GPdbUtils.getAtoms(s2, type);

        System.out.println("Aligning two sets of atoms, " + ca1.length
                + "," + ca2.length);
        // get default parameters
        CeParameters params = new CeParameters();

        // set the maximum gap size to unlimited
        // params.setMaxGapSize(-1);
        StructureAlignment algorithm = StructureAlignmentFactory
                .getAlgorithm(CeMain.algorithmName);

        // The results are stored in an AFPChain object
        AFPChain afpChain = algorithm.align(ca1, ca2, params);
        afpChain.setName1("A");
        afpChain.setName2("B");

        return (float) afpChain.getChainRmsd();

From andreas at sdsc.edu Wed Jun 22 18:35:42 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 22 Jun 2011 15:35:42 -0700
Subject: [Biojava-l] CEAlign > 0 for exact same structures ?
In-Reply-To:
References:
Message-ID:

Hi Jay,

What are the PDB IDs? When I try this for e.g. 4hhb.A against itself I
am getting an RMSD of 0.

Use afpChain.getTotalRmsdOpt() to get the RMSD of the final
alignment. (See also DemoCE.java.)

Andreas

On Wed, Jun 22, 2011 at 3:24 PM, Jay Vyas wrote:
> Hi guys.. I'm finding that two structures that are the same give me non-zero
> RMSD alignments.... There could be a bug in my code, but this is an
> initial notice that I'm pretty sure about.... Any thoughts on the
> performance of CEAlign in ideal cases?
>
>
>        Atom[] ca1 = GPdbUtils.getAtoms(s1, type);
>        Atom[] ca2 = GPdbUtils.getAtoms(s2, type);
>
>        System.out.println("Aligning two sets of atoms, " + ca1.length
>                + "," + ca2.length);
>        // get default parameters
>        CeParameters params = new CeParameters();
>
>        // set the maximum gap size to unlimited
>        // params.setMaxGapSize(-1);
>        StructureAlignment algorithm = StructureAlignmentFactory
>                .getAlgorithm(CeMain.algorithmName);
>
>        // The results are stored in an AFPChain object
>        AFPChain afpChain = algorithm.align(ca1, ca2, params);
>        afpChain.setName1("A");
>        afpChain.setName2("B");
>
>        return (float) afpChain.getChainRmsd();
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From jayunit100 at gmail.com Sat Jun 25 11:34:42 2011
From: jayunit100 at gmail.com (Jay Vyas)
Date: Sat, 25 Jun 2011 11:34:42 -0400
Subject: [Biojava-l] help with short reads ?
Message-ID:

Hi everyone. A collaborator sent me some short reads in GZ format for 2
bacterial genomes. I have NO IDEA how to process this data or convert it.

Any help or utilities out there?

If you're interested in collaborating on a publication, let me know. We can
get your name on it. And it won't be much work for those of you that know
about contig assembly..... For me, it's out of my league; I'm a protein guy....
--
Jay Vyas
MMSB/UCHC

From jw12 at sanger.ac.uk Tue Jun 28 05:48:38 2011
From: jw12 at sanger.ac.uk (Jonathan Warren)
Date: Tue, 28 Jun 2011 10:48:38 +0100
Subject: [Biojava-l] Central registration place for BAM, BigBed and BigWig files
Message-ID: <11F1D7FB-46AD-4490-A340-879804254BB6@sanger.ac.uk>

If you have "big" file data that you think would be useful to other researchers, you can now register it with the DAS Registry at http://www.dasregistry.org

For further information see below:

You can now add bigfile formats such as BAM, BIGBED and BIGWIG using the bigfile-bam, bigfile-bigbed and bigfile-bigwig capability types of the DAS Registry. These files are not served from a DAS server but are just available from the web (thus the URLs for these "sources" are not expected to have other capabilities such as a sources command or a format command).

People can use the metadata associated with a DAS source, or in this case a bigfile, to advertise the availability of the file to other researchers. The additional information includes a coordinate system, e.g. GRCh_37, Chromosome, Homo sapiens, so others know which sequence the files should be attached to. It also includes descriptions, helpUrls, etc.

To register a big file, register yourself with the DAS Registry (simple email/pass system here: https://www.dasregistry.org/loginFirst.jsp), then go to the register-a-service page ("Register new"), select the second option, "registering a plain file..", and add the metadata for your data file.

Once you have registered your file it will appear on the https://www.dasregistry.org/listSources.jsp page, where you can filter to show only the bigfile format of the appropriate type using the capabilities drop-down.

For any problems or suggestions, please contact dasregistry at sanger.ac.uk

Many thanks

Jonathan.
Jonathan Warren
Senior Developer and DAS coordinator
blog: http://biodasman.wordpress.com/

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From alastair.m.kilpatrick at googlemail.com Wed Jun 29 06:22:25 2011
From: alastair.m.kilpatrick at googlemail.com (Alastair Kilpatrick)
Date: Wed, 29 Jun 2011 11:22:25 +0100
Subject: [Biojava-l] DNATools lower case?
In-Reply-To: <20110527121201.42723aw4e2egh86c@gator1273.hostgator.com>
References: <20110527111518.81163axdepe9fm04@gator1273.hostgator.com> <20110527121201.42723aw4e2egh86c@gator1273.hostgator.com>
Message-ID:

Hi all,

Just in case anyone is interested in this and still looking for a fix: this isn't a proper solution, but it seems to work so far (with thanks to Shirley; I have only tried it with BioJava 1.X, I'm afraid).

In BioJava 1.8.1, core-1.8.1.jar contains the file \org\biojava\bio\symbol\AlphabetManager.xml

Within that file, the lines for atomic mappings need to be changed from: to: ..similarly for C, G & T.

In BioJava 1.4, the equivalent file is at \org\biojava\bio\symbol\Alphabet.xml - the changes are just the same.

After changing, I had to update my project setup, but that may just be an Eclipse thing and isn't too much bother.

Alastair
PhD candidate, School of Informatics, University of Edinburgh

On 27 May 2011 18:12, George Waldon wrote:
> Hi Shirley,
>
> I am not really familiar with this code but I think AlternateTokenization
> was introduced after the Logo code and that is why you do not find it there.
> Can you file a bug report? Also I'll be happy to add any patch you submit.
>
> Thank you.
>
> George
>
> Quoting Shirley Hui:
>
>> Thanks George. It looks like using alternate tokenization works if you
>> are "stringifying" a Sequence explicitly using the
>> alphabet.getTokenization() method.
>>
>> But presumably this gets done within the DistributionLogo class or some
>> other class down the line and I don't want to modify any Biojava classes
>> unless I really have to.
>>
>> The way I am constructing the DNA sequences is like this:
>>
>> Sequence seq = DNATools.createDNASequence(sequence, name);
>>
>> The list of sequences is used to make a SimpleWeightMatrix wm.
>> Then the call to DistributionLogo is like this:
>>
>> Distribution dist = wm.getColumn(columnNumber);
>> DistributionLogo dl = new DistributionLogo();
>> dl.setRenderingHints(hints);
>> dl.setOpaque(false);
>> dl.setDistribution(dist);
>> dl.setPreferredSize(new Dimension((int) columnWidth, (int) columnHeight));
>> dl.setLogoPainter(new TextLogoPainter());
>> dl.setStyle(symbolColorStyle);
>>
>> Is there no way, as far as I can tell right now, in the DNATools API to
>> make DNATools use the alternate string tokenization via the call to
>> createDNASequence() or another type of set method?
>>
>> shirley
>>
>> On Fri, May 27, 2011 at 12:15 PM, George Waldon wrote:
>>
>>> Hello Shirley,
>>>
>>> I think you need to use AlternateTokenisation at some point; check the
>>> BJ1.8.1 Cookbook at http://www.biojava.org/wiki/BioJava:Cookbook:Sequence
>>>
>>> regards,
>>>
>>> George
>>>
>>> Quoting Shirley Hui:
>>>
>>>> Hi,
>>>>
>>>> I am using DNATools to generate DNA sequences.
>>>> I noticed that the static methods in DNATools a(),c(),t(),g() map to
>>>> lower case characters.
>>>> I am using the DistributionLogo class to draw sequence logos for a
>>>> set of DNA sequences.
>>>> I think DistributionLogo is calling the static methods to map the
>>>> nucleotides, which are lower case.
>>>> But I want the logo to output the nucleotides in uppercase. How can I do
>>>> this?
>>>> Thanks for your help
>>>> shirley
>>>> _______________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From kurka at mikro.biologie.tu-muenchen.de Wed Jun 29 10:39:06 2011
From: kurka at mikro.biologie.tu-muenchen.de (Hedwig Kurka)
Date: Wed, 29 Jun 2011 16:39:06 +0200
Subject: [Biojava-l] make feature to create embl or genbank file
Message-ID: <4E0B390A.5050005@mikro.biologie.tu-muenchen.de>

Hello all,

I have a problem concerning creating EMBL or Genbank files.
Below is a fragment of my code and an example of what the EMBL file looks
like.

String name = "test genome";
String seqString = pFasta.getSequence(1, pFasta.getLength());
Sequence seq = DNATools.createDNASequence(seqString, name);
Alphabet dna = AlphabetManager.alphabetForName("DNA");
RichSequence rs =
    Tools.createRichSequence(RichObjectFactory.getDefaultNamespace(), name,
    seqString, dna);
Set rfeatSet = new HashSet();
StrandedFeature.Template t = new StrandedFeature.Template();
// loop bound lost in the archive; anno.size() is assumed from anno.get(i)
for(int i=0; i<anno.size(); i++){
    int start = (int) Math.abs(anno.get(i).getStart());
    int stop = (int) Math.abs(anno.get(i).getStop());
    t.type = "CDS";
    if(start < stop){
        t.location = new RangeLocation(start, stop);
        t.strand = StrandedFeature.POSITIVE;
    }
    if(start > stop){
        t.location = new RangeLocation(stop, start);
        t.strand = StrandedFeature.NEGATIVE;
    }
    Feature f = seq.createFeature(t);
    RichFeature rf = RichFeature.Tools.enrich(f);
    rfeatSet.add(rf);
}
rs.setFeatureSet(rfeatSet);
rs = RichSequence.Tools.enrich(rs);
RichSequence.IOTools.writeEMBL(output, rs,
    RichObjectFactory.getDefaultNamespace());

EMBL file:
FT any 1889536..1890903
FT any 134636..136987
FT any 3727110..3727625
FT any 2812636..2813517
FT any 580648..581643
FT any 2330962..2331921
FT any 1012371..1013513
FT any 1260854..1261720
FT any 1602858..1603706
FT any 4108079..4108999
FT any 346637..347731
FT any 4073395..4074549

I wonder where the information about the plus and minus strand is, and why
there is "any" in the file and not "CDS" and so on.

As a tutorial I found this:
http://www.biojava.org/wiki/BioJava:Cookbook:Locations:Feature. Is there
another one?

Thank you for your help!

And any help is appreciated,

Hedwig

From gwaldon at geneinfinity.org Wed Jun 29 11:39:04 2011
From: gwaldon at geneinfinity.org (George Waldon)
Date: Wed, 29 Jun 2011 10:39:04 -0500
Subject: [Biojava-l] make feature to create embl or genbank file
In-Reply-To: <4E0B390A.5050005@mikro.biologie.tu-muenchen.de>
References: <4E0B390A.5050005@mikro.biologie.tu-muenchen.de>
Message-ID: <20110629103904.942543b411ok9xqw@gator1273.hostgator.com>

Hi Hedwig,

The problem lies with StrandedFeature. The strandedness of a feature is
the strandedness of its location. StrandedFeature should be eliminated
from bj1. Use biojavaX instead, something like this, once you have
created a RichLocation on the appropriate strand:

public Feature.Template getFeatureTemplate(RichSequence parent, RichLocation loc) {
    RichFeature.Template templ = new RichFeature.Template();
    RichAnnotation rans = new SimpleRichAnnotation();
    templ.annotation = rans;
    templ.sourceTerm = // find an appropriate term
    templ.typeTerm = RichObjectFactory.getDefaultOntology().getOrCreateTerm("CDS");
    templ.featureRelationshipSet = new TreeSet();
    templ.rankedCrossRefs = new TreeSet();
    templ.location = loc;

    // add notes if any you'd like

    return templ;
}

That should make it into the output file.

Regards,
George

Quoting Hedwig Kurka:

> Hello all,
>
> I have a problem concerning creating EMBL or Genbank files.
> Below is a fragment of my code and an example of what the EMBL file looks
> like.
>
> String name = "test genome";
> String seqString = pFasta.getSequence(1, pFasta.getLength());
> Sequence seq = DNATools.createDNASequence(seqString, name);
> Alphabet dna = AlphabetManager.alphabetForName("DNA");
> RichSequence rs =
>     Tools.createRichSequence(RichObjectFactory.getDefaultNamespace(), name,
>     seqString, dna);
> Set rfeatSet = new HashSet();
> StrandedFeature.Template t = new StrandedFeature.Template();
> // loop bound lost in the archive; anno.size() is assumed from anno.get(i)
> for(int i=0; i<anno.size(); i++){
>     int start = (int) Math.abs(anno.get(i).getStart());
>     int stop = (int) Math.abs(anno.get(i).getStop());
>     t.type = "CDS";
>     if(start < stop){
>         t.location = new RangeLocation(start, stop);
>         t.strand = StrandedFeature.POSITIVE;
>     }
>     if(start > stop){
>         t.location = new RangeLocation(stop, start);
>         t.strand = StrandedFeature.NEGATIVE;
>     }
>     Feature f = seq.createFeature(t);
>     RichFeature rf = RichFeature.Tools.enrich(f);
>     rfeatSet.add(rf);
> }
> rs.setFeatureSet(rfeatSet);
> rs = RichSequence.Tools.enrich(rs);
> RichSequence.IOTools.writeEMBL(output, rs,
>     RichObjectFactory.getDefaultNamespace());
>
> EMBL file:
> FT any 1889536..1890903
> FT any 134636..136987
> FT any 3727110..3727625
> FT any 2812636..2813517
> FT any 580648..581643
> FT any 2330962..2331921
> FT any 1012371..1013513
> FT any 1260854..1261720
> FT any 1602858..1603706
> FT any 4108079..4108999
> FT any 346637..347731
> FT any 4073395..4074549
>
> I wonder where the information about the plus and minus strand is, and why
> there is "any" in the file and not "CDS" and so on.
>
> As a tutorial I found this:
> http://www.biojava.org/wiki/BioJava:Cookbook:Locations:Feature. Is there
> another one?
>
> Thank you for your help!
>
> And any help is appreciated,
>
> Hedwig
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From kurka at mikro.biologie.tu-muenchen.de Thu Jun 30 04:31:47 2011
From: kurka at mikro.biologie.tu-muenchen.de (Hedwig Kurka)
Date: Thu, 30 Jun 2011 10:31:47 +0200
Subject: [Biojava-l] make feature to create embl or genbank file
In-Reply-To: <20110629103904.942543b411ok9xqw@gator1273.hostgator.com>
References: <4E0B390A.5050005@mikro.biologie.tu-muenchen.de> <20110629103904.942543b411ok9xqw@gator1273.hostgator.com>
Message-ID: <4E0C3473.7070406@mikro.biologie.tu-muenchen.de>

Hi George,

Thank you for your answer.
I have some questions. Maybe very stupid, but I don't know how to get
RichLocation objects.
RichLocation loc = (RichLocation) new RangeLocation(start, stop);
That doesn't work.
And how does the program know whether the feature lies on the plus or
the minus strand?

Regards,
Hedwig

On 29.06.2011 17:39, George Waldon wrote:
> Hi Hedwig,
>
> The problem lies with StrandedFeature. The strandedness of a feature
> is the strandedness of its location. StrandedFeature should be
> eliminated from bj1. Use biojavaX instead, something like this, once
> you have created a RichLocation on the appropriate strand:
>
> public Feature.Template getFeatureTemplate(RichSequence parent, RichLocation loc) {
>     RichFeature.Template templ = new RichFeature.Template();
>     RichAnnotation rans = new SimpleRichAnnotation();
>     templ.annotation = rans;
>     templ.sourceTerm = // find an appropriate term
>     templ.typeTerm = RichObjectFactory.getDefaultOntology().getOrCreateTerm("CDS");
>     templ.featureRelationshipSet = new TreeSet();
>     templ.rankedCrossRefs = new TreeSet();
>     templ.location = loc;
>
>     // add notes if any you'd like
>
>     return templ;
> }
>
> That should make it into the output file.
>
> Regards,
> George
>
>
> Quoting Hedwig Kurka:
>
>> Hello all,
>>
>> I have a problem concerning creating EMBL or Genbank files.
>> Below is a fragment of my code and an example of what the EMBL file looks
>> like.
>>
>> String name = "test genome";
>> String seqString = pFasta.getSequence(1, pFasta.getLength());
>> Sequence seq = DNATools.createDNASequence(seqString, name);
>> Alphabet dna = AlphabetManager.alphabetForName("DNA");
>> RichSequence rs =
>>     Tools.createRichSequence(RichObjectFactory.getDefaultNamespace(), name,
>>     seqString, dna);
>> Set rfeatSet = new HashSet();
>> StrandedFeature.Template t = new StrandedFeature.Template();
>> // loop bound lost in the archive; anno.size() is assumed from anno.get(i)
>> for(int i=0; i<anno.size(); i++){
>>     int start = (int) Math.abs(anno.get(i).getStart());
>>     int stop = (int) Math.abs(anno.get(i).getStop());
>>     t.type = "CDS";
>>     if(start < stop){
>>         t.location = new RangeLocation(start, stop);
>>         t.strand = StrandedFeature.POSITIVE;
>>     }
>>     if(start > stop){
>>         t.location = new RangeLocation(stop, start);
>>         t.strand = StrandedFeature.NEGATIVE;
>>     }
>>     Feature f = seq.createFeature(t);
>>     RichFeature rf = RichFeature.Tools.enrich(f);
>>     rfeatSet.add(rf);
>> }
>> rs.setFeatureSet(rfeatSet);
>> rs = RichSequence.Tools.enrich(rs);
>> RichSequence.IOTools.writeEMBL(output, rs,
>>     RichObjectFactory.getDefaultNamespace());
>>
>> EMBL file:
>> FT any 1889536..1890903
>> FT any 134636..136987
>> FT any 3727110..3727625
>> FT any 2812636..2813517
>> FT any 580648..581643
>> FT any 2330962..2331921
>> FT any 1012371..1013513
>> FT any 1260854..1261720
>> FT any 1602858..1603706
>> FT any 4108079..4108999
>> FT any 346637..347731
>> FT any 4073395..4074549
>>
>> I wonder where the information about the plus and minus strand is, and why
>> there is "any" in the file and not "CDS" and so on.
>>
>> As a tutorial I found this:
>> http://www.biojava.org/wiki/BioJava:Cookbook:Locations:Feature. Is there
>> another one?
>>
>> Thank you for your help!
>>
>> And any help is appreciated,
>>
>> Hedwig
>>
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>

From kurka at mikro.biologie.tu-muenchen.de Thu Jun 30 05:02:49 2011
From: kurka at mikro.biologie.tu-muenchen.de (Hedwig Kurka)
Date: Thu, 30 Jun 2011 11:02:49 +0200
Subject: [Biojava-l] make feature to create embl or genbank file
In-Reply-To: <33EA9829-70CA-47F6-B336-109BF00ADBCC@eaglegenomics.com>
References: <4E0C2FF4.1080504@mikro.biologie.tu-muenchen.de> <33EA9829-70CA-47F6-B336-109BF00ADBCC@eaglegenomics.com>
Message-ID: <4E0C3BB9.2030601@mikro.biologie.tu-muenchen.de>

I already built the set and populated it.
Now I want to give it to the RichSequence. But when I do that in this line:

rs.setRichFeatureSet(rfeatSet);

it says that it needs a Set<Feature>.

Regards,
Hedwig

> I'm not sure what you're trying to do - if you want to build a Set, you can just use the standard Java Collections API to create and populate a Set?
>
> cheers,
> Richard
>
> On 30 Jun 2011, at 09:12, Hedwig Kurka wrote:
>
>> Hi Richard,
>>
>> Thank you for your answer.
>> If I create RichFeature objects, then I have to do conversions in this line:
>> RichFeature f = (RichFeature) seq.createFeature(t);
>> and then in this line:
>> rs.setRichFeatureSet(rfeatSet);
>> I have the problem that I have a Set<RichFeature> and not a Set<Feature>, but I
>> didn't find a method that builds a Set containing RichFeature objects on a
>> RichSequence. Is there one?
>>
>>> The conversion from Feature to RichFeature does its best but is not
>>> ideal. As you already have a RichSequence object to work with then you
>>> would be better creating native RichFeature objects instead of doing
>>> conversions.
>>>
>>> Richard Holland
>>> Eagle Genomics Ltd
>>> Sent from my HTC
>>>
>>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>

From holland at eaglegenomics.com Thu Jun 30 05:36:12 2011
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 30 Jun 2011 10:36:12 +0100
Subject: [Biojava-l] make feature to create embl or genbank file
In-Reply-To: <4E0C3BB9.2030601@mikro.biologie.tu-muenchen.de>
References: <4E0C2FF4.1080504@mikro.biologie.tu-muenchen.de> <33EA9829-70CA-47F6-B336-109BF00ADBCC@eaglegenomics.com> <4E0C3BB9.2030601@mikro.biologie.tu-muenchen.de>
Message-ID: <285A2F4C-9C8E-4300-A48B-E31E95601570@eaglegenomics.com>

There is a coding problem in ThinRichSequence (from which SimpleRichSequence and others extend) that allows only Set<Feature> as input, but requires the Feature objects to actually be RichFeature objects. This was for a number of reasons that probably seemed good at the time, but I have now forgotten what they were.

The workaround is to declare your set as a Set<Feature> but populate it with RichFeature objects (as RichFeature extends Feature, the Set will still accept them).

The code is being phased out in favour of the new BJ3 model so it is unlikely to be fixed, but hopefully this workaround solves your particular case.

cheers,
Richard

On 30 Jun 2011, at 10:02, Hedwig Kurka wrote:

> I already built the set and populated it.
> Now I want to give it to the RichSequence. But when I do that in this line:
>
> rs.setRichFeatureSet(rfeatSet);
>
> it says that it needs a Set<Feature>.
>
> Regards,
> Hedwig
>
>> I'm not sure what you're trying to do - if you want to build a Set, you can just use the standard Java Collections API to create and populate a Set?
>>
>> cheers,
>> Richard
>>
>> On 30 Jun 2011, at 09:12, Hedwig Kurka wrote:
>>
>>> Hi Richard,
>>>
>>> Thank you for your answer.
>>> If I create RichFeature objects, then I have to do conversions in this line:
>>> RichFeature f = (RichFeature) seq.createFeature(t);
>>> and then in this line:
>>> rs.setRichFeatureSet(rfeatSet);
>>> I have the problem that I have a Set<RichFeature> and not a Set<Feature>, but I
>>> didn't find a method that builds a Set containing RichFeature objects on a
>>> RichSequence. Is there one?
>>>
>>>> The conversion from Feature to RichFeature does its best but is not
>>>> ideal. As you already have a RichSequence object to work with then you
>>>> would be better creating native RichFeature objects instead of doing
>>>> conversions.
>>>>
>>>> Richard Holland
>>>> Eagle Genomics Ltd
>>>> Sent from my HTC
>>>>
>>>
>> --
>> Richard Holland, BSc MBCS
>> Operations and Delivery Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>>
>

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From gwaldon at geneinfinity.org Thu Jun 30 11:24:23 2011
From: gwaldon at geneinfinity.org (George Waldon)
Date: Thu, 30 Jun 2011 10:24:23 -0500
Subject: [Biojava-l] make feature to create embl or genbank file
In-Reply-To: <4E0C3473.7070406@mikro.biologie.tu-muenchen.de>
References: <4E0B390A.5050005@mikro.biologie.tu-muenchen.de> <20110629103904.942543b411ok9xqw@gator1273.hostgator.com> <4E0C3473.7070406@mikro.biologie.tu-muenchen.de>
Message-ID: <20110630102423.185442z4cl0wn7eo@gator1273.hostgator.com>

No stupid questions here, only bad answers. Hope this one is good:
http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Working_with_RichLocation_objects.

- George

Quoting Hedwig Kurka:

> Hi George,
>
> Thank you for your answer.
> I have some questions. Maybe very stupid, but I don't know how to get
> RichLocation objects.
> RichLocation loc = (RichLocation) new RangeLocation(start, stop);
> That doesn't work.

From khalil.elmazouari at gmail.com Thu Jun 30 13:59:43 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Thu, 30 Jun 2011 19:59:43 +0200
Subject: [Biojava-l] Biojava-l Digest, Vol 101, Issue 14
In-Reply-To:
References:
Message-ID:

Hi Hedwig

try this:

RichFeature richFeature = RichFeature.Tools.makeEmptyFeature();
RichLocation richLocation = new SimpleRichLocation(
        new SimplePosition(start), new SimplePosition(end), rank,
        RichLocation.Strand.POSITIVE_STRAND);
richFeature.setLocation(richLocation);
richFeature.setType("misc_feat"); // or get it from RichObjectFactory.
richSequence.getFeatureSet().add(richFeature);

Regards,

khalil
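
The Set workaround described in this thread comes down to a property of Java generics: a Set<RichFeature> is not accepted where a Set<Feature> is declared, but a Set<Feature> can hold RichFeature objects. The following is only a minimal sketch of that idea. Feature, RichFeature, SimpleRichFeature and setFeatureSet below are simplified stand-ins written for this illustration; they are not the BioJava classes and do not reproduce their behaviour.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustration only: Feature, RichFeature, SimpleRichFeature and
// setFeatureSet are hypothetical stand-ins, not the BioJava classes.
class FeatureSetSketch {

    interface Feature {
        String type();
    }

    interface RichFeature extends Feature {
        int rank();
    }

    static final class SimpleRichFeature implements RichFeature {
        private final String type;
        private final int rank;

        SimpleRichFeature(String type, int rank) {
            this.type = type;
            this.rank = rank;
        }

        public String type() { return type; }
        public int rank() { return rank; }
    }

    // Mimics a setter whose signature asks for Set<Feature> but which
    // expects every element to be a RichFeature at runtime.
    static int setFeatureSet(Set<Feature> features) {
        int accepted = 0;
        for (Feature f : features) {
            if (!(f instanceof RichFeature)) {
                throw new IllegalArgumentException("not a RichFeature: " + f.type());
            }
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        // A Set<RichFeature> cannot be passed where Set<Feature> is required,
        // because Java generics are invariant. Declaring the set with the
        // base type and adding rich features to it compiles and works.
        Set<Feature> features = new LinkedHashSet<Feature>();
        features.add(new SimpleRichFeature("CDS", 1));
        features.add(new SimpleRichFeature("CDS", 2));
        System.out.println(setFeatureSet(features) + " rich features accepted");
    }
}
```

In real code the set would presumably be filled with features created on the RichSequence itself; the stand-in setter here only shows why the declared element type has to be the base type.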

From shakunb at uom.ac.mu Thu Jun 30 15:20:41 2011
From: shakunb at uom.ac.mu (Shakuntala Baichoo)
Date: Thu, 30 Jun 2011 23:20:41 +0400
Subject: [Biojava-l] Help on NCBIQBlastService and BlastXMLQuery
Message-ID:

Hi!

I would be grateful if anybody could help me with NCBIQBlastService.

I need to BLAST a set (in this case only 2) of nucleotide sequences and I am using BioJava3's NCBIQBlastService. I write the results to XML files and try to parse each XML file to get all the results, in terms of % match, e-value etc. But I am only getting the references of the sequences that have matched, as follows:

...........
trying to get BLAST results for RID 0TJFFD5E01S
Jun 30, 2011 11:10:03 PM org.biojava3.genome.query.BlastXMLQuery
INFO: Start read of 0TJFFD5E01SResults_XML.xml
Jun 30, 2011 11:10:03 PM org.biojava3.genome.query.BlastXMLQuery
INFO: Read finished
Jun 30, 2011 11:10:03 PM org.biojava3.genome.query.BlastXMLQuery getHitsQueryDef
INFO: Query for hits
Jun 30, 2011 11:10:03 PM org.biojava3.genome.query.BlastXMLQuery getHitsQueryDef
INFO: 1 hits
[CP002614, CP002487, FQ312003, CP001363, FN424405, CP000857, AE006468, AE017220, CP001138, CP001127, AM933172, AM933173, CP001144, FM200053, CP001120, CP001113, CP000886, CP000026, FR775193, AE014613, AL627266]
***********************************************
trying to get BLAST results for RID 0TJFHZV201S
Jun 30, 2011 11:10:27 PM org.biojava3.genome.query.BlastXMLQuery
INFO: Start read of 0TJFHZV201SResults_XML.xml
Jun 30, 2011 11:10:27 PM org.biojava3.genome.query.BlastXMLQuery
INFO: Read finished
Jun 30, 2011 11:10:27 PM org.biojava3.genome.query.BlastXMLQuery getHitsQueryDef
INFO: Query for hits
Jun 30, 2011 11:10:27 PM org.biojava3.genome.query.BlastXMLQuery getHitsQueryDef
INFO: 1 hits
[CP002614, CP002487, AP011957, FQ312003, CP001363, FN424405, AE006468, L19338, CP001113, CP000857, CP001138, AE017220, CP001120, CP000886, FR775195, AM933172, FM200053, AM933173, CP000026, CP001144, CP001127, AE014613, AL627267, M90677, CP000822]
BUILD SUCCESSFUL (total time: 54 seconds)

Note that when I open the generated XML file, it does contain all the results.
Any idea how to extract all the info, please?

Here's the sample program:
--------------------------------
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package BlastPackage;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map.Entry;
import java.util.Set;
import org.biojava3.core.sequence.DNASequence;
import org.biojava3.genome.query.BlastXMLQuery;
import org.biojava3.core.sequence.ProteinSequence;
import org.biojava3.core.sequence.compound.AmbiguityDNACompoundSet;
import org.biojava3.core.sequence.compound.NucleotideCompound;
import org.biojava3.core.sequence.io.DNASequenceCreator;
import org.biojava3.core.sequence.io.FastaReader;
import org.biojava3.core.sequence.io.FastaReaderHelper;
import org.biojava3.core.sequence.io.GenericFastaHeaderParser;
import org.biojava3.ws.alignment.qblast.NCBIQBlastService;
import org.biojava3.ws.alignment.qblast.NCBIQBlastAlignmentProperties;
import org.biojava3.ws.alignment.qblast.NCBIQBlastOutputProperties;
import org.biojava3.ws.alignment.qblast.NCBIQBlastOutputFormat;
import org.biojava.bio.program.sax.*;
import org.biojava.bio.program.ssbind.*;
import org.biojava.bio.search.*;
import org.biojava.bio.seq.db.*;
import org.xml.sax.*;
import org.biojava.bio.*;

public class NCBIQBlastServiceTest {

    /**
     * The program take only a string with a path toward a sequence file
     *
     * For this example, I keep it simple with a single FASTA formatted file
     *
     */
    public static void main(String[] args) {

        NCBIQBlastService rbw;
        NCBIQBlastAlignmentProperties rqb;
        NCBIQBlastOutputProperties rof;
        InputStream is = null;
        ArrayList<String> rid = new ArrayList<String>();

        try {
            // Let's capture the sequences in a file...
            //LinkedHashMap<String, DNASequence> a = FastaReaderHelper.readFastaDNASequence(new File("TestBlast.fas"));
            FileInputStream inStream = new FileInputStream("TestBlast.fas");
            FastaReader<DNASequence, NucleotideCompound> fastaReader =
                    new FastaReader<DNASequence, NucleotideCompound>(
                            inStream,
                            new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
                            new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
            LinkedHashMap<String, DNASequence> b = fastaReader.process();

            /*
             * You would imagine that one would blast a bunch of sequences of
             * identical nature with identical parameters...
             */
            rbw = new NCBIQBlastService();
            rqb = new NCBIQBlastAlignmentProperties();
            rqb.NCBIQBlastAlignmentProperties();
            rqb.setBlastProgram("blastn");
            rqb.setBlastDatabase("nr");

            /*
             * First, let's send all the sequences to the QBlast service and
             * keep the RID for fetching the results at some later moments
             * (actually, in a few seconds :-))
             *
             * Using a data structure to keep track of all request IDs is a good
             * practice.
             *
             */
            for (Entry<String, DNASequence> entry : b.entrySet()) {
                System.out.println(entry.getValue().getOriginalHeader() + "\n");
                String s = entry.getValue().toString();
                //System.out.println("Query Sequence:");
                System.out.println(s);
                String request = rbw.sendAlignmentRequest(s, rqb);
                //request=rbw.
                rid.add(request);
            }

            /*
             * Let's check that our requests have been processed. If completed,
             * let's look at the alignments with my own selection of output and
             * alignment formats.
             */
            for (String aRid : rid) {
                System.out.println("***********************************************");
                System.out.println("trying to get BLAST results for RID " + aRid);
                boolean wasBlasted = false;

                while (!wasBlasted) {
                    wasBlasted = rbw.isReady(aRid, System.currentTimeMillis());
                }

                rof = new NCBIQBlastOutputProperties();
                rof.setOutputFormat(NCBIQBlastOutputFormat.XML);
                rof.setAlignmentOutputFormat(NCBIQBlastOutputFormat.TABULAR);
                rof.setDescriptionNumber(20);
                rof.setAlignmentNumber(20);
                //System.out.println("Output Options:"+"\n"+rof.getOutputOptions());

                is = rbw.getAlignmentResults(aRid, rof);
                BufferedReader br = new BufferedReader(new InputStreamReader(is));
                String line = null;
                String OutputFilename1 = aRid + "Results_XML.xml";
                FileOutputStream fp1 = null;
                fp1 = new FileOutputStream(OutputFilename1);

                while ((line = br.readLine()) != null) {
                    //System.out.println(line);
                    new PrintStream(fp1).println(line);
                }
                fp1.close();

                BlastHomologyHits BL = new BlastHomologyHits();
                BlastXMLQuery B = new BlastXMLQuery(OutputFilename1);
                LinkedHashMap<String, ArrayList<String>> hits = B.getHitsQueryDef(1E-100);
                //System.out.println(hits);
                //LinkedHashMap<String, ArrayList<String>> Homologyhits = BL.getMatches(new File(OutputFilename1), 1E-100);
                Collection c = hits.values();
                Iterator i = c.iterator();
                while (i.hasNext())
                    System.out.println(i.next());
            }
            is.close();
        }
        /*
         * What happens if the file can't be read
         */
        catch (IOException ioe) {
            ioe.printStackTrace();
        }
        /*
         * What happens if FastaReaderHelper hits a snag
         */
        catch (Exception bio) {
            bio.printStackTrace();
        }
    }
}
------------------------

Thanks
Shakuntala

From jayunit100 at gmail.com Fri Jun 10 20:53:33 2011
From: jayunit100 at gmail.com (Jay Vyas)
Date: Fri, 10 Jun 2011 16:53:33 -0400
Subject: [Biojava-l] StructurePairAligner
In-Reply-To:
References:
Message-ID:

Just alignment and visualization of two chains... nothing fancy. I guess I will look at the interface.

From khalil.elmazouari at gmail.com Sun Jun 12 20:01:59 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Sun, 12 Jun 2011 22:01:59 +0200
Subject: [Biojava-l] add Source Organism to genbank
Message-ID:

Hi,

I am trying to set the organism on a RichSequence via:

richSequence.setTaxon(new SimpleNCBITaxon(taxid)); //ncbi tax id

The GenBank output:

SOURCE
  ORGANISM
            .
FEATURES             Location/Qualifiers
     source          1..394
                     /mol_type="genomic DNA"
                     /strand="+"
                     /organism=""
                     /db_xref="taxon:10090"

How do I add the organism display name to the SOURCE field and to the organism annotation?

Thanks

khalil

From holland at eaglegenomics.com Sun Jun 12 20:19:45 2011
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 12 Jun 2011 21:19:45 +0100
Subject: [Biojava-l] add Source Organism to genbank
In-Reply-To:
References:
Message-ID: <7794AE4E-26C3-4B19-A721-E0DD3DBED0C1@eaglegenomics.com>

It only knows the display name if you've told it what it is. Therefore you either have to load up the NCBI taxonomy in memory or via a BioSQL database.

cheers,
Richard

On 12 Jun 2011, at 21:01, Khalil El Mazouari wrote:

> Hi,
>
> I am trying to set Organism to RichSequence via:
>
> richSequence.setTaxon(new SimpleNCBITaxon(taxid)); //ncbi tax id
>
> the genbank ouptut:
>
> SOURCE
>   ORGANISM
>             .
> FEATURES             Location/Qualifiers
>      source          1..394
>                      /mol_type="genomic DNA"
>                      /strand="+"
>                      /organism=""
>                      /db_xref="taxon:10090"
>
> How to add Organism display name to the Source Field and to the annotation organism?
>
> Thanks
>
> khalil
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From holland at eaglegenomics.com Mon Jun 13 06:11:38 2011
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 13 Jun 2011 07:11:38 +0100
Subject: [Biojava-l] add Source Organism to genbank
In-Reply-To: <43BF6C21-7E01-4F23-9EFA-45A6213F7667@gmail.com>
References: <7794AE4E-26C3-4B19-A721-E0DD3DBED0C1@eaglegenomics.com> <43BF6C21-7E01-4F23-9EFA-45A6213F7667@gmail.com>
Message-ID: <4017484D-8887-4BCE-B244-F828DE6289C3@eaglegenomics.com>

If you investigate the SimpleNCBITaxon class, you'll see that it has an addName() function. If you call that on the instance you pass to richSequence.setTaxon() before doing the Genbank export, the name will appear correctly. (The name classes are defined as constants in the parent NCBITaxon interface - COMMON and SCIENTIFIC are the two most commonly used.)

cheers,
Richard

On 12 Jun 2011, at 21:30, Khalil El Mazouari wrote:

> Could you please show me how to set "Mus musculus" organism to RichSequence?
>
> many thanks.
>
> Khalil
> On 12 Jun 2011, at 22:19, Richard Holland wrote:
>
>> It only knows the display name if you've told it what it is. Therefore you either have to load up the NCBI taxonomy in memory or via a BioSQL database.
>>
>> cheers,
>> Richard
>>
>> On 12 Jun 2011, at 21:01, Khalil El Mazouari wrote:
>>
>>> Hi,
>>>
>>> I am trying to set Organism to RichSequence via:
>>>
>>> richSequence.setTaxon(new SimpleNCBITaxon(taxid)); //ncbi tax id
>>>
>>> the genbank ouptut:
>>>
>>> SOURCE
>>>   ORGANISM
>>>             .
>>> FEATURES             Location/Qualifiers
>>>      source          1..394
>>>                      /mol_type="genomic DNA"
>>>                      /strand="+"
>>>                      /organism=""
>>>                      /db_xref="taxon:10090"
>>>
>>> How to add Organism display name to the Source Field and to the annotation organism?
>>>
>>> Thanks
>>>
>>> khalil
>>> _______________________________________________
>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>> --
>> Richard Holland, BSc MBCS
>> Operations and Delivery Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>>
>

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From asma.rabe at gmail.com Wed Jun 15 08:38:05 2011
From: asma.rabe at gmail.com (Asma rabe)
Date: Wed, 15 Jun 2011 17:38:05 +0900
Subject: [Biojava-l] Protein protein interactions
Message-ID:

Hi all,

I would like to know whether there is any module in BioJava for processing protein-protein interactions.

Best Regards,
Asma

From amr_alhossary at hotmail.com Wed Jun 15 09:03:09 2011
From: amr_alhossary at hotmail.com (Amr AL-Hossary)
Date: Wed, 15 Jun 2011 11:03:09 +0200
Subject: [Biojava-l] Protein protein interactions
In-Reply-To:
References:
Message-ID:

Please identify what exactly you need for protein-protein interactions, Asmaa.
If it is not present, this could be a good starting point and it could be built upon request.

Amr

--------------------------------------------------
From: "Asma rabe"
Sent: Wednesday, June 15, 2011 10:38 AM
To:
Subject: [Biojava-l] Protein protein interactions

> Hi all,
>
> I would like to know is there any module in biojava for processing protein
> protein interactions?
>
> Best Regards,
> Asma
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From p.v.troshin at dundee.ac.uk Wed Jun 15 09:36:49 2011
From: p.v.troshin at dundee.ac.uk (Peter Troshin)
Date: Wed, 15 Jun 2011 10:36:49 +0100
Subject: [Biojava-l] Protein protein interactions
In-Reply-To:
References:
Message-ID: <4DF87D31.5050801@dundee.ac.uk>

Hi Asma,

I do not know of such a module in BioJava, but if you are interested in human protein-protein interactions, check out this web site: http://www.compbio.dundee.ac.uk/www-pips. There are a few datasets available for download, so you can build on them if you need programmatic access.

I hope that helps.

Peter

Dr Peter Troshin
Bioinformatics Software Developer
Phone: +44 (0)1382 388589
Fax: +44 (0)1382 385764
The Barton Group
College of Life Sciences
Medical Sciences Institute
University of Dundee
Dundee DD1 5EH UK

On 15/06/2011 09:38, Asma rabe wrote:
> Hi all,
>
> I would like to know is there any module in biojava for processing protein
> protein interactions?
>
> Best Regards,
> Asma
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From khalil.elmazouari at gmail.com Fri Jun 17 09:16:11 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Fri, 17 Jun 2011 11:16:11 +0200
Subject: [Biojava-l] Genbank feature parsing performance
Message-ID: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com>

Hi,

I am developing an app where features are extracted from a large genbank file, and processed: multiple alignment, annotation....

The feature extraction is a real bottleneck in my app. It consumes 87% of total execution time.

Feature extraction is done via:

  FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
  FeatureHolder fh = richSequence.filter(ff);
  Feature feat = fh.features().next();
  ...

Any suggestion on how to improve the performance of feature extraction is welcome.

Thanks,

khalil

From martin.jones at ed.ac.uk Fri Jun 17 10:12:05 2011
From: martin.jones at ed.ac.uk (Martin Jones)
Date: Fri, 17 Jun 2011 11:12:05 +0100
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com>
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com>
Message-ID:

Hi,

I have had the same issue when parsing large sets of genbank files. In my case, the workaround was to first treat the whole genbank record as a string, and do a quick regex match to check if it contained something of interest (in my case I was searching for specific taxids):

  // first do a quick pattern-match to extract the taxid so we can
  // exit early without the overhead of parsing the whole file
  private final Pattern taxidPattern = Pattern.compile("db_xref=\\\"taxon:(\\d+)");
  Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
  if (taxidMatcher.find()) {
      def taxid = taxidMatcher[0][1].toInteger()
      if (!taxidList.contains(taxid)) {
          return
      }
      // here do the slow part of actually parsing all the features
  }

This is in Groovy so there are a few syntactical differences. If you are only interested in a subset of the GenBank records, then this approach might be of use.

M

From khalil.elmazouari at gmail.com Fri Jun 17 10:33:28 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Fri, 17 Jun 2011 12:33:28 +0200
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To:
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com>
Message-ID: <7EB36AD4-DE00-433A-8DB2-3D3B14BCE0C6@gmail.com>

Thanks Martin,

I already tried the regex. The performance increase was < 10%.

My situation differs in two respects:
1. The info to extract from the genbank file is always present.
2. There are multiple features to extract from each record.

I agree with you. Extracting a single field from a genbank file is much faster with a simple regex than with FeatureFilter.

Regards,

khalil

From khalil.elmazouari at gmail.com Fri Jun 17 11:05:38 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Fri, 17 Jun 2011 13:05:38 +0200
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To:
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com> <7EB36AD4-DE00-433A-8DB2-3D3B14BCE0C6@gmail.com>
Message-ID: <59F7E9F5-023B-471D-833A-CD8FD3F85AF6@gmail.com>

Good suggestion ;) However, I am not familiar with Groovy. I'll look for something similar in Java.

Regards,

khalil

On 17 Jun 2011, at 12:36, Martin Jones wrote:

> Yes, this approach won't be much use if you are interested in the
> contents of every genbank record.
>
> Have you thought about parsing the gb files in parallel? In my
> experience, parsing genbank files scales quite nicely when done in
> multiple threads. I have used the GPars library for this type of job
> and it is very nice to use:
>
> http://gpars.codehaus.org/Parallelizer
>
> M

From phidias51 at gmail.com Fri Jun 17 14:36:12 2011
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 17 Jun 2011 07:36:12 -0700
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To: <59F7E9F5-023B-471D-833A-CD8FD3F85AF6@gmail.com>
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com> <7EB36AD4-DE00-433A-8DB2-3D3B14BCE0C6@gmail.com> <59F7E9F5-023B-471D-833A-CD8FD3F85AF6@gmail.com>
Message-ID:

Martin, Khalil

In the code sample you check to see if the taxon is in a list. I suspect that operation is slower than you intend. You might try using a TreeSet and see if the lookup performance improves. As for genbank parsing performance itself, I'm curious whether you've tried parsing the genbank XML files and noticed any performance difference.

If you're looking for something similar to GPars in Java, you might try the ThreadPoolExecutor, which manages a thread pool and queues Runnable tasks for it.

Hope this helps,

Mark

PS if you have Groovy code that you'd like to share, feel free to add any examples to the BioGroovy wiki.
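
The suggestions in this thread (a cheap regex pre-filter, a set-based taxid lookup, and a thread pool) could be combined in plain Java roughly as sketched below. This is only an illustration under stated assumptions: each GenBank record is assumed to be available as one raw String, and the class name GenbankPrefilter, the method names, and parseSlowly (a stand-in for the real BioJava feature parsing) are hypothetical, not BioJava API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenbankPrefilter {

    // Java spelling of the pattern in the Groovy snippet: digits after db_xref="taxon:
    private static final Pattern TAXID = Pattern.compile("db_xref=\"taxon:(\\d+)");

    /** Returns the first taxid found in the raw record text, or -1 if none is present. */
    static int extractTaxid(String rawRecord) {
        Matcher m = TAXID.matcher(rawRecord);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Hypothetical placeholder for the slow step (the real feature parsing would go here). */
    static String parseSlowly(String rawRecord) {
        return "parsed taxon " + extractTaxid(rawRecord);
    }

    /** Skips unwanted records cheaply, then parses the remaining ones on a fixed-size thread pool. */
    static List<String> parseWanted(List<String> rawRecords, Set<Integer> wantedTaxids, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String raw : rawRecords) {
                // Set lookup instead of a linear List.contains scan
                if (wantedTaxids.contains(extractTaxid(raw))) {
                    Callable<String> task = () -> parseSlowly(raw);
                    pending.add(pool.submit(task));
                }
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get()); // results come back in submission order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parsing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Passing a TreeSet (as Mark suggests) or a HashSet as wantedTaxids avoids scanning a list for every record. Executors.newFixedThreadPool hands back a ThreadPoolExecutor, so this is the same mechanism Mark mentions. As Khalil points out, the pre-filter only helps when some records can be skipped; when every record is needed, only the parallel part applies.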
From khalil.elmazouari at gmail.com Fri Jun 17 16:21:43 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Fri, 17 Jun 2011 18:21:43 +0200
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To:
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com> <7EB36AD4-DE00-433A-8DB2-3D3B14BCE0C6@gmail.com>
Message-ID: <123851F2-5554-4C94-8931-081D94138D77@gmail.com>

Hi,

Execution time for parsing Genbank, EMBL and EMBL-XML is approximately the same.

However, writing sequences in EMBL format was 87% slower than in Genbank format.

Regards,

khalil

From phidias51 at gmail.com Fri Jun 17 16:58:08 2011
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 17 Jun 2011 09:58:08 -0700
Subject: [Biojava-l] Genbank feature parsing performance
In-Reply-To: <123851F2-5554-4C94-8931-081D94138D77@gmail.com>
References: <43AA66DA-AC76-4172-A237-4F079D958FFD@gmail.com> <7EB36AD4-DE00-433A-8DB2-3D3B14BCE0C6@gmail.com> <123851F2-5554-4C94-8931-081D94138D77@gmail.com>
Message-ID:

Hi Khalil,

Did you try the genbank xml format?

Mark

From dasarnow at gmail.com Mon Jun 20 01:30:49 2011
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Sun, 19 Jun 2011 18:30:49 -0700
Subject: [Biojava-l] Biojava implementation of CE algorithm
In-Reply-To:
References:
Message-ID:

Ambi,

From Biojava's CE on 2aza.A and 1paz.A, I get:

RMSD = 2.8960955657826997
Z-score = 3.7
aligned = 85

and from the original C version:

Structure Alignment Calculator, version 1.02, last modified: Jun 15, 2001.

CE Algorithm, version 1.00, 1998.

Chain 1: pdb/pdb1paz.ent:A (Size=123)
Chain 2: pdb/pdb2aza.ent:A (Size=129)

Alignment length = 85 Rmsd = 2.90A Z-Score = 3.7 Gaps = 49(57.6%) CPU = 0s Sequence identities = 11.8%

The web version of CE is the same except for one fewer equivalent residue (CPU/FPU differences?). Can you post your Biojava code?

Best,
-da

From andreas at sdsc.edu Mon Jun 20 02:13:34 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 19 Jun 2011 19:13:34 -0700
Subject: [Biojava-l] Biojava implementation of CE algorithm
In-Reply-To:
References:
Message-ID:

> and from the original C version:
> Structure Alignment Calculator, version 1.02, last modified: Jun 15, 2001.

> The web version of CE is the same except for 1 fewer equivalent
> residues (CPU/FPU differences?)

There are different versions of CE out there that have been developed over time. I believe the BioJava code is based on the version from 2003 or 2004 (CE version 2.3).

Andreas

From dasarnow at gmail.com Mon Jun 20 03:23:23 2011
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Sun, 19 Jun 2011 20:23:23 -0700
Subject: [Biojava-l] Biojava implementation of CE algorithm
In-Reply-To:
References:
Message-ID:

The two distributions available on the CE homepage are the "Linux" version I used above, apparently from 2001, and a tarball from 2004 containing a SPARC32/Solaris binary.

But I think they are all (including Biojava) giving approximately the same results?

-da

From andreas at sdsc.edu Mon Jun 20 03:48:16 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 19 Jun 2011 20:48:16 -0700
Subject: [Biojava-l] Biojava implementation of CE algorithm
In-Reply-To:
References:
Message-ID:

> But I think they are all (including Biojava) giving approximately the
> same results?

Yes, I would expect so.

A

From jayunit100 at gmail.com Mon Jun 20 16:31:32 2011
From: jayunit100 at gmail.com (Jay Vyas)
Date: Mon, 20 Jun 2011 12:31:32 -0400
Subject: [Biojava-l] binning structures by quality....
Message-ID:

Hi Guys : I'm trying to bin some structures, about 30 of them. I was wondering if anyone knows the upper "limit" for a structure to have a correct backbone, in RMSD units. For example, a structure bundle with RMSD of 9 would clearly have an undefined backbone, whereas a structure bundle with an RMSD of 1 would definitely be precise enough to convey backbone information.

I wanted a more precise bound. I'm thinking, by eye, that anything above 4 angstroms is too imprecise to convey a backbone. But I figured maybe there was a formal treatment of such RMSD "categories" somewhere.

Any thoughts would be appreciated.

--
Jay Vyas
MMSB/UCHC

From andreas.prlic at gmail.com Tue Jun 21 17:09:54 2011
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Tue, 21 Jun 2011 10:09:54 -0700
Subject: [Biojava-l] Find hydrogen bond angle
In-Reply-To: <5F46D59F8ABDE34BB68E87EADB758CA8369A2AA7D3@CMS07.campus.gla.ac.uk>
References: <5F46D59F8ABDE34BB68E87EADB758CA8369A2AA7D3@CMS07.campus.gla.ac.uk>
Message-ID:

Hi Avid,

The Calc class uses the atom object to represent vectors. You will need to identify the vectors between which you want to measure the angle and represent them as "atoms" as well. For example to get the vector NH you can do

  AminoAcid aa1 ...
  Atom N = aa1.getN();
  Atom H = aa1.getH();
  Atom nh = Calc.subtract(N,H);

Andreas

On Tue, Jun 21, 2011 at 4:49 AM, MOHAMMAD (AVID) AFZAL <1000947A at student.gla.ac.uk> wrote:
> Dear Dr. Prlic,
> I want to calculate the NHO and COH angles in a pdb file, using BioJava. However, the angle method only takes two atoms as parameters, and in your weblog you only mentioned how to calculate hydrogen bond energy by using the distance method. I wonder how I can find the angle of a hydrogen bond by using BioJava. Thank you kindly in advance for your time and concern.
> Yours sincerely,
> Avid Afzal.

From darnells at dnastar.com Wed Jun 22 17:21:51 2011
From: darnells at dnastar.com (Steve Darnell)
Date: Wed, 22 Jun 2011 12:21:51 -0500
Subject: [Biojava-l] binning structures by quality....
In-Reply-To:
References:
Message-ID:

Hi Jay,

Perhaps this snippet from Proteopedia will help in your search:

http://www.proteopedia.org/wiki/index.php/Structural_alignment_tools

Evaluating Structural Alignments

The structural differences between two optimally aligned models are usually measured as the Root Mean Square Deviation (RMSD) between the aligned alpha-carbon positions (excluding deviations from the non-aligned positions). To provide a frame of reference for RMSD values, note that up to 0.5 Å RMSD of alpha carbons occurs in independent determinations of the same protein[3]. Crystallographic models of proteins with about 50% sequence identity differ by about 1 Å RMSD[3][4]. Deviations can be much larger for models determined by NMR[4].

[3] Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986 Apr;5(4):823-6. PMID:3709526
[4] Schwede T, Diemand A, Guex N, Peitsch MC. Protein structure computing in the genomic era. Res Microbiol. 2000 Mar;151(2):107-12. PMID:10865955

--

Of course, large RMSD values can occur due to misoriented regions (loops, termini, etc.) even though the core structure aligns well. See Zhang and Skolnick for one example.

Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005 Apr 22;33(7):2302-9. PMID:15849316

Maybe others have a more direct answer to your question.

Regards,
Steve

--
Steve Darnell
DNASTAR, Inc.
Madison, WI USA

-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Jay Vyas
Sent: Monday, June 20, 2011 11:32 AM
To: biojava-l at lists.open-bio.org
Subject: [Biojava-l] binning structures by quality....

Hi Guys : I'm trying to bin some structures, about 30 of them. I was wondering if anyone knows the upper "limit" for a structure to have a correct backbone, in RMSD units. For example, a structure bundle with RMSD of 9 would clearly have an undefined backbone, whereas a structure bundle with an RMSD of 1 would definitely be precise enough to convey backbone information.

I wanted a more precise bound. I'm thinking, by eye, that anything above 4 angstroms is too imprecise to convey a backbone. But I figured maybe there was a formal treatment of such RMSD "categories" somewhere.

Any thoughts would be appreciated.
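
As a reference point for the RMSD values discussed in this thread, the quantity itself can be sketched as below. This is a minimal illustration, assuming the two coordinate sets are already superposed and paired one-to-one (for example, aligned alpha carbons); the alignment and superposition steps that CE or a similar tool performs are not shown. RmsdSketch and rmsd are illustrative names, not BioJava API.

```java
public class RmsdSketch {

    /**
     * Root mean square deviation over paired coordinates.
     * Each row is one atom as {x, y, z}; row i of a is paired with row i of b.
     * RMSD = sqrt( (1/n) * sum_i |a_i - b_i|^2 )
     */
    static double rmsd(double[][] a, double[][] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("need two non-empty coordinate sets of equal length");
        }
        double sumSquared = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < 3; k++) {
                double d = a[i][k] - b[i][k];
                sumSquared += d * d;
            }
        }
        return Math.sqrt(sumSquared / a.length);
    }
}
```

With this definition, a structure compared against an identical copy of itself gives exactly 0, and a rigid shift of every atom by the same vector gives the length of that vector when no superposition is applied first.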
--
Jay Vyas
MMSB/UCHC
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From jayunit100 at gmail.com Wed Jun 22 22:24:29 2011
From: jayunit100 at gmail.com (Jay Vyas)
Date: Wed, 22 Jun 2011 18:24:29 -0400
Subject: [Biojava-l] CEAlign > 0 for exact same structures ?
Message-ID:

Hi guys.. I'm finding that two structures that are the same give me non-zero RMSD alignments.... There could be a bug in my code, but this is an initial notice that I'm pretty sure about.... Any thoughts on the performance of CEAlign in ideal cases ?

    Atom[] ca1 = GPdbUtils.getAtoms(s1, type);
    Atom[] ca2 = GPdbUtils.getAtoms(s2, type);

    System.out.println("Aligning two sets of atoms, " + ca1.length + "," + ca2.length);

    // get default parameters
    CeParameters params = new CeParameters();

    // set the maximum gap size to unlimited
    // params.setMaxGapSize(-1);
    StructureAlignment algorithm = StructureAlignmentFactory
            .getAlgorithm(CeMain.algorithmName);

    // The results are stored in an AFPChain object
    AFPChain afpChain = algorithm.align(ca1, ca2, params);
    afpChain.setName1("A");
    afpChain.setName2("B");

    return (float) afpChain.getChainRmsd();

From andreas at sdsc.edu Wed Jun 22 22:35:42 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 22 Jun 2011 15:35:42 -0700
Subject: [Biojava-l] CEAlign > 0 for exact same structures ?
In-Reply-To:
References:
Message-ID:

Hi Jay,

What are the PDB IDs? When I try this for e.g. 4hhb.A against itself I am getting an RMSD of 0.

Use afpChain.getTotalRmsdOpt() to get the RMSD of the final alignment. (see also DemoCE.java)

Andreas

On Wed, Jun 22, 2011 at 3:24 PM, Jay Vyas wrote:
> Hi guys.. I'm finding that two structures that are the same give me non-zero
> RMSD alignments.... There could be a bug in my code, but this is an
> initial notice that I'm pretty sure about.... Any thoughts on the
> performance of CEAlign in ideal cases ?
>
>     Atom[] ca1 = GPdbUtils.getAtoms(s1, type);
>     Atom[] ca2 = GPdbUtils.getAtoms(s2, type);
>
>     System.out.println("Aligning two sets of atoms, " + ca1.length
>             + "," + ca2.length);
>     // get default parameters
>     CeParameters params = new CeParameters();
>
>     // set the maximum gap size to unlimited
>     // params.setMaxGapSize(-1);
>     StructureAlignment algorithm = StructureAlignmentFactory
>             .getAlgorithm(CeMain.algorithmName);
>
>     // The results are stored in an AFPChain object
>     AFPChain afpChain = algorithm.align(ca1, ca2, params);
>     afpChain.setName1("A");
>     afpChain.setName2("B");
>
>     return (float) afpChain.getChainRmsd();
>

From jayunit100 at gmail.com Sat Jun 25 15:34:42 2011
From: jayunit100 at gmail.com (Jay Vyas)
Date: Sat, 25 Jun 2011 11:34:42 -0400
Subject: [Biojava-l] help with short reads ?
Message-ID:

Hi everyone. A collaborator sent me some short reads in GZ format for 2 bacterial genomes. I have NO IDEA how to process this data or convert it. Any help or utilities out there ?

If you're interested in collaborating on a publication, let me know. We can get your name on it. And it won't be much work for those of you that know about contig assembly..... For me, it's out of my league, I'm a protein guy....
--
Jay Vyas
MMSB/UCHC

From jw12 at sanger.ac.uk Tue Jun 28 09:48:38 2011
From: jw12 at sanger.ac.uk (Jonathan Warren)
Date: Tue, 28 Jun 2011 10:48:38 +0100
Subject: [Biojava-l] Central registration place for BAM, BigBed and BigWig files
Message-ID: <11F1D7FB-46AD-4490-A340-879804254BB6@sanger.ac.uk>

If you have "big" file data that you think would be useful to other researchers you can now register them with the DAS Registry at http://www.dasregistry.org

For further information see below:

You can now add bigfile formats such as BAM, BIGBED and BIGWIG using the bigfile-bam, bigfile-bigbed and bigfile-bigwig capability types of the DAS Registry. These files are not served from a DAS server but are just available from the web (thus the URLs for these "sources" are not expected to have other capabilities such as a sources command or a format command).

People can use the metadata associated with a DAS source, or in this case a bigfile, to advertise the availability of the file to other researchers. The additional information includes a coordinate system, e.g. GRCh_37, Chromosome, Homo sapiens, so others know what is the correct sequence to attach the files to. Also descriptions and helpUrls etc.

To register a big file, register yourself with the DAS Registry (simple email/pass system here: https://www.dasregistry.org/loginFirst.jsp), then go to the register a service page ("Register new"), select the second option "registering a plain file.." and then add the metadata for your data file.

Once you have registered your file it will appear in the https://www.dasregistry.org/listSources.jsp page; you can filter to show only the bigfile format of the appropriate type using the capabilities drop down.

Any problems or suggestions please contact dasregistry at sanger.ac.uk

Many thanks

Jonathan.
Jonathan Warren
Senior Developer and DAS coordinator
blog: http://biodasman.wordpress.com/

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From alastair.m.kilpatrick at googlemail.com Wed Jun 29 10:22:25 2011
From: alastair.m.kilpatrick at googlemail.com (Alastair Kilpatrick)
Date: Wed, 29 Jun 2011 11:22:25 +0100
Subject: [Biojava-l] DNATools lower case?
In-Reply-To: <20110527121201.42723aw4e2egh86c@gator1273.hostgator.com>
References: <20110527111518.81163axdepe9fm04@gator1273.hostgator.com> <20110527121201.42723aw4e2egh86c@gator1273.hostgator.com>
Message-ID:

Hi all,

Just in case anyone is interested in this and still looking for a fix - this isn't a proper solution but it seems to work so far (with thanks to Shirley, have only tried with BioJava 1.X I'm afraid):

In BioJava 1.8.1, core-1.8.1.jar contains the file \org\biojava\bio\symbol\AlphabetManager.xml

Within that file, the lines for atomic mappings need to be changed from:

to:

..similarly for C, G & T.

In BioJava 1.4, the equivalent file is at \org\biojava\bio\symbol\Alphabet.xml - the changes are just the same.

After changing, I had to update my project setup, but that may just be an Eclipse thing, isn't too much bother.

Alastair

PhD candidate,
School of Informatics,
University of Edinburgh

On 27 May 2011 18:12, George Waldon wrote:
> Hi Shirley,
>
> I am not really familiar with this code but I think AlternateTokenization
> was introduced after the Logo code and that is why you do not find it there.
> Can you file a bug report? Also I'll be happy to add any patch you submit.
>
> Thank you.
>
> George
>
> Quoting Shirley Hui:
>
>> Thanks George. It looks like using alternate tokenization works if you are
>> "stringifying" a Sequence explicitly using the alphabet.getTokenization()
>> method.
>>
>> But presumably this gets done within the DistributionLogo class or some
>> other class down the line and I don't want to modify any Biojava classes
>> unless I really have to.
>>
>> The way I am constructing the DNA sequences is like this:
>>
>> Sequence seq = DNATools.createDNASequence(sequence, name);
>>
>> The list sequences is used to make a SimpleWeightMatrix wm.
>> Then the call to DistributionLogo is like this:
>>
>> Distribution dist = wm.getColumn(columnNumber);
>> DistributionLogo dl = new DistributionLogo();
>> dl.setRenderingHints(hints);
>> dl.setOpaque(false);
>> dl.setDistribution(dist);
>> dl.setPreferredSize(new Dimension((int) columnWidth, (int) columnHeight));
>> dl.setLogoPainter(new TextLogoPainter());
>> dl.setStyle(symbolColorStyle);
>>
>> Is there no way that I can tell right now in the DNATools API to make the
>> DNATools use the alternate string tokenization via the call to
>> createDNASequence() or another type of set method?
>>
>> shirley
>>
>> On Fri, May 27, 2011 at 12:15 PM, George Waldon wrote:
>>
>>> Hello Shirley,
>>>
>>> I think you need to use AlternateTokenisation at some point; check BJ1.8.1
>>> Cookbook at http://www.biojava.org/wiki/BioJava:Cookbook:Sequence
>>>
>>> regards,
>>>
>>> George
>>>
>>> Quoting Shirley Hui:
>>>
>>>> Hi,
>>>>
>>>> I am using DNATools to generate dna Sequences.
>>>> I noticed that the static methods in DNATools a(),c(),t(),g() map to
>>>> lower case characters.
>>>> I am using the DistributionLogo class to draw sequence logos for a set
>>>> of dna Sequences.
>>>> I think DistributionLogo is calling the static methods to map the
>>>> nucleotides, which is lower case.
>>>> But I want the logo to output the nucleotides in uppercase. How can I do
>>>> this?
>>>> Thanks for your help
>>>> shirley

From kurka at mikro.biologie.tu-muenchen.de Wed Jun 29 14:39:06 2011
From: kurka at mikro.biologie.tu-muenchen.de (Hedwig Kurka)
Date: Wed, 29 Jun 2011 16:39:06 +0200
Subject: [Biojava-l] make feature to create embl or genbank file
Message-ID: <4E0B390A.5050005@mikro.biologie.tu-muenchen.de>

Hello all,

I have a problem concerning creating EMBL or Genbank files.
Below is a fragment of my code and an example of what the EMBL file looks like.

    String name = "test genome";
    String seqString = pFasta.getSequence(1, pFasta.getLength());
    Sequence seq = DNATools.createDNASequence(seqString, name);
    Alphabet dna = AlphabetManager.alphabetForName("DNA");
    RichSequence rs = Tools.createRichSequence(RichObjectFactory.getDefaultNamespace(), name, seqString, dna);
    Set rfeatSet = new HashSet();
    StrandedFeature.Template t = new StrandedFeature.Template();
    for(int i=0; i<...){
        int start = (int) Math.abs(anno.get(i).getStart());
        int stop = (int) Math.abs(anno.get(i).getStop());
        t.type = "CDS";
        if(start < stop){
            t.location = new RangeLocation(start, stop);
            t.strand = StrandedFeature.POSITIVE;
        }
        if(start > stop){
            t.location = new RangeLocation(stop, start);
            t.strand = StrandedFeature.NEGATIVE;
        }
        Feature f = seq.createFeature(t);
        RichFeature rf = RichFeature.Tools.enrich(f);
        rfeatSet.add(rf);
    }
    rs.setFeatureSet(rfeatSet);
    rs = RichSequence.Tools.enrich(rs);
    RichSequence.IOTools.writeEMBL(output, rs, RichObjectFactory.getDefaultNamespace());

EMBL file:
FT   any   1889536..1890903
FT   any   134636..136987
FT   any   3727110..3727625
FT   any   2812636..2813517
FT   any   580648..581643
FT   any   2330962..2331921
FT   any   1012371..1013513
FT   any   1260854..1261720
FT   any   1602858..1603706
FT   any   4108079..4108999
FT   any   346637..347731
FT   any   4073395..4074549

I wonder where the information of plus and minus strand is, why is there "any" in
the file and not "CDS" and so on.

As tutorial I found that: http://www.biojava.org/wiki/BioJava:Cookbook:Locations:Feature. Is there another one?

Thank you for your help!

And any help is appreciated,

Hedwig

From gwaldon at geneinfinity.org Wed Jun 29 15:39:04 2011
From: gwaldon at geneinfinity.org (George Waldon)
Date: Wed, 29 Jun 2011 10:39:04 -0500
Subject: [Biojava-l] make feature to create embl or genbank file
In-Reply-To: <4E0B390A.5050005@mikro.biologie.tu-muenchen.de>
References: <4E0B390A.5050005@mikro.biologie.tu-muenchen.de>
Message-ID: <20110629103904.942543b411ok9xqw@gator1273.hostgator.com>

Hi Hedwig,

The problem lies with StrandedFeature. The strandedness of a feature is the strandedness of its location. StrandedFeature should be eliminated from bj1. Use biojavaX instead, something like this, once you have created a RichLocation on the appropriate strand:

    public Feature.Template getFeatureTemplate(RichSequence parent, RichLocation loc) {
        RichFeature.Template templ = new RichFeature.Template();
        RichAnnotation rans = new SimpleRichAnnotation();
        templ.annotation = rans;
        templ.sourceTerm = // find an appropriate term
        templ.typeTerm = RichObjectFactory.getDefaultOntology().getOrCreateTerm("CDS");
        templ.featureRelationshipSet = new TreeSet();
        templ.rankedCrossRefs = new TreeSet();
        templ.location = loc;

        // add notes if any you'd like

        return templ;
    }

That should make it into the output file.

Regards,
George

Quoting Hedwig Kurka:

> Hello all,
>
> I have a problem concerning creating EMBL or Genbank files.
> Below is a fragment of my code and an example of how the EMBL file looks
> like.
>
>     String name = "test genome";
>     String seqString = pFasta.getSequence(1, pFasta.getLength());
>     Sequence seq = DNATools.createDNASequence(seqString, name);
>     Alphabet dna = AlphabetManager.alphabetForName("DNA");
>     RichSequence rs =
>         Tools.createRichSequence(RichObjectFactory.getDefaultNamespace(), name,
>         seqString, dna);
>     Set rfeatSet = new HashSet();
>     StrandedFeature.Template t = new StrandedFeature.Template();
>     for(int i=0; i<...){
>         int start = (int) Math.abs(anno.get(i).getStart());
>         int stop = (int) Math.abs(anno.get(i).getStop());
>         t.type = "CDS";
>         if(start < stop){
>             t.location = new RangeLocation(start, stop);
>             t.strand = StrandedFeature.POSITIVE;
>         }
>         if(start > stop){
>             t.location = new RangeLocation(stop, start);
>             t.strand = StrandedFeature.NEGATIVE;
>         }
>         Feature f = seq.createFeature(t);
>         RichFeature rf = RichFeature.Tools.enrich(f);
>         rfeatSet.add(rf);
>     }
>     rs.setFeatureSet(rfeatSet);
>     rs = RichSequence.Tools.enrich(rs);
>     RichSequence.IOTools.writeEMBL(output, rs,
>         RichObjectFactory.getDefaultNamespace());
>
> EMBL file:
> FT   any   1889536..1890903
> FT   any   134636..136987
> FT   any   3727110..3727625
> FT   any   2812636..2813517
> FT   any   580648..581643
> FT   any   2330962..2331921
> FT   any   1012371..1013513
> FT   any   1260854..1261720
> FT   any   1602858..1603706
> FT   any   4108079..4108999
> FT   any   346637..347731
> FT   any   4073395..4074549
>
> I wonder where the information of plus and minus strand is, why is there
> "any" in the file and not "CDS" and so on.
>
> As tutorial I found that:
> http://www.biojava.org/wiki/BioJava:Cookbook:Locations:Feature. Is there
> another one?
>
> Thank you for your help!
>
> And any help is appreciated,
>
> Hedwig

From kurka at mikro.biologie.tu-muenchen.de Thu Jun 30 08:31:47 2011
From: kurka at mikro.biologie.tu-muenchen.de (Hedwig Kurka)
Date: Thu, 30 Jun 2011 10:31:47 +0200
Subject: [Biojava-l] make feature to create embl or genbank file
In-Reply-To: <20110629103904.942543b411ok9xqw@gator1273.hostgator.com>
References: <4E0B390A.5050005@mikro.biologie.tu-muenchen.de> <20110629103904.942543b411ok9xqw@gator1273.hostgator.com>
Message-ID: <4E0C3473.7070406@mikro.biologie.tu-muenchen.de>

Hi George,

Thank you for your answer.
I have some questions. Maybe very stupid, but I don't know how to get RichLocation objects.

    RichLocation loc = (RichLocation) new RangeLocation(start, stop);

That doesn't work.
And how does the program know that the feature lies on the plus or the minus strand?

Regards,
Hedwig

On 29.06.2011 17:39, George Waldon wrote:
> Hi Hedwig,
>
> The problem holds with StrandedFeature. The strandeness of a feature
> is the transdeness of its location. StrandedFeature should be
> eliminated from bj1. Use biojavaX instead, something like this, once
> you have created a RichLocation on the appropriate strand:
>
>     public Feature.Template getFeatureTemplate(RichSequence
>         parent, RichLocation loc) {
>         RichFeature.Template templ = new RichFeature.Template();
>         RichAnnotation rans = new SimpleRichAnnotation();
>         templ.annotation = rans;
>         templ.sourceTerm = // find an appropriate term
>         templ.typeTerm =
>             RichObjectFactory.getDefaultOntology().getOrCreateTerm("CDS");
>         templ.featureRelationshipSet = new TreeSet();
>         templ.rankedCrossRefs = new TreeSet();
>         templ.location = loc;
>
>         // add notes if any you'd like
>
>         return templ;
>     }
>
> That should make it into the output file.
>
> Regards,
> George
>
> Quoting Hedwig Kurka:
>
>> Hello all,
>>
>> I have a problem concerning creating EMBL or Genbank files.
>> Below is a fragment of my code and an example of how the EMBL file looks
>> like.
>>
>>     String name = "test genome";
>>     String seqString = pFasta.getSequence(1, pFasta.getLength());
>>     Sequence seq = DNATools.createDNASequence(seqString, name);
>>     Alphabet dna = AlphabetManager.alphabetForName("DNA");
>>     RichSequence rs =
>>         Tools.createRichSequence(RichObjectFactory.getDefaultNamespace(), name,
>>         seqString, dna);
>>     Set rfeatSet = new HashSet();
>>     StrandedFeature.Template t = new StrandedFeature.Template();
>>     for(int i=0; i<...){
>>         int start = (int) Math.abs(anno.get(i).getStart());
>>         int stop = (int) Math.abs(anno.get(i).getStop());
>>         t.type = "CDS";
>>         if(start < stop){
>>             t.location = new RangeLocation(start, stop);
>>             t.strand = StrandedFeature.POSITIVE;
>>         }
>>         if(start > stop){
>>             t.location = new RangeLocation(stop, start);
>>             t.strand = StrandedFeature.NEGATIVE;
>>         }
>>         Feature f = seq.createFeature(t);
>>         RichFeature rf = RichFeature.Tools.enrich(f);
>>         rfeatSet.add(rf);
>>     }
>>     rs.setFeatureSet(rfeatSet);
>>     rs = RichSequence.Tools.enrich(rs);
>>     RichSequence.IOTools.writeEMBL(output, rs,
>>         RichObjectFactory.getDefaultNamespace());
>>
>> EMBL file:
>> FT   any   1889536..1890903
>> FT   any   134636..136987
>> FT   any   3727110..3727625
>> FT   any   2812636..2813517
>> FT   any   580648..581643
>> FT   any   2330962..2331921
>> FT   any   1012371..1013513
>> FT   any   1260854..1261720
>> FT   any   1602858..1603706
>> FT   any   4108079..4108999
>> FT   any   346637..347731
>> FT   any   4073395..4074549
>>
>> I wonder where the information of plus and minus strand is, why is there
>> "any" in the file and not "CDS" and so on.
>>
>> As tutorial I found that:
>> http://www.biojava.org/wiki/BioJava:Cookbook:Locations:Feature. Is there
>> another one?
>>
>> Thank you for your help!
>>
>> And any help is appreciated,
>>
>> Hedwig

From kurka at mikro.biologie.tu-muenchen.de Thu Jun 30 09:02:49 2011
From: kurka at mikro.biologie.tu-muenchen.de (Hedwig Kurka)
Date: Thu, 30 Jun 2011 11:02:49 +0200
Subject: [Biojava-l] make feature to create embl or genbank file
In-Reply-To: <33EA9829-70CA-47F6-B336-109BF00ADBCC@eaglegenomics.com>
References: <4E0C2FF4.1080504@mikro.biologie.tu-muenchen.de> <33EA9829-70CA-47F6-B336-109BF00ADBCC@eaglegenomics.com>
Message-ID: <4E0C3BB9.2030601@mikro.biologie.tu-muenchen.de>

I already built the set and populated it.
Now I want to give it to the RichSequence. But when I do that in this line:

    rs.setRichFeatureSet(rfeatSet);

It says that it needs a Set

Regards,
Hedwig

> I'm not sure what you're trying to do - if you want to build a Set, you can just use the standard Java Collections API to create and populate a Set?
>
> cheers,
> Richard
>
> On 30 Jun 2011, at 09:12, Hedwig Kurka wrote:
>
>> Hi Richard,
>>
>> Thank you for your answer.
>> If I create RichFeature objects, then I have to do conversions in that line:
>>     RichFeature f = (RichFeature) seq.createFeature(t);
>> and then I have in that line:
>>     rs.setRichFeatureSet(rfeatSet);
>> the problem, that I have a Set and not Set, but I
>> didn't find a method builds a Set containing RichFeature objects on a
>> RichSequence. Is there one?
>>
>>> The conversion from Feature to RichFeature does its best but is not
>>> ideal. As you already have a RichSequence object to work with then you
>>> would be better creating native RichFeature objects instead of doing
>>> conversions.
>>>
>>> Richard Holland
>>> Eagle Genomics Ltd
>>> Sent from my HTC
>>>
>>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>

From holland at eaglegenomics.com Thu Jun 30 09:36:12 2011
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 30 Jun 2011 10:36:12 +0100
Subject: [Biojava-l] make feature to create embl or genbank file
In-Reply-To: <4E0C3BB9.2030601@mikro.biologie.tu-muenchen.de>
References: <4E0C2FF4.1080504@mikro.biologie.tu-muenchen.de> <33EA9829-70CA-47F6-B336-109BF00ADBCC@eaglegenomics.com> <4E0C3BB9.2030601@mikro.biologie.tu-muenchen.de>
Message-ID: <285A2F4C-9C8E-4300-A48B-E31E95601570@eaglegenomics.com>

There is a coding problem in ThinRichSequence (from which SimpleRichSequence and others extend) that allows only Set<Feature> as input, but requires the Feature objects to actually be RichFeature objects. This was for a number of reasons that probably seemed good at the time but I have now forgotten what they were. The workaround is to declare your set as a Set<Feature> but populate it with RichFeature objects (as RichFeature extends Feature and so the Set will still accept them).

The code is being phased out in favour of the new BJ3 model so it is unlikely to be fixed, but hopefully this workaround solves your particular case.

cheers,
Richard

On 30 Jun 2011, at 10:02, Hedwig Kurka wrote:

> I already built the set and populated it.
> Now I want to give it the RichSequence. But when I do that in that line:
>
>     rs.setRichFeatureSet(rfeatSet);
>
> It says, that it needs a Set
>
> Regards,
> Hedwig
>
>> I'm not sure what you're trying to do - if you want to build a Set, you can just use the standard Java Collections API to create and populate a Set?
>>
>> cheers,
>> Richard
>>
>> On 30 Jun 2011, at 09:12, Hedwig Kurka wrote:
>>
>>> Hi Richard,
>>>
>>> Thank you for your answer.
>>> If I create RichFeature objects, then I have to do conversions in that line:
>>>     RichFeature f = (RichFeature) seq.createFeature(t);
>>> and then I have in that line:
>>>     rs.setRichFeatureSet(rfeatSet);
>>> the problem, that I have a Set and not Set, but I
>>> didn't find a method builds a Set containing RichFeature objects on a
>>> RichSequence. Is there one?
>>>
>>>> The conversion from Feature to RichFeature does its best but is not
>>>> ideal. As you already have a RichSequence object to work with then you
>>>> would be better creating native RichFeature objects instead of doing
>>>> conversions.
>>>>
>>>> Richard Holland
>>>> Eagle Genomics Ltd
>>>> Sent from my HTC
>>>>
>>>
>> --
>> Richard Holland, BSc MBCS
>> Operations and Delivery Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>>
>

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From gwaldon at geneinfinity.org Thu Jun 30 15:24:23 2011
From: gwaldon at geneinfinity.org (George Waldon)
Date: Thu, 30 Jun 2011 10:24:23 -0500
Subject: [Biojava-l] make feature to create embl or genbank file
In-Reply-To: <4E0C3473.7070406@mikro.biologie.tu-muenchen.de>
References: <4E0B390A.5050005@mikro.biologie.tu-muenchen.de> <20110629103904.942543b411ok9xqw@gator1273.hostgator.com> <4E0C3473.7070406@mikro.biologie.tu-muenchen.de>
Message-ID: <20110630102423.185442z4cl0wn7eo@gator1273.hostgator.com>

No stupid question here, only bad answer. Hope this one is good:

http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Working_with_RichLocation_objects.

- George

Quoting Hedwig Kurka:

> Hi George,
>
> Thank you for your answer.
> I have some questions. Maybe very stupid, but I don't know how to get
> RichLocation objects.
> RichLocation loc = (RichLocation) new RangeLocation(start, stop);
> That doesn't work.

From khalil.elmazouari at gmail.com Thu Jun 30 17:59:43 2011
From: khalil.elmazouari at gmail.com (Khalil El Mazouari)
Date: Thu, 30 Jun 2011 19:59:43 +0200
Subject: [Biojava-l] Biojava-l Digest, Vol 101, Issue 14
In-Reply-To:
References:
Message-ID:

Hi Hedwig

try this:

    RichFeature richFeature = RichFeature.Tools.makeEmptyFeature();
    RichLocation richLocation = new SimpleRichLocation(
            new SimplePosition(start), new SimplePosition(end), rank,
            RichLocation.Strand.POSITIVE_STRAND);
    richFeature.setLocation(richLocation);
    richFeature.setType("misc_feat"); // or get it from RichObjectFactory.
    richSequence.getFeatureSet().add(richFeature);

Regards,

khalil
>
> ------------------------------
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
> End of Biojava-l Digest, Vol 101, Issue 14
> ******************************************

From shakunb at uom.ac.mu Thu Jun 30 19:20:41 2011
From: shakunb at uom.ac.mu (Shakuntala Baichoo)
Date: Thu, 30 Jun 2011 23:20:41 +0400
Subject: [Biojava-l] Help on NCBIQBlastService and BlastXMLQuery
Message-ID:

Hi!

I would be grateful if anybody could help me with NCBIQBlastService.

I need to BLAST a set (in this case only 2) of nucleotide sequences, and I am using BioJava3's NCBIQBlastService. I write the results to XML files and then try to parse each XML file to get all the results, in terms of % match, e-value, etc. But I am only getting the references of the sequences that have matched, as follows:

...........
trying to get BLAST results for RID 0TJFFD5E01S
Jun 30, 2011 11:10:03 PM org.biojava3.genome.query.BlastXMLQuery
INFO: Start read of 0TJFFD5E01SResults_XML.xml
Jun 30, 2011 11:10:03 PM org.biojava3.genome.query.BlastXMLQuery
INFO: Read finished
Jun 30, 2011 11:10:03 PM org.biojava3.genome.query.BlastXMLQuery getHitsQueryDef
INFO: Query for hits
Jun 30, 2011 11:10:03 PM org.biojava3.genome.query.BlastXMLQuery getHitsQueryDef
INFO: 1 hits
[CP002614, CP002487, FQ312003, CP001363, FN424405, CP000857, AE006468, AE017220, CP001138, CP001127, AM933172, AM933173, CP001144, FM200053, CP001120, CP001113, CP000886, CP000026, FR775193, AE014613, AL627266]
***********************************************
trying to get BLAST results for RID 0TJFHZV201S
Jun 30, 2011 11:10:27 PM org.biojava3.genome.query.BlastXMLQuery
INFO: Start read of 0TJFHZV201SResults_XML.xml
Jun 30, 2011 11:10:27 PM org.biojava3.genome.query.BlastXMLQuery
INFO: Read finished
Jun 30, 2011 11:10:27 PM org.biojava3.genome.query.BlastXMLQuery getHitsQueryDef
INFO: Query for hits
Jun 30, 2011 11:10:27 PM org.biojava3.genome.query.BlastXMLQuery getHitsQueryDef
INFO: 1 hits
[CP002614, CP002487, AP011957, FQ312003, CP001363, FN424405, AE006468, L19338, CP001113, CP000857, CP001138, AE017220, CP001120, CP000886, FR775195, AM933172, FM200053, AM933173, CP000026, CP001144, CP001127, AE014613, AL627267, M90677, CP000822]
BUILD SUCCESSFUL (total time: 54 seconds)

Note that when I open the generated XML file, it does contain all the results.
Any idea how to extract all the info? Please...

Here's the sample program:
--------------------------------
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package BlastPackage;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map.Entry;
import java.util.Set;
import org.biojava3.core.sequence.DNASequence;
import org.biojava3.genome.query.BlastXMLQuery;
import org.biojava3.core.sequence.ProteinSequence;
import org.biojava3.core.sequence.compound.AmbiguityDNACompoundSet;
import org.biojava3.core.sequence.compound.NucleotideCompound;
import org.biojava3.core.sequence.io.DNASequenceCreator;
import org.biojava3.core.sequence.io.FastaReader;
import org.biojava3.core.sequence.io.FastaReaderHelper;
import org.biojava3.core.sequence.io.GenericFastaHeaderParser;
import org.biojava3.ws.alignment.qblast.NCBIQBlastService;
import org.biojava3.ws.alignment.qblast.NCBIQBlastAlignmentProperties;
import org.biojava3.ws.alignment.qblast.NCBIQBlastOutputProperties;
import org.biojava3.ws.alignment.qblast.NCBIQBlastOutputFormat;
import org.biojava.bio.program.sax.*;
import org.biojava.bio.program.ssbind.*;
import org.biojava.bio.search.*;
import org.biojava.bio.seq.db.*;
import org.xml.sax.*;
import org.biojava.bio.*;

public class NCBIQBlastServiceTest {

    /**
     * The program takes only a string with a path to a sequence file
     *
     * For this example, I keep it simple with a single FASTA formatted file
     *
     */
    public static void main(String[] args) {

        NCBIQBlastService rbw;
        NCBIQBlastAlignmentProperties rqb;
        NCBIQBlastOutputProperties rof;
        InputStream is = null;
        ArrayList<String> rid = new ArrayList<String>();

        try {
            // Let's capture the sequences in a file...
            //LinkedHashMap<String, DNASequence> a = FastaReaderHelper.readFastaDNASequence(new File("TestBlast.fas"));
            FileInputStream inStream = new FileInputStream("TestBlast.fas");
            FastaReader<DNASequence, NucleotideCompound> fastaReader =
                    new FastaReader<DNASequence, NucleotideCompound>(
                            inStream,
                            new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
                            new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
            LinkedHashMap<String, DNASequence> b = fastaReader.process();

            /*
             * You would imagine that one would blast a bunch of sequences of
             * identical nature with identical parameters...
             */
            rbw = new NCBIQBlastService();
            rqb = new NCBIQBlastAlignmentProperties();
            rqb.NCBIQBlastAlignmentProperties();
            rqb.setBlastProgram("blastn");
            rqb.setBlastDatabase("nr");

            /*
             * First, let's send all the sequences to the QBlast service and
             * keep the RID for fetching the results at some later moments
             * (actually, in a few seconds :-))
             *
             * Using a data structure to keep track of all request IDs is a good
             * practice.
             *
             */
            for (Entry<String, DNASequence> entry : b.entrySet()) {
                System.out.println(entry.getValue().getOriginalHeader() + "\n");
                String s = entry.getValue().toString();
                //System.out.println("Query Sequence:");
                System.out.println(s);
                String request = rbw.sendAlignmentRequest(s, rqb);
                //request=rbw.
                rid.add(request);
            }

            /*
             * Let's check that our requests have been processed. If completed,
             * let's look at the alignments with my own selection of output and
             * alignment formats.
             */
            for (String aRid : rid) {
                System.out.println("***********************************************");
                System.out.println("trying to get BLAST results for RID " + aRid);
                boolean wasBlasted = false;
                while (!wasBlasted) {
                    wasBlasted = rbw.isReady(aRid, System.currentTimeMillis());
                }

                rof = new NCBIQBlastOutputProperties();
                rof.setOutputFormat(NCBIQBlastOutputFormat.XML);
                rof.setAlignmentOutputFormat(NCBIQBlastOutputFormat.TABULAR);
                rof.setDescriptionNumber(20);
                rof.setAlignmentNumber(20);
                //System.out.println("Output Options:"+"\n"+rof.getOutputOptions());

                is = rbw.getAlignmentResults(aRid, rof);
                BufferedReader br = new BufferedReader(new InputStreamReader(is));
                String line = null;
                String OutputFilename1 = aRid + "Results_XML.xml";
                FileOutputStream fp1 = null;
                fp1 = new FileOutputStream(OutputFilename1);
                while ((line = br.readLine()) != null) {
                    //System.out.println(line);
                    new PrintStream(fp1).println(line);
                }
                fp1.close();

                BlastHomologyHits BL = new BlastHomologyHits();
                BlastXMLQuery B = new BlastXMLQuery(OutputFilename1);
                LinkedHashMap<String, ArrayList<String>> hits = B.getHitsQueryDef(1E-100);
                //System.out.println(hits);
                //LinkedHashMap<String, ArrayList<String>> Homologyhits = BL.getMatches(new File(OutputFilename1), 1E-100);
                Collection<ArrayList<String>> c = hits.values();
                Iterator<ArrayList<String>> i = c.iterator();
                while (i.hasNext())
                    System.out.println(i.next());
            }
            is.close();
        }
        /*
         * What happens if the file can't be read
         */
        catch (IOException ioe) {
            ioe.printStackTrace();
        }
        /*
         * What happens if FastaReaderHelper hits a snag
         */
        catch (Exception bio) {
            bio.printStackTrace();
        }
    }
}
------------------------

Thanks
Shakuntala

Email Disclaimer: This email and all its contents are subject to the disclaimer at http://www.uom.ac.mu/emaildisclaimer
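
Judging from the log above, getHitsQueryDef returns only the hit identifiers, so one way to get at the e-value and % match is to read the saved XML file directly with the JDK's own DOM parser. What follows is a minimal sketch under stated assumptions, not BioJava code: the class `BlastXmlSummary`, the holder `HspRow` and the two sample hits are made up for illustration. The element names (`Hit`, `Hit_accession`, `Hsp`, `Hsp_evalue`, `Hsp_identity`, `Hsp_align-len`) are the ones used in NCBI's BLAST XML output, and percent identity is computed here as identities divided by alignment length.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class BlastXmlSummary {

    /** One row per HSP. Hypothetical holder type, not part of BioJava. */
    static final class HspRow {
        final String accession;
        final double evalue;
        final double percentIdentity;
        final int alignLength;

        HspRow(String accession, double evalue, double percentIdentity, int alignLength) {
            this.accession = accession;
            this.evalue = evalue;
            this.percentIdentity = percentIdentity;
            this.alignLength = alignLength;
        }

        @Override
        public String toString() {
            return accession + "\t" + evalue + "\t" + percentIdentity + "\t" + alignLength;
        }
    }

    /** Tiny made-up document in the BLAST XML layout, for demonstration only. */
    static final String SAMPLE =
        "<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>"
        + "<Hit><Hit_accession>SAMPLE_HIT_1</Hit_accession><Hit_hsps><Hsp>"
        + "<Hsp_evalue>1e-120</Hsp_evalue><Hsp_identity>95</Hsp_identity>"
        + "<Hsp_align-len>100</Hsp_align-len></Hsp></Hit_hsps></Hit>"
        + "<Hit><Hit_accession>SAMPLE_HIT_2</Hit_accession><Hit_hsps><Hsp>"
        + "<Hsp_evalue>3e-50</Hsp_evalue><Hsp_identity>180</Hsp_identity>"
        + "<Hsp_align-len>200</Hsp_align-len></Hsp></Hit_hsps></Hit>"
        + "</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>";

    /** Reads every Hit/Hsp pair from a BLAST XML stream. */
    static List<HspRow> parse(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            try {
                // Real NCBI files carry a DOCTYPE; this Xerces feature (assumed to be
                // supported by the JDK's default parser) avoids downloading the DTD.
                factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            } catch (ParserConfigurationException ignored) {
                // The parser does not know this feature; carry on with its defaults.
            }
            Document doc = factory.newDocumentBuilder().parse(in);
            List<HspRow> rows = new ArrayList<HspRow>();
            NodeList hits = doc.getElementsByTagName("Hit");
            for (int h = 0; h < hits.getLength(); h++) {
                Element hit = (Element) hits.item(h);
                String accession = text(hit, "Hit_accession");
                NodeList hsps = hit.getElementsByTagName("Hsp");
                for (int p = 0; p < hsps.getLength(); p++) {
                    Element hsp = (Element) hsps.item(p);
                    double evalue = Double.parseDouble(text(hsp, "Hsp_evalue"));
                    int identity = Integer.parseInt(text(hsp, "Hsp_identity"));
                    int alignLength = Integer.parseInt(text(hsp, "Hsp_align-len"));
                    rows.add(new HspRow(accession, evalue,
                                        100.0 * identity / alignLength, alignLength));
                }
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read BLAST XML", e);
        }
    }

    /** Text of the first descendant element with the given tag name. */
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
        for (HspRow row : parse(in)) {
            System.out.println(row);
        }
    }
}
```

To use it on the files written by the program above, one would pass `new FileInputStream(OutputFilename1)` to `parse` instead of the in-memory sample. Reading one row per HSP rather than per hit keeps every alignment segment visible, since a single hit can have several HSPs with different e-values.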