From florent.angly at gmail.com  Thu Nov  1 01:49:13 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Thu, 01 Nov 2012 15:49:13 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
Message-ID: <50920D59.4010307@gmail.com>

Hi all,

I was working with Ben Woodcroft on identifying ways to speed up
Grinder, which relies heavily on Bioperl. Ben did some profiling with
NYTProf and we realized that a lot of computation time was spent in
Bio::PrimarySeq, in calls to subseq() and length(). The sequences we
used for the profiling were microbial genomes, i.e. sequences several
Mbp long. A lot of the performance cost was associated with passing
full genomes between functions. For example, a call to length()
requests the full sequence from seq(), which returns a copy of it to
length(). So, every call to length() is very expensive for long
sequences, and a lot of code calls length() for error checking.

I know that there are a few Bioperl modules that are better adapted to
handling very long sequences, e.g. Bio::DB::Fasta or
Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at
Bio::PrimarySeq with Ben and I released this commit:
https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5.
But in fact, there were more things that I wanted to try to improve,
which led me to start this new branch:
https://github.com/bioperl/bioperl-live/tree/seqlength

I wrote quite a few tests for functionality that was not previously
covered by tests, and tried to improve the documentation. In addition,
to address the speed issue, I made some changes to Bio::PrimarySeq and
Bio::PrimarySeqI:

* The length of a sequence is now computed as soon as the sequence is
  set, not after. This way, there is no extra call to seq() (which
  would incur the cost of copying the entire sequence between
  functions).
* The length is saved as an object attribute. So, calling length() is
  very cheap since it only needs to retrieve the stored value for the
  length.
* There is a constructor argument called -direct, which skips sequence
  validation. However, it was only active in conjunction with the
  -ref_to_seq constructor argument. To make -direct conform better to
  its documented purpose, I made -direct work when a sequence is set
  through -seq as well.
* This brings us to trunc(), revcom() and other methods of
  Bio::PrimarySeqI. Since all these methods create a new
  Bio::PrimarySeq object from an existing (already validated!)
  Bio::PrimarySeq object, the new object can be constructed with
  -direct, to save some time.
* Finally, I noticed that subseq() used calls to eval() to do its
  work. eval() is notoriously slow and these calls were easily
  replaced by simple calls to substr() to save some time.

A real-world test I performed with Grinder took 3m28s before the
changes (of which ~1 min is spent doing something unrelated). After the
changes, the same test took only 2m28s. So, it's quite a significant
improvement, and on more specific test cases the performance gains can
be much bigger. Also, I anticipate that the gains would be bigger for
even longer sequences.

All the changes I made are meant to be backward compatible and all the
tests in the Bioperl test suite passed. So, there _should_ not be any
issues. However, I know that Bio::PrimarySeq is a central module of
Bioperl, so please have a look at it and let me know if there are any
glaring errors.

Thanks,

Florent

From shalabh.sharma7 at gmail.com  Thu Nov  1 15:36:35 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Thu, 1 Nov 2012 15:36:35 -0400
Subject: [Bioperl-l] blast question
Message-ID:

Hi All,
First of all, I am really very sorry for posting a blast question in
this forum; I am not sure if this is the right place.
I would really appreciate it if anyone can guide me in the right
direction.

I am using blastall to get a top hit from a database, so I am using
-v 1 -b 1 (I hope this is right).
But the strange part is that I am getting wrong results.

For example: if I use -v 1 -b 1, then for one of the hits I am getting
this:

Sequences producing significant alignments:                       (bits) Value

fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA         38   4e-04

If I use -v 3 -b 3, then I am getting this for the same query:

Sequences producing significant alignments:                       (bits) Value

fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...  570   e-167
fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA         38   9e-07
fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...   18   1.0

As you can see, the top hit in the first case is totally wrong.

I would really appreciate it if someone can help me out, or direct me
in the right direction.

Thanks
Shalabh

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From cjfields at illinois.edu  Thu Nov  1 17:41:43 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Nov 2012 21:41:43 +0000
Subject: [Bioperl-l] blast question
In-Reply-To:
References:
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>

That's a scary error, but the best place to submit this would be the
BLAST help list at NCBI (cc'd)

chris

From shalabh.sharma7 at gmail.com  Fri Nov  2 10:50:17 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Fri, 2 Nov 2012 10:50:17 -0400
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
Message-ID:

I know, I am really worried about my past analysis now.
Thanks a lot for cc'ing this mail, Chris.

-Shalabh

From Scott.Markel at accelrys.com  Fri Nov  2 20:13:59 2012
From: Scott.Markel at accelrys.com (Scott Markel)
Date: Fri, 2 Nov 2012 17:13:59 -0700
Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB file format specification change
Message-ID: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>

In tracking down a regression test failure we discovered that the
Structure::IO::pdb module is out of date relative to the PDB file
format specification (http://www.wwpdb.org/docs.html). PDB now writes
out to column 79, while pdb.pm is still using the old line length of
71. Our regression failure was caused by a reformatting of 1CRN;
journal titles started getting truncated.

Some of the Perl lines are really simple, e.g.,

    $keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);

with others being just a little more detailed, e.g.,

    my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;

It doesn't look like pdb.pm has changed in about 1.5 years. Is there a
current module owner? Or someone else working on this?

If not, we're willing to either compile a list of needed changes
(walking through the PDB file format specification and comparing the
corresponding column indices in pdb.pm) or provide a new version of
the entire file. Please let us know which is preferred.

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications and Communications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics

From cjfields at illinois.edu  Fri Nov  2 22:08:52 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 3 Nov 2012 02:08:52 +0000
Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB file format specification change
In-Reply-To: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>
References: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC817F8@CHIMBX5.ad.uillinois.edu>

On Nov 2, 2012, at 7:13 PM, Scott Markel wrote:

> In tracking down a regression test failure we discovered that the
> Structure::IO::pdb module is out of date relative to the PDB file
> format specification (http://www.wwpdb.org/docs.html). PDB now writes
> out to column 79, while pdb.pm is still using the old line length of
> 71. Our regression failure was caused by a reformatting of 1CRN;
> journal titles started getting truncated.
>
> Some of the Perl lines are really simple, e.g.,
>
>     $keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);
>
> with others being just a little more detailed, e.g.,
>
>     my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;
>
> It doesn't look like pdb.pm has changed in about 1.5 years. Is there
> a current module owner? Or someone else working on this?

No one has really taken ownership, so as far as I'm concerned it's
open. Any objections?
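
To make the truncation concrete, here is a minimal sketch of what the
unpack template quoted above does to a record whose text runs to
column 79. The record itself is made up for illustration, and the
widened template (A60) is only an assumption about what an updated
pdb.pm might use, based on the "column 79" figure given in the thread;
the real column ranges should be checked against the PDB file format
specification.

```perl
use strict;
use warnings;

# Made-up JRNL-style record: record name in columns 1-6, sub-record in
# columns 13-16, continuation in 17-18, free text in columns 20-79.
my $text = ('TITLE WORD ' x 5) . 'TAIL!';    # 55 + 5 = 60 characters
my $line = sprintf '%-6s%-6s%-4s%-2s %-60s', 'JRNL', '', 'TITL', '', $text;

# Template quoted in the thread: 6+6+4+2+1+51 = 70, so the last field
# stops at column 70.
my @old = unpack 'A6 x6 A4 A2 x1 A51', $line;

# Same template with the last field widened by 9 columns (assumption:
# the text field now ends at column 79).
my @new = unpack 'A6 x6 A4 A2 x1 A60', $line;

printf "old template: %d characters of text\n", length($old[3]);
printf "new template: %d characters of text\n", length($new[3]);
```

With the old template the trailing "TAIL!" is lost, which is the same
kind of truncation as the one reported for the 1CRN journal titles.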

> If not, we're willing to either compile a list of needed changes
> (walking through the PDB file format specification and comparing the
> corresponding column indices in pdb.pm) or provide a new version of
> the entire file. Please let us know which is preferred.

A new version of the file is fine if you have someone who can work on
it. We would also like to change relevant tests and documentation if
there is time.

Thanks Scott!

chris

From Russell.Smithies at agresearch.co.nz  Sun Nov  4 16:00:37 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 5 Nov 2012 10:00:37 +1300
Subject: [Bioperl-l] blast question
In-Reply-To:
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>

What version of blast are you using?
There have been quite a few bug fixes, and I suspect any responses from
NCBI will suggest upgrading to the current version of blast+.

--Russell

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

From cjfields at illinois.edu  Sun Nov  4 17:13:37 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 4 Nov 2012 22:13:37 +0000
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>

That in fact is the recommendation (migrate to BLAST+).
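
For anyone making that move, the blastall options used in this thread
have BLAST+ counterparts: -v and -b correspond to -num_descriptions
and -num_alignments, and the program given with -p becomes the name of
the executable. The sketch below is only an illustration of that
correspondence; the helper to_blast_plus() and the file and database
names are hypothetical, and the table covers just a handful of
options. BLAST+ also ships a legacy_blast.pl script intended for this
kind of conversion.

```perl
use strict;
use warnings;

# Partial, illustrative mapping from blastall options to BLAST+ options.
my %plus_option = (
    '-d' => '-db',
    '-i' => '-query',
    '-o' => '-out',
    '-e' => '-evalue',
    '-v' => '-num_descriptions',
    '-b' => '-num_alignments',
);

# Hypothetical helper: turn a blastall argument list (option/value
# pairs) into the equivalent BLAST+ argument list.
sub to_blast_plus {
    my @args = @_;
    my ($program, @out);
    while (@args) {
        my $opt = shift @args;
        my $val = shift @args;
        if ($opt eq '-p') {    # the program becomes the executable name
            $program = $val;
            next;
        }
        die "no mapping for $opt in this sketch\n"
            unless exists $plus_option{$opt};
        push @out, $plus_option{$opt}, $val;
    }
    die "missing -p\n" unless defined $program;
    return ($program, @out);
}

# Placeholder database and query names.
my @cmd = to_blast_plus(qw(-p blastp -d mydb -i query.faa -v 1 -b 1));
print "@cmd\n";
```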

chris

From florent.angly at gmail.com  Sun Nov  4 19:46:44 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 05 Nov 2012 10:46:44 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50920D59.4010307@gmail.com>
References: <50920D59.4010307@gmail.com>
Message-ID: <50970C74.7070605@gmail.com>

I am planning on merging the branch with master this week.
Best,
Florent

From cjfields at illinois.edu  Sun Nov  4 21:43:28 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 5 Nov 2012 02:43:28 +0000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50970C74.7070605@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>

Florent,

Ran tests on it; they pass, but I am seeing this (if these are
expected, you can catch the warnings using Test::Warn):

[cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t
t/Seq/PrimarySeq.t .. 1/167
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
---------------------------------------------------
t/Seq/PrimarySeq.t .. ok
All tests successful.
Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
Result: PASS

chris

From shalabh.sharma7 at gmail.com  Mon Nov  5 12:03:38 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 12:03:38 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
Message-ID:

Hi All,
thanks for all your responses. Currently I am using an old version of
blastall, 2.2.22.
@Peter: I will update my blast and see if the problem still exists.
But I can't restrict my blast by e-value because I work on
environmental samples. I have to reduce the size of my blast files, as
I am only interested in the top hit and my data sets are really huge.

Thanks
Shalabh
> > ======================================================================= > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From Russell.Smithies at agresearch.co.nz Mon Nov 5 16:04:07 2012 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Tue, 6 Nov 2012 10:04:07 +1300 Subject: [Bioperl-l] blast question In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu> Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz> If you're using an older version of blast there was a bug where not all results were returned - I think the limit was 10,000 hits? Not usually a problem running basic queries but a big problem for environmental or metagenomic samples, or when aligning short reads. --Russell From: shalabh sharma [mailto:shalabh.sharma7 at gmail.com] Sent: Tuesday, 6 November 2012 6:04 a.m. To: Fields, Christopher J Cc: Smithies, Russell; bioperl-l Subject: Re: [Bioperl-l] blast question Hi All, thanks for all your responses. Currently i am using the old version of blastall 2.2.22. @Peter: I will update my blast and will see if the problem still exist. But i can't restrict my blast with e value because i work on environmental samples , i have to reduce the size of my blast files as i am only interested in the top hit and my data sets are really huge. Thanks Shalabh On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J > wrote: That in fact is the recommendation (migrate to BLAST+). 
> =======================================================================
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

From shalabh.sharma7 at gmail.com Mon Nov 5 16:09:03 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 16:09:03 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
 <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
Message-ID:

Hi All,
Thanks for all the suggestion. The problem is fixed by using latest blast+ .

Thanks
Shalabh

On Mon, Nov 5, 2012 at 4:04 PM, Smithies, Russell <
Russell.Smithies at agresearch.co.nz> wrote:

> If you're using an older version of blast there was a bug where not all
> results were returned - I think the limit was 10,000 hits?
>
> Not usually a problem running basic queries but a big problem for
> environmental or metagenomic samples, or when aligning short reads.
>
> --Russell
>
> From: shalabh sharma [mailto:shalabh.sharma7 at gmail.com]
> Sent: Tuesday, 6 November 2012 6:04 a.m.
> To: Fields, Christopher J
> Cc: Smithies, Russell; bioperl-l
> Subject: Re: [Bioperl-l] blast question
>
> Hi All,
> thanks for all your responses.
>
> Currently i am using the old version of blastall 2.2.22.
>
> @Peter: I will update my blast and will see if the problem still exist.
> But i can't restrict my blast with e value because i work on environmental
> samples , i have to reduce the size of my blast files as i am only
> interested in the top hit and my data sets are really huge.
>
> Thanks
> Shalabh
>
> On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J <
> cjfields at illinois.edu> wrote:
>
> That in fact is the recommendation (migrate to BLAST+).
>
> chris
>
> On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" <
> Russell.Smithies at agresearch.co.nz> wrote:
>
> > What version of blast are you using?
> > There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+
> >
> >
> > --Russell
> >
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma
> > Sent: Saturday, 3 November 2012 3:50 a.m.
> > To: Fields, Christopher J
> > Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov
> > Subject: Re: [Bioperl-l] blast question
> >
> > I know, i am really worried about my past analysis now.
> > Thanks a lot for cc'ing this mail Chris.
> > ======================================================================= > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l**** > > > > **** > > ** ** > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences > University of Georgia > Athens, GA 30602-3636**** > > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. > ======================================================================= > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From florent.angly at gmail.com Tue Nov 6 06:06:56 2012 From: florent.angly at gmail.com (Florent Angly) Date: Tue, 06 Nov 2012 21:06:56 +1000 Subject: [Bioperl-l] Bio::PrimarySeq speedup In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu> References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com> <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu> Message-ID: <5098EF50.5040208@gmail.com> Yes, good idea, Chris. Actually, thinking about it, most of these warnings were redundant. So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that it issues exceptions if requested. 
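For readers following the thread, the "exceptions if requested" idea can be pictured with a small stand-alone Perl sketch. This is not the BioPerl code: the function name validate_seq_str(), its optional second argument and the set of allowed characters are all assumptions made for illustration only. The point is simply that a validator can stay quiet and return false by default, and die only when the caller asks for it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a sequence validator; NOT the BioPerl code.
# Assumption: letters plus a few gap/stop symbols count as valid.
sub validate_seq_str {
    my ($str, $throw) = @_;
    $str = '' unless defined $str;
    # Collect every character that falls outside the allowed set.
    my @bad = $str =~ /([^A-Za-z\-\.\*\?=~])/g;
    return 1 unless @bad;
    my $msg = "sequence doesn't validate, mismatch is " . join(',', @bad);
    die "$msg\n" if $throw;   # exception only when the caller asks for it
    return 0;                 # otherwise stay quiet and report failure
}

# Quiet check: the caller decides what to do with the result.
print "valid\n"   if validate_seq_str('ACGT-N');
print "invalid\n" unless validate_seq_str('ACGT!');

# Strict check: trap the exception with eval.
my $ok = eval { validate_seq_str('AC$GT', 1) };
print "caught: $@" unless $ok;
```

With this shape, test scripts that feed in deliberately bad sequences do not emit warnings unless they opt in, which is the kind of noise reduction discussed above.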
Florent On 05/11/12 12:43, Fields, Christopher J wrote: > Florent, > > Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn): > > [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t > t/Seq/PrimarySeq.t .. 1/167 > --------------------- WARNING --------------------- > MSG: Got a sequence without letters. Could not guess alphabet > --------------------------------------------------- > > --------------------- WARNING --------------------- > MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is ! > --------------------------------------------------- > > --------------------- WARNING --------------------- > MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $ > --------------------------------------------------- > > --------------------- WARNING --------------------- > MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is & > --------------------------------------------------- > > --------------------- WARNING --------------------- > MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+ > --------------------------------------------------- > > --------------------- WARNING --------------------- > MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/ > --------------------------------------------------- > t/Seq/PrimarySeq.t .. ok > All tests successful. > Files=1, Tests=167, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.18 cusr 0.01 csys = 0.23 CPU) > Result: PASS > > chris > > On Nov 4, 2012, at 6:46 PM, Florent Angly > wrote: > >> I am planning on merging the branch with master this week. >> Best, >> Florent >> >> >> On 01/11/12 15:49, Florent Angly wrote: >>> Hi all, >>> >>> I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on Bioperl. 
Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were microbial genomes, i.e. several Mbp long sequences, which is quite long. A lot of the performance cost was associated with passing full genomes between functions. For example, when doing a call to length(), length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So, every call to length is very expensive for long sequences. And there is a lot of code that calls length(), for error checking. >>> >>> I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g. Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength >>> >>> I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to improve the documentation. In addition, to address the speed issue, I did some changes to Bio::PrimarySeq and Bio::PrimarySeqI : >>> ? The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no extra call to seq() (which would incur the cost of copying the entire sequence between functions). >>> ? The length is saved as an object attribute. So, calling length() is very cheap since it only needs to retrieve the stored value for the length. >>> ? There is a constructor called -direct, which skips sequence validation. However, it was only active in conjunction with the -ref_to_seq constructor. 
To make -direct conform better to its documented purpose, I made it -direct work when a sequence is set through -seq as well.
>>> * This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq object from an existing (already validated!) Bio::PrimarySeq object, the new object can be constructed with the -direct constructor, to save some time.
>>> * Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and these calls were easily replaced by simple calls to substr() to save some time.
>>>
>>> A real-world test I performed with Grinder took 3m28s before the changes (and ~1 min is spent doing something unrelated). After the changes, the same test took only 2min28s. So, it's quite a significant improvement and on more specific test cases, performance gains can obviously be much bigger. Also, I anticipate that the gains would be bigger for even longer sequences.
>>>
>>> All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of Bioperl, so please, have a look at it and let me know if there are any glaring errors.
>>>
>>> Thanks,
>>>
>>> Florent
>>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From shlomif at shlomifish.org Tue Nov 6 07:27:00 2012
From: shlomif at shlomifish.org (Shlomi Fish)
Date: Tue, 6 Nov 2012 14:27:00 +0200
Subject: [Bioperl-l] [Request] Please Help Add Some Information about Perl for Bio-Informatics to http://perl-begin.org/uses/bio-info/
In-Reply-To: <20121026192203.6d1e59c0@lap.shlomifish.org>
References: <20121026192203.6d1e59c0@lap.shlomifish.org>
Message-ID: <20121106142700.192f456e@lap.shlomifish.org>

Hi,

Can anyone help with that?

Regards,

Shlomi Fish

On Fri, 26 Oct 2012 19:22:03 +0200 Shlomi Fish wrote:

> Hi all,
>
> I am the maintainer of http://perl-begin.org/ , the Perl Beginners' Site. I
> had this page there for a long time, but it's empty:
>
> http://perl-begin.org/uses/bio-info/
>
> Can someone help me add some information there? A short XHTML page will be OK.
> For reference, see the other pages in the section
> ( http://perl-begin.org/uses/ ) such as:
>
> * http://perl-begin.org/uses/web/
>
> * http://perl-begin.org/uses/sys-admin/
>
> * http://perl-begin.org/uses/qa/
>
> Note that you agree that the content will be licensed under the Creative
> Commons Attribution 3.0 Unported License (or higher versions) and so you
> should make sure it is original.
>
> I shall be obliged for any help.
>
> Regards,
>
> Shlomi Fish
>

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
Perl Humour - http://perl-begin.org/humour/

A wiseman can learn from a fool much more than a fool can ever learn from a
wiseman.
    -- http://en.wikiquote.org/wiki/Cato_the_Elder

Please reply to list if it's a mailing list post - http://shlom.in/reply .

From florent.angly at gmail.com Thu Nov 15 11:29:30 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Fri, 16 Nov 2012 02:29:30 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <5098EF50.5040208@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
 <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
 <5098EF50.5040208@gmail.com>
Message-ID: <50A5186A.4060304@gmail.com>

I now merged the branch with master.
Best,
Florent

On 06/11/12 21:06, Florent Angly wrote:
> Yes, good idea, Chris.
>
> Actually, thinking about it, most of these warnings were redundant.
> So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that
> it issues exceptions if requested.
> > Florent > > > On 05/11/12 12:43, Fields, Christopher J wrote: >> Florent, >> >> Ran tests on it, they pass but I am seeing this (if these are >> expected, you can catch the warnings using Test::Warn): >> >> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr >> t/Seq/PrimarySeq.t >> t/Seq/PrimarySeq.t .. 1/167 >> --------------------- WARNING --------------------- >> MSG: Got a sequence without letters. Could not guess alphabet >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is ! >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $ >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is & >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is >> \,$,+ >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/ >> --------------------------------------------------- >> t/Seq/PrimarySeq.t .. ok >> All tests successful. >> Files=1, Tests=167, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.18 >> cusr 0.01 csys = 0.23 CPU) >> Result: PASS >> >> chris >> >> On Nov 4, 2012, at 6:46 PM, Florent Angly >> wrote: >> >>> I am planning on merging the branch with master this week. >>> Best, >>> Florent >>> >>> >>> On 01/11/12 15:49, Florent Angly wrote: >>>> Hi all, >>>> >>>> I was working with Ben Woodcroft on identifying ways to speed up >>>> Grinder, which relies heavily on Bioperl. 
Ben did some profiling >>>> with NYTProf and we realized that a lot of computation time was >>>> spent in Bio::PrimarySeq, doing calls to subseq() and length(). The >>>> sequences we used for the profiling were microbial genomes, i.e. >>>> several Mbp long sequences, which is quite long. A lot of the >>>> performance cost was associated with passing full genomes between >>>> functions. For example, when doing a call to length(), length() >>>> requests the full sequence from seq(), which returns it back to >>>> length() (it makes a copy!). So, every call to length is very >>>> expensive for long sequences. And there is a lot of code that calls >>>> length(), for error checking. >>>> >>>> I know that there are a few Bioperl modules that are more adapted >>>> to handling very long sequences, e.g. Bio::DB::Fasta or >>>> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look >>>> at Bio::PrimarySeq with Ben and I released this commit: >>>> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. >>>> But in fact, there were more things that I wanted to try to >>>> improve, which led me to start this new branch: >>>> https://github.com/bioperl/bioperl-live/tree/seqlength >>>> >>>> I wrote quite a few tests for functionalities that were not >>>> previously covered by tests, and tried to improve the >>>> documentation. In addition, to address the speed issue, I did some >>>> changes to Bio::PrimarySeq and Bio::PrimarySeqI : >>>> ? The length of a sequence is now computed as soon as the sequence >>>> is set, not after. This way, there is no extra call to seq() (which >>>> would incur the cost of copying the entire sequence between >>>> functions). >>>> ? The length is saved as an object attribute. So, calling length() >>>> is very cheap since it only needs to retrieve the stored value for >>>> the length. >>>> ? There is a constructor called -direct, which skips sequence >>>> validation. 
However, it was only active in conjunction with the >>>> -ref_to_seq constructor. To make -direct conform better to its >>>> documented purpose, I made it -direct work when a sequence is set >>>> through -seq as well. >>>> ? This brings us to trunc(), revcom() and other methods of >>>> Bio::PrimarySeqI. Since all these methods create a new >>>> Bio::PrimarySeq object from an existing (already validated!) >>>> Bio::PrimarySeq object, the new object can be constructed with the >>>> -direct constructor, to save some time. >>>> ? Finally, I noticed that subseq() used calls to eval() to do its >>>> work. eval() is notoriously slow and these calls were easily >>>> replaced by simple calls to substr() to save some time. >>>> >>>> A real-world test I performed with Grinder took 3m28s before the >>>> changes (and ~1 min is spent doing something unrelated). After the >>>> changes, the same test took only 2min28s. So, it's quite a >>>> significant improvement and on more specific test cases, >>>> performance gains can obviously be much bigger. Also, I anticipate >>>> that the gains would be bigger for even longer sequences. >>>> >>>> All the changes I made are meant to be backward compatible and all >>>> the tests in the Bioperl test suite passed. So, there _should_ not >>>> be any issues. However, I know that Bio::PrimarySeq is a central >>>> module of Bioperl, so please, have a look at it and let me know if >>>> there are any glaring errors. 
>>>>
>>>> Thanks,
>>>>
>>>> Florent
>>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From mahakadry at aucegypt.edu Tue Nov 20 13:44:53 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Tue, 20 Nov 2012 20:44:53 +0200
Subject: [Bioperl-l] Parsing a blast report with multiple queries into separate one query files that only contain the fasta sequences
Message-ID:

Dear BioPerl list,
I blasted a file that has several fasta queries against nr, however I need to align each query with its hits for further computational analysis so I need to parse the produced blast report into several files that each has only the fasta query sequence and its hits in fasta format.
I found this script online,

    use Bio::Search::Result::BlastResult;
    use Bio::SearchIO;

    my $report = Bio::SearchIO->new( -file=>'full-report.bls', -format => blast);
    my $result = $report->next_result;
    my %hits_by_query;
    while (my $hit = $result->next_hit) {
        push @{$hits_by_query{$hit->name}}, $hit;
    }

    foreach my $qid ( keys %hits_by_query ) {
        my $result = Bio::Search::Result::BlastResult->new();
        $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
        my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format =>'blast' );
        $blio->write_result($result);
    }

however on using it this produced the following error message

    BlastResult::new(): Not adding iterations.

    ------------- EXCEPTION: Bio::Root::NoSuchThing -------------
    MSG: No such iteration number: 0. Valid range=1-0
    VALUE: The number zero (0)
    STACK: Error::throw
    STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
    STACK: Bio::Search::Result::BlastResult::iteration /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
    STACK: Bio::Search::Result::BlastResult::add_hit /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
    STACK: ./parsing.blast.results.into.per.query.files.pl:15

I tried to search for other scripts but I couldn't find any
I would really appreciate your comments to this
Thank you

From cjfields at illinois.edu Tue Nov 20 14:21:25 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Nov 2012 19:21:25 +0000
Subject: [Bioperl-l] Parsing a blast report with multiple queries into separate one query files that only contain the fasta sequences
In-Reply-To:
References:
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF22E15@CITESMBX5.ad.uillinois.edu>

Maha,

Do you need only the sequence reported in the report (e.g. the HSP alignments) or the original FASTA sequences? The former can be recovered from the Bio::Search::HSP::GenericHSP objects as an alignment, and this can be redirected to a FASTA file. The latter is a little trickier, as you will have to retrieve the sequences from their original source files.

chris

On Nov 20, 2012, at 12:44 PM, maha ahmed wrote:

> Dear BioPerl list,
> I blasted a file that has several fasta queries against nr, however I need
> to align each query with its hits for further computational analysis so I
> need to parse the produced blast report into several files that each has
> only the fasta query sequence and its hits in fasta format.
> I found this script online,
>
>     use Bio::Search::Result::BlastResult;
>     use Bio::SearchIO;
>
>     my $report = Bio::SearchIO->new( -file=>'full-report.bls', -format => blast);
>     my $result = $report->next_result;
>     my %hits_by_query;
>     while (my $hit = $result->next_hit) {
>         push @{$hits_by_query{$hit->name}}, $hit;
>     }
>
>     foreach my $qid ( keys %hits_by_query ) {
>         my $result = Bio::Search::Result::BlastResult->new();
>         $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
>         my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format =>'blast' );
>         $blio->write_result($result);
>     }
>
> however on using it this produced the following error message
>
>     BlastResult::new(): Not adding iterations.
>
>     ------------- EXCEPTION: Bio::Root::NoSuchThing -------------
>     MSG: No such iteration number: 0. Valid range=1-0
>     VALUE: The number zero (0)
>     STACK: Error::throw
>     STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
>     STACK: Bio::Search::Result::BlastResult::iteration /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
>     STACK: Bio::Search::Result::BlastResult::add_hit /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
>     STACK: ./parsing.blast.results.into.per.query.files.pl:15
>
> I tried to search for other scripts but I couldn't find any
> I would really appreciate your comments to this
> Thank you
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From rfhorns at gmail.com Thu Nov 1 20:01:34 2012
From: rfhorns at gmail.com (Felix Horns)
Date: Fri, 02 Nov 2012 00:01:34 -0000
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
Message-ID:

Hello everyone.

I am having trouble using the get_Stream_by_query() function in Bio::DB::GenBank. It seems to return an empty stream, such that $stream->next_seq never returns anything.
However, $query->count is returning the expected value (139). Also, get_Stream_by_query() seems to be querying the database, as when I pass it an array of GeneIDs that have not been properly formatted, i.e. GeneID:7816864, instead of simply 7816864, it returns warnings and errors: "MSG: Warning(s) from GenBank: GeneID 7817709...; MSG: Error from Genbank: No items found.". I have included my full code below. I have also included the output from the code below that. The code is intended to find genes located within a genomic region. I will later find the protein domains and pathways that those genes are involved in. Any help would be greatly appreciated. I realize that this is probably a very simple question, but I am relatively new to BioPerl and I've spent the better part of the day trying to figure out such issues, so I would be very thankful for help. Felix #!/usr/bin/perl use strict; use Bio::SeqIO; use Bio::DB::EntrezGene; use Bio::DB::GenBank; # Load reference sequence # Load from local .gb file # Note that .gb file does not include sequences # my $gbfile = "NC_012660.1.gb"; # my $seqio = Bio::SeqIO->new(-file => $gbfile); # my $ref_seq = $seqio->next_seq; # To access reference sequence programatically, uncomment this code my $gb = new Bio::DB::GenBank; my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1"); # Specify coordinates of gap my $gap_start = 2050506; my $gap_end = 2190530; my $gene_count = 0; my @features; my @starts; my @ends; my @db_xrefs; my @products; my @protein_ids; # Get gene features in gap for my $feat ($ref_seq->get_SeqFeatures) { my $start=$feat->location->start; my $end=$feat->location->end; if (($feat->primary_tag eq 'gene') & ($gap_start < $start) & ($start < $gap_end) & ($gap_start < $end) & ($end < $gap_end)) { $gene_count += 1; # Get GeneID reference my $db_xref = ($feat->get_tag_values('db_xref'))[0]; $db_xref =~ s/GeneID://; # Trim "GeneID:" from start of $db_xref push @features, $feat; push @starts, $start; push @ends, $end; push 
@db_xrefs, $db_xref; } } # Get data about gene features from GeneID reference my $query = Bio::DB::Query::GenBank->new(-db => 'gene', -ids => [@db_xrefs]); my $stream = $gb->get_Stream_by_query($query); while (my $seq = $stream->next_seq) { for my $feat ($seq->all_SeqFeatures) { print "primary tag: ", $feat->primary_tag, "\n"; for my $tag ($feat->get_all_tags) { print " tag: ", $tag, "\n"; for my $value ($feat->get_tag_values($tag)) { print " value: ", $value, "\n"; } } } } print $query->count,"\n"; print $gene_count, "\n"; OUTPUT > perl analyze_gap.pl 139 139 Note that no "primary tag; tag; value" items are printed. Furthermore, when I put a print line immediately after the (while (my $seq = $stream->next_seq)) statement, it was never called, seemingly indicating that the stream is empty. From mooldhu at gmail.com Tue Nov 6 02:38:57 2012 From: mooldhu at gmail.com (=?GB2312?B?uvq9rQ==?=) Date: Tue, 6 Nov 2012 15:38:57 +0800 Subject: [Bioperl-l] Ask for help about Bioperl Message-ID: hi, when I use bioperl ,it report errors like this :--------------------- WARNING --------------------- MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,:: --------------------------------------------------- Error providing evidence type: GeneModel The error was: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Attempting to set the sequence '1' to [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy STACK: Error::throw STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486 STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285 STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239 STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383 but,I am sure that the input file only cotain [ATGCN],I also try to use another sequences ,but the errors are the same.my bioperl is Bioperl-live 1.006902; -- ???? 
From assayagy at gmail.com Sat Nov 10 13:27:03 2012 From: assayagy at gmail.com (eyla4ever) Date: Sat, 10 Nov 2012 10:27:03 -0800 (PST) Subject: [Bioperl-l] Extracting sequences from Genbank files In-Reply-To: References: Message-ID: <34664632.post@talk.nabble.com> hello Brian i wuold like you to send me your script, i think it can help me to solve a big problem and help me to finish my final project. i hope it will be posible regards Eyla BForde wrote: > > Hello, > > I have been modifying a script which extracts all the protein sequences > from a genbank file and saves them in a multi-fasta file. > > I wish the fasta header to have both the locus_tag of the protein and the > product. However I cannot get the product tag to write to the fasta > header > > this is the relevant section of the script > > $s->display_id($f->has_tag('locus_tag') ? join(',',sort > $f->each_tag_value('locus_tag')) : > $f->has_tag('product') ? > join(',',$f->each_tag_value('product')): > $s->display_id); > > is "product" not an actual tag > > regards > > Brian > > > > -- > Brian Forde > Microbiology Dept. > Bioscience Institute. Room 4.11 > University College Cork > Cork > Ireland > tel:+353 21 4901306 > email: b.m.forde at umail.ucc.ie > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- View this message in context: http://old.nabble.com/Extracting-sequences-from-Genbank-files-tp33901023p34664632.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
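On the question quoted above: in the chained ?: expression, 'product' is only reached when the feature has no 'locus_tag', so the two tags can never end up in the header together. A minimal sketch of one way to get both follows. It assumes the tag values have already been pulled out of the feature into a plain hash; fasta_header_parts() is a hypothetical helper for illustration, not a BioPerl function. With real objects the values would presumably come from $f->get_tag_values(...) and be stored with $s->display_id(...) and $s->desc(...).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): given a hash of
# tag name => array ref of values, return the FASTA ID and description.
sub fasta_header_parts {
    my ($tags, $fallback_id) = @_;
    my @locus   = sort @{ $tags->{locus_tag} || [] };
    my @product = @{ $tags->{product} || [] };

    # ID: the locus tag(s) if present, otherwise the fallback.
    my $id = @locus ? join(',', @locus) : $fallback_id;
    # Description: the product(s), independent of whether an ID was found.
    my $desc = join(',', @product);
    return ($id, $desc);
}

# Made-up tag values, for illustration only.
my ($id, $desc) = fasta_header_parts(
    { locus_tag => ['ABC_0001'], product => ['hypothetical protein'] },
    'unknown',
);
print ">$id $desc\n";
```

Keeping the product in the description rather than in the ID avoids spaces inside the ID, since FASTA readers usually treat everything after the first whitespace as description.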
From bosborne11 at verizon.net Tue Nov 20 18:50:00 2012 From: bosborne11 at verizon.net (Brian Osborne) Date: Tue, 20 Nov 2012 18:50:00 -0500 Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream In-Reply-To: References: Message-ID: <5F077DEA-DEBD-42BC-87E7-327697764CFE@verizon.net> Felix, I took a look at the Bio::DB::Query::GenBank documentation, it says this: If you provide an array reference of IDs in -ids, the query will be ignored and the list of IDs will be used when the query is passed to a Bio::DB::GenBank object's get_Stream_by_query() method. Bio::DB::Genbank queries "nucleotide", by default. You have GeneIDs. I see that you're setting "-id" to "gene" but note that you're passing that query to a plain Bio::DB::GenBank object. Not sure what the expected behavior is here. I would try using the NCBI Eutilities for that second query, rather than Bio::DB::Query::GenBank (http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook). Brian O. On Nov 1, 2012, at 8:01 PM, Felix Horns wrote: > Hello everyone. > > I am having trouble using the get_Stream_by_query() function > in Bio::DB::GenBank. It seems to return an empty stream, such that > $stream->next_seq never returns anything. > > However, $query->count is returning the expected value (139). Also, > get_Stream_by_query() seems to be querying the database, as when I pass it > an array of GeneIDs that have not been properly formatted, i.e. > GeneID:7816864, instead of simply 7816864, it returns warnings and errors: > "MSG: Warning(s) from GenBank: GeneID 7817709...; MSG: > Error from Genbank: No items found.". > > I have included my full code below. I have also included the output from > the code below that. The code is intended to find genes located within a > genomic region. I will later find the protein domains and pathways that > those genes are involved in. > > Any help would be greatly appreciated. 
I realize that this is probably a > very simple question, but I am relatively new to BioPerl and I've spent the > better part of the day trying to figure out such issues, so I would be very > thankful for help. > > Felix > > > #!/usr/bin/perl > use strict; > use Bio::SeqIO; > use Bio::DB::EntrezGene; > use Bio::DB::GenBank; > > # Load reference sequence > # Load from local .gb file > # Note that .gb file does not include sequences > # my $gbfile = "NC_012660.1.gb"; > # my $seqio = Bio::SeqIO->new(-file => $gbfile); > # my $ref_seq = $seqio->next_seq; > > # To access reference sequence programatically, uncomment this code > my $gb = new Bio::DB::GenBank; > my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1"); > > # Specify coordinates of gap > my $gap_start = 2050506; > my $gap_end = 2190530; > > my $gene_count = 0; > my @features; > my @starts; > my @ends; > my @db_xrefs; > > my @products; > my @protein_ids; > > # Get gene features in gap > for my $feat ($ref_seq->get_SeqFeatures) { > my $start=$feat->location->start; > my $end=$feat->location->end; > > if (($feat->primary_tag eq 'gene') & > ($gap_start < $start) & ($start < $gap_end) & > ($gap_start < $end) & ($end < $gap_end)) { > > $gene_count += 1; > > # Get GeneID reference > my $db_xref = ($feat->get_tag_values('db_xref'))[0]; > $db_xref =~ s/GeneID://; # Trim "GeneID:" from start of $db_xref > > push @features, $feat; > push @starts, $start; > push @ends, $end; > push @db_xrefs, $db_xref; > } > } > > # Get data about gene features from GeneID reference > my $query = Bio::DB::Query::GenBank->new(-db => 'gene', > -ids => [@db_xrefs]); > my $stream = $gb->get_Stream_by_query($query); > > while (my $seq = $stream->next_seq) { > for my $feat ($seq->all_SeqFeatures) { > print "primary tag: ", $feat->primary_tag, "\n"; > for my $tag ($feat->get_all_tags) { > print " tag: ", $tag, "\n"; > for my $value ($feat->get_tag_values($tag)) { > print " value: ", $value, "\n"; > } > } > } > } > > print $query->count,"\n"; > print 
$gene_count, "\n"; > > > OUTPUT >> perl analyze_gap.pl > 139 > 139 > > Note that no "primary tag; tag; value" items are printed. Furthermore, > when I put a print line immediately after the (while (my $seq = > $stream->next_seq)) statement, it was never called, seemingly indicating > that the stream is empty. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bosborne11 at verizon.net Tue Nov 20 18:52:00 2012 From: bosborne11 at verizon.net (Brian Osborne) Date: Tue, 20 Nov 2012 18:52:00 -0500 Subject: [Bioperl-l] Ask for help about Bioperl In-Reply-To: References: Message-ID: <3B472C83-7F6C-41E2-A629-E6A2BDC6B075@verizon.net> ????, You're going to have show us your code, we can't help you just by seeing the error messages. Show us the input file as well, or the beginning of it. Brian O. On Nov 6, 2012, at 2:38 AM, ???? wrote: > hi, > when I use bioperl ,it report errors like this :--------------------- > WARNING --------------------- > MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,:: > --------------------------------------------------- > Error providing evidence type: GeneModel > The error was: > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Attempting to set the sequence '1' to > [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy > STACK: Error::throw > STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486 > STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285 > STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239 > STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383 > > > but,I am sure that the input file only cotain [ATGCN],I also try to use > another sequences ,but the errors are the same.my bioperl is Bioperl-live > 1.006902; > > -- > ???? 
> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at drycafe.net Tue Nov 20 21:24:50 2012 From: hlapp at drycafe.net (Hilmar Lapp) Date: Tue, 20 Nov 2012 21:24:50 -0500 Subject: [Bioperl-l] handle with file in perl In-Reply-To: <34626730.post@talk.nabble.com> References: <34626730.post@talk.nabble.com> Message-ID: <1DE09B34-5124-478C-8925-0045EC119CFC@drycafe.net> This sounds like a homework assignment. We're not here to do your homework or assignments for you. You can post if you run into a specific problem when solving your assignment with Bioperl, and we'll help with that. -hilmar Sent with a tap. On Oct 31, 2012, at 7:45 PM, eyla4ever wrote: > > hi > > i want to write a function that get as parameters : file_name, hsp , hit. > and i want her to print all the blast Field that i need to this file. > > i do it because i have 2 files with the same Fields. > > > 10X > -- > View this message in context: http://old.nabble.com/handle-with-file-in-perl-tp34626730p34626730.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From mahakadry at aucegypt.edu Fri Nov 23 20:33:59 2012 From: mahakadry at aucegypt.edu (maha ahmed) Date: Sat, 24 Nov 2012 03:33:59 +0200 Subject: [Bioperl-l] retrieving a subset of files from a folder Message-ID: Dear Bioperl list, I have a folder that has 60,000 files (one file for each phylogenetic tree) However I only need to work with a subset of 1,000 files from that folder (the files are not numbered in order so I cant use the i++ loop in my bioperl script) Is there a way to write a script that only moves files with the names given in a list in a text file i.e. 
I have a file that has the names of the files I want to copy fro m the folder and I want to write script that does this Thank you so much From kellert at ohsu.edu Sat Nov 24 13:08:11 2012 From: kellert at ohsu.edu (Tom Keller) Date: Sat, 24 Nov 2012 10:08:11 -0800 Subject: [Bioperl-l] use cookbook to work with a directory of files In-Reply-To: References: Message-ID: A search with the phrase "perl cookbook filenames from directory" should help you find what you need. On Nov 24, 2012, at 9:00 AM, bioperl-l-request at lists.open-bio.org wrote: > Send Bioperl-l mailing list submissions to > bioperl-l at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.open-bio.org/mailman/listinfo/bioperl-l > or, via email, send a message with subject or body 'help' to > bioperl-l-request at lists.open-bio.org > > You can reach the person managing the list at > bioperl-l-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Bioperl-l digest..." > > > Today's Topics: > > 1. retrieving a subset of files from a folder (maha ahmed) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sat, 24 Nov 2012 03:33:59 +0200 > From: maha ahmed > Subject: [Bioperl-l] retrieving a subset of files from a folder > To: Bioperl-l at lists.open-bio.org > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Dear Bioperl list, > I have a folder that has 60,000 files (one file for each phylogenetic tree) > However I only need to work with a subset of 1,000 files from that folder > (the files are not numbered in order so I cant use the i++ loop in my > bioperl script) > Is there a way to write a script that only moves files with the names given > in a list in a text file > i.e. 
I have a file that has the names of the files I want to copy fro m the > folder and I want to write script that does this > Thank you so much > > > ------------------------------ > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > End of Bioperl-l Digest, Vol 115, Issue 8 > ***************************************** From minou.nowrousian at rub.de Sat Nov 24 13:24:02 2012 From: minou.nowrousian at rub.de (Minou Nowrousian) Date: 24 Nov 2012 19:24:02 +0100 Subject: [Bioperl-l] retrieving a subset of files from a folder Message-ID: <000001cdca70$e1a97720$a4fc6560$@rub.de> >Dear Bioperl list, >I have a folder that has 60,000 files (one file for each phylogenetic tree) However I only need to work with a subset of 1,000 files from that folder >(the files are not numbered in order so I cant use the i++ loop in my bioperl script) Is there a way to write a script that only moves files with the >names given in a list in a text file i.e. 
I have a file that has the names of the files I want to copy fro m the folder and I want to write script that does >this Thank you so much I don't know if there is a BioPerl solution, but you could use the File::Copy module (available from CPAN): use File::Copy; copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy failed: $!"; Regards, Minou From mahakadry at aucegypt.edu Sat Nov 24 14:04:09 2012 From: mahakadry at aucegypt.edu (maha ahmed) Date: Sat, 24 Nov 2012 21:04:09 +0200 Subject: [Bioperl-l] retrieving a subset of files from a folder In-Reply-To: <000001cdca70$e1a97720$a4fc6560$@rub.de> References: <000001cdca70$e1a97720$a4fc6560$@rub.de> Message-ID: Thanks everyone , I actually found a one line command that I am going to try: xargs -a file_list.txt mv -t /path/to/des thanks for your help I will read have a look at the readings you suggested thank you On Sat, Nov 24, 2012 at 8:24 PM, Minou Nowrousian wrote: > > >Dear Bioperl list, > >I have a folder that has 60,000 files (one file for each phylogenetic > tree) > However I only need to work with a subset of 1,000 files from that folder > >(the files are not numbered in order so I cant use the i++ loop in my > bioperl script) Is there a way to write a script that only moves files with > the >names given in a list in a text file i.e. I have a file that has the > names of the files I want to copy fro m the folder and I want to write > script that does >this Thank you so much > > I don't know if there is a BioPerl solution, but you could use the > File::Copy module (available from CPAN): > > use File::Copy; > copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy > failed: $!"; > > Regards, > Minou > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From maj at fortinbras.us Tue Nov 27 08:49:46 2012 From: maj at fortinbras.us (Mark A. 
Jensen) Date: Tue, 27 Nov 2012 13:49:46 +0000 Subject: [Bioperl-l] Neo4j : applying user defined validation and constraints Message-ID: Hi Folks, Since there was some enthusiasm about REST::Neo4p, my interface to Neo4j, I thought I would let you know about https://metacpan.org/module/REST::Neo4p::Constrain This is a framework that lets you apply constraints on node and relationship properties, relationships, and relationship types. You can specify your constraints, and have REST::Neo4p throw exceptions when the constraints aren't met, or you can do validation on existing database items. The pod has a full explanation and examples aplenty. Please have a look and send bugs my way via RT. Cheers all, MAJ From francescomusacchia at gmail.com Wed Nov 28 05:27:16 2012 From: francescomusacchia at gmail.com (Francesco Musacchia) Date: Wed, 28 Nov 2012 02:27:16 -0800 (PST) Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access Message-ID: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> Hi all, I have a big problem with using GFF3 database with BioPerl. This is not a question about what is the way to write some bioperl code. I'm experiencing that when I have to do a lot of accessess on a GFF database (with Bio:DB::SeqFeature::Store) the slowness increase until my script can stay running for more than a day. How can I solve it? Or it cannot be done? Thanks a lot!
From shalabh.sharma7 at gmail.com Thu Nov 1 15:36:35 2012 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Thu, 1 Nov 2012 15:36:35 -0400 Subject: [Bioperl-l] blast question Message-ID: Hi All, First of all i am really very sorry for posting blast question in this forum, I am not sure if this is the right place. I will really appreciate if anyone can guide me to the right direction. I am using blastall to get a top hit from a database so i am using -v 1 -b 1 (i hope this is right). But the strange part is that i am getting wrong results. for example: if i use -v 1 -b 1 then for one of the hit i am getting this: Sequences producing significant alignments: (bits) Value fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 4e-04 If i use -v 3 -b 3 then i am getting this for the same query: Sequences producing significant alignments: (bits) Value fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...
570 e-167 fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 9e-07 fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 1.0 As you can see the top hit in the first case is totally wrong. I would really appreciate if someone can help me out, or direct to in the right direction. Thanks Shalabh -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From cjfields at illinois.edu Thu Nov 1 17:41:43 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 1 Nov 2012 21:41:43 +0000 Subject: [Bioperl-l] blast question In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> That's a scary error, but the best place to submit this would be the BLAST help list at NCBI (cc'd) chris On Nov 1, 2012, at 2:36 PM, shalabh sharma wrote: > Hi All, > First of all i am really very sorry for posting blast question in > this forum, I am not sure if this is the right place. > I will really appreciate if anyone can guide me to the right direction. > > I am using blastall to get a top hit from a database so i am using -v 1 -b > 1 (i hope this is right). > But the strange part is that i am getting wrong results. > > for example: if i use -v 1 -b 1 then for one of the hit i am getting this: > > > Sequences producing significant alignments: (bits) > Value > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > 4e-04 > > > If i use -v 3 -b 3 then i am getting this for the same query: > > Sequences producing significant alignments: (bits) > Value > > fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 570 > e-167 > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > 9e-07 > fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 > 1.0 > > As you can see the top hit in the first case is totally wrong. 
> > I would really appreciate if someone can help me out, or direct to in the > right direction. > > Thanks > Shalabh > > > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences > University of Georgia > Athens, GA 30602-3636 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From shalabh.sharma7 at gmail.com Fri Nov 2 10:50:17 2012 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Fri, 2 Nov 2012 10:50:17 -0400 Subject: [Bioperl-l] blast question In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> Message-ID: I know, i am really worried about my past analysis now. Thanks a lot for cc'ing this mail Chris. -Shalabh On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J wrote: > That's a scary error, but the best place to submit this would be the BLAST > help list at NCBI (cc'd) > > chris > > On Nov 1, 2012, at 2:36 PM, shalabh sharma > wrote: > > > Hi All, > > First of all i am really very sorry for posting blast question > in > > this forum, I am not sure if this is the right place. > > I will really appreciate if anyone can guide me to the right direction. > > > > I am using blastall to get a top hit from a database so i am using -v 1 > -b > > 1 (i hope this is right). > > But the strange part is that i am getting wrong results. 
> > > > for example: if i use -v 1 -b 1 then for one of the hit i am getting > this: > > > > > > Sequences producing significant alignments: (bits) > > Value > > > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > > 4e-04 > > > > > > If i use -v 3 -b 3 then i am getting this for the same query: > > > > Sequences producing significant alignments: (bits) > > Value > > > > fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 570 > > e-167 > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > > 9e-07 > > fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 > > 1.0 > > > > As you can see the top hit in the first case is totally wrong. > > > > I would really appreciate if someone can help me out, or direct to in the > > right direction. > > > > Thanks > > Shalabh > > > > > > > > -- > > Shalabh Sharma > > Scientific Computing Professional Associate (Bioinformatics Specialist) > > Department of Marine Sciences > > University of Georgia > > Athens, GA 30602-3636 > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From Scott.Markel at accelrys.com Fri Nov 2 20:13:59 2012 From: Scott.Markel at accelrys.com (Scott Markel) Date: Fri, 2 Nov 2012 17:13:59 -0700 Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB file format specification change Message-ID: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net> In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html). PDB now writes out to column 79, while pdb.pm is still using the old line length of 71. 
Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated.

Some of the Perl lines are really simple, e.g.,

$keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);

with others being just a little more detailed, e.g.,

my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;

It doesn't look like pdb.pm has changed in about 1.5 years. Is there a current module owner? Or someone else working on this?

If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file. Please let us know which is preferred.

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
International Society for Computational Biology Chair: ISCB Publications and Communications Committee Associate Editor: PLoS Computational Biology Editorial Board: Briefings in Bioinformatics From cjfields at illinois.edu Fri Nov 2 22:08:52 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sat, 3 Nov 2012 02:08:52 +0000 Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB file format specification change In-Reply-To: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net> References: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC817F8@CHIMBX5.ad.uillinois.edu> On Nov 2, 2012, at 7:13 PM, Scott Markel wrote: > In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html). PDB now writes out to column 79, while pdb.pm is still using the old line length of 71. Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated. > > Some of the Perl lines are really simple, e.g., > > $keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer); > > with others being just a little more detailed, e.g., > > my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_; > > It doesn't look like pdb.pm has changed in about 1.5 years. Is there a current module owner? Or someone else working on this? No one has really taken ownership, so as far as I'm concerned it's open. Any objections? > If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file. Please let us know which is preferred. A new version of the file is fine if you have someone who can work on it. We would also like to change relevant tests and documentation if there is time. 
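
[Editorial illustration, not part of the original messages.] The column-width problem discussed in this thread can be sketched in a few lines of Perl. This is not the actual pdb.pm fix. It assumes that the free-text field of a record starts at column 20 (as in the quoted unpack template) and now runs through column 79, so the last field would grow from A51 to A60. The names `parse_record`, `$OLD_TEMPLATE` and `$NEW_TEMPLATE`, and the sample line, are hypothetical.

```perl
use strict;
use warnings;

# Template quoted in the thread: the text field is 51 characters (columns 20-70).
my $OLD_TEMPLATE = 'A6 x6 A4 A2 x1 A51';
# Assumed widened template: text field extended through column 79 (60 characters).
my $NEW_TEMPLATE = 'A6 x6 A4 A2 x1 A60';

# Split one fixed-width record into (record, subrecord, continuation, text).
sub parse_record {
    my ($line, $template) = @_;
    # Pad to 80 columns so the "x" skips never run past the end of a short line.
    my $padded = sprintf '%-80s', $line;
    return unpack $template, $padded;
}

# Synthetic line (not from a real PDB entry): 60 letters filling columns 20-79.
my $text = join '', map { chr(ord('A') + $_ % 26) } 0 .. 59;
my $line = sprintf '%-6s%6s%-4s%2s %s', 'JRNL', '', 'TITL', '', $text;

my @old = parse_record($line, $OLD_TEMPLATE);
my @new = parse_record($line, $NEW_TEMPLATE);

printf "old template keeps %d characters of text\n", length $old[3];
printf "new template keeps %d characters of text\n", length $new[3];
```

With the old template the last nine characters of the text field are silently dropped, which matches the kind of truncation described for journal titles; the real column ranges for each record type would still need to be checked against the format specification.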
Thanks Scott!

chris

From Russell.Smithies at agresearch.co.nz Sun Nov 4 16:00:37 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 5 Nov 2012 10:00:37 +1300
Subject: [Bioperl-l] blast question
In-Reply-To:
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>

What version of blast are you using?
There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+

--Russell

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

From cjfields at illinois.edu Sun Nov 4 17:13:37 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 4 Nov 2012 22:13:37 +0000
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>

That in fact is the recommendation (migrate to BLAST+).
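
[Editorial illustration, not part of the original messages.] When only the top hit per query is wanted, one option that does not depend on how the report is truncated is to request tabular output and pick the best-scoring row per query afterwards. The sketch below is a minimal example under stated assumptions: it expects the standard 12-column tabular layout (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). `top_hits` is a hypothetical helper, the query id `query1` is made up, and the sample rows reuse the subject ids and bit scores reported earlier in this thread with placeholder zeros in the columns that are not used.

```perl
use strict;
use warnings;

# Return a hash reference: query id => [subject id, bit score] of the best hit.
sub top_hits {
    my @lines = @_;
    my %best;
    for my $line (@lines) {
        chomp(my $row = $line);
        next if $row =~ /^\s*(?:#.*)?$/;    # skip blank lines and comments
        my @f = split /\t/, $row;
        next unless @f >= 12;               # ignore malformed rows
        my ($query, $subject, $bits) = @f[0, 1, 11];
        # Keep the row with the highest bit score seen so far for this query.
        if (!exists $best{$query} || $bits > $best{$query}[1]) {
            $best{$query} = [$subject, $bits];
        }
    }
    return \%best;
}

# Placeholder rows: only the ids, e-values and bit scores come from the thread.
my @rows = (
    join("\t", 'query1', 'fig|6666666.11092.peg.487',  (0) x 8, '9e-07',  38),
    join("\t", 'query1', 'fig|6666666.11092.peg.1134', (0) x 8, '1e-167', 570),
    join("\t", 'query1', 'fig|6666666.11092.peg.1133', (0) x 8, '1.0',    18),
);

my $best = top_hits(@rows);
for my $query (sort keys %$best) {
    printf "%s\t%s\t%s\n", $query, @{ $best->{$query} };
}
```

For the sample rows this selects fig|6666666.11092.peg.1134 (bit score 570) regardless of the order in which the rows arrive. It does not by itself reduce the size of the raw BLAST output, so it complements rather than replaces limiting the number of reported hits.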
chris

From florent.angly at gmail.com Sun Nov 4 19:46:44 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 05 Nov 2012 10:46:44 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50920D59.4010307@gmail.com>
References: <50920D59.4010307@gmail.com>
Message-ID: <50970C74.7070605@gmail.com>

I am planning on merging the branch with master this week.
Best,
Florent

From cjfields at illinois.edu Sun Nov 4 21:43:28 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 5 Nov 2012 02:43:28 +0000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50970C74.7070605@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>

Florent,

Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):

[cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t
t/Seq/PrimarySeq.t .. 1/167
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
--------------------------------------------------- --------------------- WARNING --------------------- MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $ --------------------------------------------------- --------------------- WARNING --------------------- MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is & --------------------------------------------------- --------------------- WARNING --------------------- MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+ --------------------------------------------------- --------------------- WARNING --------------------- MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/ --------------------------------------------------- t/Seq/PrimarySeq.t .. ok All tests successful. Files=1, Tests=167, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.18 cusr 0.01 csys = 0.23 CPU) Result: PASS chris On Nov 4, 2012, at 6:46 PM, Florent Angly wrote: > I am planning on merging the branch with master this week. > Best, > Florent > > > On 01/11/12 15:49, Florent Angly wrote: >> Hi all, >> >> I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on Bioperl. Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were microbial genomes, i.e. several Mbp long sequences, which is quite long. A lot of the performance cost was associated with passing full genomes between functions. For example, when doing a call to length(), length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So, every call to length is very expensive for long sequences. And there is a lot of code that calls length(), for error checking. >> >> I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g. 
Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength >> >> I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to improve the documentation. In addition, to address the speed issue, I did some changes to Bio::PrimarySeq and Bio::PrimarySeqI : >> ? The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no extra call to seq() (which would incur the cost of copying the entire sequence between functions). >> ? The length is saved as an object attribute. So, calling length() is very cheap since it only needs to retrieve the stored value for the length. >> ? There is a constructor called -direct, which skips sequence validation. However, it was only active in conjunction with the -ref_to_seq constructor. To make -direct conform better to its documented purpose, I made it -direct work when a sequence is set through -seq as well. >> ? This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq object from an existing (already validated!) Bio::PrimarySeq object, the new object can be constructed with the -direct constructor, to save some time. >> ? Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and these calls were easily replaced by simple calls to substr() to save some time. >> >> A real-world test I performed with Grinder took 3m28s before the changes (and ~1 min is spent doing something unrelated). After the changes, the same test took only 2min28s. 
So, it's quite a significant improvement and on more specific test cases, performance gains can obviously be much bigger. Also, I anticipate that the gains would be bigger for even longer sequences. >> >> All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of Bioperl, so please, have a look at it and let me know if there are any glaring errors. >> >> Thanks, >> >> Florent >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From shalabh.sharma7 at gmail.com Mon Nov 5 12:03:38 2012 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Mon, 5 Nov 2012 12:03:38 -0500 Subject: [Bioperl-l] blast question In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu> References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu> Message-ID: Hi All, thanks for all your responses. Currently i am using the old version of blastall 2.2.22. @Peter: I will update my blast and will see if the problem still exist. But i can't restrict my blast with e value because i work on environmental samples , i have to reduce the size of my blast files as i am only interested in the top hit and my data sets are really huge. Thanks Shalabh On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J wrote: > That in fact is the recommendation (migrate to BLAST+). > > chris > > On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" < > Russell.Smithies at agresearch.co.nz> wrote: > > > What version of blast are you using? 
> > There have been quite a few bug fixes and I suspect any responses from > NCBI will suggest upgrading to the current version of blast+ > > > > > > --Russell > > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto: > bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma > > Sent: Saturday, 3 November 2012 3:50 a.m. > > To: Fields, Christopher J > > Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov > > Subject: Re: [Bioperl-l] blast question > > > > I know, i am really worried about my past analysis now. > > Thanks a lot for cc'ing this mail Chris. > > > > -Shalabh > > > > On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J < > cjfields at illinois.edu > >> wrote: > > > >> That's a scary error, but the best place to submit this would be the > >> BLAST help list at NCBI (cc'd) > >> > >> chris > >> > >> On Nov 1, 2012, at 2:36 PM, shalabh sharma > >> wrote: > >> > >>> Hi All, > >>> First of all i am really very sorry for posting blast > >>> question > >> in > >>> this forum, I am not sure if this is the right place. > >>> I will really appreciate if anyone can guide me to the right direction. > >>> > >>> I am using blastall to get a top hit from a database so i am using > >>> -v 1 > >> -b > >>> 1 (i hope this is right). > >>> But the strange part is that i am getting wrong results. > >>> > >>> for example: if i use -v 1 -b 1 then for one of the hit i am getting > >> this: > >>> > >>> > >>> Sequences producing significant alignments: (bits) > >>> Value > >>> > >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA > 38 > >>> 4e-04 > >>> > >>> > >>> If i use -v 3 -b 3 then i am getting this for the same query: > >>> > >>> Sequences producing significant alignments: (bits) > >>> Value > >>> > >>> fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 
> 570 > >>> e-167 > >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA > 38 > >>> 9e-07 > >>> fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... > 18 > >>> 1.0 > >>> > >>> As you can see the top hit in the first case is totally wrong. > >>> > >>> I would really appreciate if someone can help me out, or direct to > >>> in the right direction. > >>> > >>> Thanks > >>> Shalabh > >>> > >>> > >>> > >>> -- > >>> Shalabh Sharma > >>> Scientific Computing Professional Associate (Bioinformatics > >>> Specialist) Department of Marine Sciences University of Georgia > >>> Athens, GA 30602-3636 > >>> _______________________________________________ > >>> Bioperl-l mailing list > >>> Bioperl-l at lists.open-bio.org > >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > >> > > > > > > -- > > Shalabh Sharma > > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences University of Georgia Athens, GA 30602-3636 > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > ======================================================================= > > Attention: The information contained in this message and/or attachments > > from AgResearch Limited is intended only for the persons or entities > > to which it is addressed and may contain confidential and/or privileged > > material. Any review, retransmission, dissemination or other use of, or > > taking of any action in reliance upon, this information by persons or > > entities other than the intended recipients is prohibited by AgResearch > > Limited. If you have received this message in error, please notify the > > sender immediately. 
> > ======================================================================= > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From Russell.Smithies at agresearch.co.nz Mon Nov 5 16:04:07 2012 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Tue, 6 Nov 2012 10:04:07 +1300 Subject: [Bioperl-l] blast question In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu> Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz> If you're using an older version of blast there was a bug where not all results were returned - I think the limit was 10,000 hits? Not usually a problem running basic queries but a big problem for environmental or metagenomic samples, or when aligning short reads. --Russell From: shalabh sharma [mailto:shalabh.sharma7 at gmail.com] Sent: Tuesday, 6 November 2012 6:04 a.m. To: Fields, Christopher J Cc: Smithies, Russell; bioperl-l Subject: Re: [Bioperl-l] blast question Hi All, thanks for all your responses. Currently i am using the old version of blastall 2.2.22. @Peter: I will update my blast and will see if the problem still exist. But i can't restrict my blast with e value because i work on environmental samples , i have to reduce the size of my blast files as i am only interested in the top hit and my data sets are really huge. Thanks Shalabh On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J > wrote: That in fact is the recommendation (migrate to BLAST+). 
chris On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" > wrote: > What version of blast are you using? > There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+ > > > --Russell > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma > Sent: Saturday, 3 November 2012 3:50 a.m. > To: Fields, Christopher J > Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov > Subject: Re: [Bioperl-l] blast question > > I know, i am really worried about my past analysis now. > Thanks a lot for cc'ing this mail Chris. > > -Shalabh > > On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J >> wrote: > >> That's a scary error, but the best place to submit this would be the >> BLAST help list at NCBI (cc'd) >> >> chris >> >> On Nov 1, 2012, at 2:36 PM, shalabh sharma > >> wrote: >> >>> Hi All, >>> First of all i am really very sorry for posting blast >>> question >> in >>> this forum, I am not sure if this is the right place. >>> I will really appreciate if anyone can guide me to the right direction. >>> >>> I am using blastall to get a top hit from a database so i am using >>> -v 1 >> -b >>> 1 (i hope this is right). >>> But the strange part is that i am getting wrong results. >>> >>> for example: if i use -v 1 -b 1 then for one of the hit i am getting >> this: >>> >>> >>> Sequences producing significant alignments: (bits) >>> Value >>> >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 >>> 4e-04 >>> >>> >>> If i use -v 3 -b 3 then i am getting this for the same query: >>> >>> Sequences producing significant alignments: (bits) >>> Value >>> >>> fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 570 >>> e-167 >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 >>> 9e-07 >>> fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 
18 >>> 1.0 >>> >>> As you can see the top hit in the first case is totally wrong. >>> >>> I would really appreciate if someone can help me out, or direct to >>> in the right direction. >>> >>> Thanks >>> Shalabh >>> >>> >>> >>> -- >>> Shalabh Sharma >>> Scientific Computing Professional Associate (Bioinformatics >>> Specialist) Department of Marine Sciences University of Georgia >>> Athens, GA 30602-3636 >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. 
> ======================================================================= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From shalabh.sharma7 at gmail.com Mon Nov 5 16:09:03 2012 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Mon, 5 Nov 2012 16:09:03 -0500 Subject: [Bioperl-l] blast question In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz> References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz> Message-ID: Hi All, Thanks for all the suggestion. The problem is fixed by using latest blast+ . Thanks Shalabh On Mon, Nov 5, 2012 at 4:04 PM, Smithies, Russell < Russell.Smithies at agresearch.co.nz> wrote: > If you?re using an older version of blast there was a bug where not all > results were returned ? 
> I think the limit was 10,000 hits?
>
> Not usually a problem running basic queries but a big problem for
> environmental or metagenomic samples, or when aligning short reads.
>
> --Russell

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From florent.angly at gmail.com Tue Nov 6 06:06:56 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Tue, 06 Nov 2012 21:06:56 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com> <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
Message-ID: <5098EF50.5040208@gmail.com>

Yes, good idea, Chris.

Actually, thinking about it, most of these warnings were redundant. So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that it issues exceptions if requested.
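
To make "issues exceptions if requested" concrete, here is a minimal sketch in plain Perl. It is not the Bio::PrimarySeq code: check_seq(), its $throw flag and the allowed character set are stand-ins assumed for illustration only, and the real method's arguments should be checked in the module's POD.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a validator: returns 1 for a valid string,
# 0 for an invalid one, or dies instead of returning 0 when $throw is true.
sub check_seq {
    my ($seq_str, $throw) = @_;
    # Assumed alphabet for this sketch: letters plus gap and stop symbols.
    my @bad = $seq_str =~ /([^A-Za-z\-\.\*]+)/g;
    return 1 unless @bad;
    my $msg = "sequence doesn't validate, mismatch is " . join(',', @bad);
    die "$msg\n" if $throw;
    return 0;
}

# Quiet mode: the caller inspects the return value, no warning is printed.
print "valid\n"   if check_seq('ACGTN');
print "invalid\n" if !check_seq('ACGT!$');

# Strict mode: the caller asks for an exception and traps it.
my $ok = eval { check_seq('AC&GT', 1) };
print "caught: $@" if !defined $ok;
```

The point of the flag is that callers which already handle invalid input themselves stay silent, while callers that want a hard failure can ask for one.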
Florent

On 05/11/12 12:43, Fields, Christopher J wrote:
> Florent,
>
> Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):
>
> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t
> t/Seq/PrimarySeq.t .. 1/167
> --------------------- WARNING ---------------------
> MSG: Got a sequence without letters. Could not guess alphabet
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
> ---------------------------------------------------
> t/Seq/PrimarySeq.t .. ok
> All tests successful.
> Files=1, Tests=167, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.18 cusr 0.01 csys = 0.23 CPU)
> Result: PASS
>
> chris
>
> On Nov 4, 2012, at 6:46 PM, Florent Angly wrote:
>
>> I am planning on merging the branch with master this week.
>> Best,
>> Florent

From shlomif at shlomifish.org Tue Nov 6 07:27:00 2012
From: shlomif at shlomifish.org (Shlomi Fish)
Date: Tue, 6 Nov 2012 14:27:00 +0200
Subject: [Bioperl-l] [Request] Please Help Add Some Information about Perl for Bio-Informatics to http://perl-begin.org/uses/bio-info/
In-Reply-To: <20121026192203.6d1e59c0@lap.shlomifish.org>
References: <20121026192203.6d1e59c0@lap.shlomifish.org>
Message-ID: <20121106142700.192f456e@lap.shlomifish.org>

Hi,

Can anyone help with that?
Regards,

Shlomi Fish

On Fri, 26 Oct 2012 19:22:03 +0200 Shlomi Fish wrote:

> Hi all,
>
> I am the maintainer of http://perl-begin.org/ , the Perl Beginners' Site. I
> had this page there for a long time, but it's empty:
>
> http://perl-begin.org/uses/bio-info/
>
> Can someone help me add some information there? A short XHTML page will be OK.
> For reference, see the other pages in the section
> ( http://perl-begin.org/uses/ ) such as:
>
> * http://perl-begin.org/uses/web/
>
> * http://perl-begin.org/uses/sys-admin/
>
> * http://perl-begin.org/uses/qa/
>
> Note that you agree that the content will be licensed under the Creative
> Commons Attribution 3.0 Unported License (or higher versions) and so you
> should make sure it is original.
>
> I shall be obliged for any help.
>
> Regards,
>
> Shlomi Fish
>

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
Perl Humour - http://perl-begin.org/humour/

A wiseman can learn from a fool much more than a fool can ever learn from a wiseman.
- http://en.wikiquote.org/wiki/Cato_the_Elder

Please reply to list if it's a mailing list post - http://shlom.in/reply .

From florent.angly at gmail.com Thu Nov 15 11:29:30 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Fri, 16 Nov 2012 02:29:30 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <5098EF50.5040208@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com> <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu> <5098EF50.5040208@gmail.com>
Message-ID: <50A5186A.4060304@gmail.com>

I have now merged the branch with master.
Best,
Florent

On 06/11/12 21:06, Florent Angly wrote:
> Yes, good idea, Chris.
>
> Actually, thinking about it, most of these warnings were redundant.
> So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that
> it issues exceptions if requested.
>
> Florent

From mahakadry at aucegypt.edu Tue Nov 20 13:44:53 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Tue, 20 Nov 2012 20:44:53 +0200
Subject: [Bioperl-l] Parsing a blast report with multiple queries into separate one query files that only contain the fasta sequences
Message-ID:

Dear BioPerl list,

I blasted a file that has several fasta queries against nr. However, I need to align each query with its hits for further computational analysis, so I need to parse the resulting blast report into several files, each containing only the fasta query sequence and its hits in fasta format.

I found this script online:

use Bio::Search::Result::BlastResult;
use Bio::SearchIO;

my $report = Bio::SearchIO->new( -file=>'full-report.bls', -format => blast);
my $result = $report->next_result;
my %hits_by_query;
while (my $hit = $result->next_hit) {
    push @{$hits_by_query{$hit->name}}, $hit;
}

foreach my $qid ( keys %hits_by_query ) {
    my $result = Bio::Search::Result::BlastResult->new();
    $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
    my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format =>'blast' );
    $blio->write_result($result);
}

However, running it produced the following error message:

BlastResult::new(): Not adding iterations.

------------- EXCEPTION: Bio::Root::NoSuchThing -------------
MSG: No such iteration number: 0. Valid range=1-0
VALUE: The number zero (0)
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
STACK: Bio::Search::Result::BlastResult::iteration /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
STACK: Bio::Search::Result::BlastResult::add_hit /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
STACK: ./parsing.blast.results.into.per.query.files.pl:15

I tried to search for other scripts but I couldn't find any. I would really appreciate your comments on this.

Thank you

From cjfields at illinois.edu Tue Nov 20 14:21:25 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Nov 2012 19:21:25 +0000
Subject: [Bioperl-l] Parsing a blast report with multiple queries into separate one query files that only contain the fasta sequences
In-Reply-To:
References:
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF22E15@CITESMBX5.ad.uillinois.edu>

Maha,

Do you need only the sequence reported in the report (e.g. the HSP alignments) or the original FASTA sequences? The former can be recovered from the Bio::Search::HSP::GenericHSP objects as an alignment, and this can be redirected to a FASTA file. The latter is a little trickier, as you will have to retrieve the sequences from their original source files.

chris

From rfhorns at gmail.com Thu Nov 1 20:01:34 2012
From: rfhorns at gmail.com (Felix Horns)
Date: Fri, 02 Nov 2012 00:01:34 -0000
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
Message-ID:

Hello everyone.

I am having trouble using the get_Stream_by_query() function in Bio::DB::GenBank. It seems to return an empty stream, such that $stream->next_seq never returns anything.

However, $query->count is returning the expected value (139). Also, get_Stream_by_query() seems to be querying the database, as when I pass it an array of GeneIDs that have not been properly formatted, i.e. GeneID:7816864, instead of simply 7816864, it returns warnings and errors: "MSG: Warning(s) from GenBank: GeneID 7817709...; MSG: Error from Genbank: No items found.".

I have included my full code below. I have also included the output from the code below that. The code is intended to find genes located within a genomic region. I will later find the protein domains and pathways that those genes are involved in.

Any help would be greatly appreciated. I realize that this is probably a very simple question, but I am relatively new to BioPerl and I've spent the better part of the day trying to figure out such issues, so I would be very thankful for help.

Felix

#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::DB::EntrezGene;
use Bio::DB::GenBank;

# Load reference sequence
# Load from local .gb file
# Note that .gb file does not include sequences
# my $gbfile = "NC_012660.1.gb";
# my $seqio = Bio::SeqIO->new(-file => $gbfile);
# my $ref_seq = $seqio->next_seq;

# To access reference sequence programatically, uncomment this code
my $gb = new Bio::DB::GenBank;
my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");

# Specify coordinates of gap
my $gap_start = 2050506;
my $gap_end = 2190530;

my $gene_count = 0;
my @features;
my @starts;
my @ends;
my @db_xrefs;

my @products;
my @protein_ids;

# Get gene features in gap
for my $feat ($ref_seq->get_SeqFeatures) {
    my $start=$feat->location->start;
    my $end=$feat->location->end;

    if (($feat->primary_tag eq 'gene') &
        ($gap_start < $start) & ($start < $gap_end) &
        ($gap_start < $end) & ($end < $gap_end)) {

        $gene_count += 1;

        # Get GeneID reference
        my $db_xref = ($feat->get_tag_values('db_xref'))[0];
        $db_xref =~ s/GeneID://; # Trim "GeneID:" from start of $db_xref

        push @features, $feat;
        push @starts, $start;
        push @ends, $end;
        push @db_xrefs, $db_xref;
    }
}

# Get data about gene features from GeneID reference
my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
                                         -ids => [@db_xrefs]);
my $stream = $gb->get_Stream_by_query($query);

while (my $seq = $stream->next_seq) {
    for my $feat ($seq->all_SeqFeatures) {
        print "primary tag: ", $feat->primary_tag, "\n";
        for my $tag ($feat->get_all_tags) {
            print " tag: ", $tag, "\n";
            for my $value ($feat->get_tag_values($tag)) {
                print " value: ", $value, "\n";
            }
        }
    }
}

print $query->count,"\n";
print $gene_count, "\n";

OUTPUT
> perl analyze_gap.pl
139
139

Note that no "primary tag; tag; value" items are printed. Furthermore, when I put a print line immediately after the (while (my $seq = $stream->next_seq)) statement, it was never called, seemingly indicating that the stream is empty.

From mooldhu at gmail.com Tue Nov 6 02:38:57 2012
From: mooldhu at gmail.com (=?GB2312?B?uvq9rQ==?=)
Date: Tue, 6 Nov 2012 15:38:57 +0800
Subject: [Bioperl-l] Ask for help about Bioperl
Message-ID:

Hi,

When I use BioPerl, it reports errors like this:

--------------------- WARNING ---------------------
MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
---------------------------------------------------
Error providing evidence type: GeneModel
The error was:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Attempting to set the sequence '1' to [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383

But I am sure that the input file only contains [ATGCN]. I also tried other sequences, but the errors are the same. My BioPerl is Bioperl-live 1.006902.
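
One observation on the rejected string in the exception, offered as a guess rather than a diagnosis: the error is raised inside revcom(), and undoing a reverse-complement on that string gives text that looks like a stringified Perl object. The sketch below uses only core Perl; the complement table is the usual IUPAC mapping written out by hand for this example, not copied from Bio::PrimarySeqI.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The string that the exception reports as the rejected sequence.
my $rejected = ')81hvf95x0(DSTD=qeSrytkiyP::oiV';

# Undo a reverse-complement: reverse the string, then complement it.
# Characters outside the table (digits, punctuation, e, f, i, o, ...)
# are left unchanged by tr.
sub undo_revcom {
    my ($str) = @_;
    my $out = reverse $str;    # scalar context: reverses the characters
    $out =~ tr/acgtrymkswhbvdnxACGTRYMKSWHBVDNX/tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX/;
    return $out;
}

print undo_revcom($rejected), "\n";   # prints Bio::PrimarySeq=HASH(0x59fbd18)
```

If that reading is right, the input file may well contain only [ATGCN], and the place to look is wherever a sequence object is handed over where its sequence string was expected (for example $seq where $seq->seq was intended).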

From assayagy at gmail.com Sat Nov 10 13:27:03 2012
From: assayagy at gmail.com (eyla4ever)
Date: Sat, 10 Nov 2012 10:27:03 -0800 (PST)
Subject: [Bioperl-l] Extracting sequences from Genbank files
In-Reply-To:
References:
Message-ID: <34664632.post@talk.nabble.com>

Hello Brian,

I would like you to send me your script. I think it can help me solve a big problem and finish my final project. I hope that will be possible.

Regards,
Eyla

BForde wrote:
>
> Hello,
>
> I have been modifying a script which extracts all the protein sequences
> from a genbank file and saves them in a multi-fasta file.
>
> I wish the fasta header to have both the locus_tag of the protein and the
> product. However I cannot get the product tag to write to the fasta
> header
>
> this is the relevant section of the script
>
> $s->display_id($f->has_tag('locus_tag') ? join(',',sort
> $f->each_tag_value('locus_tag')) :
> $f->has_tag('product') ?
> join(',',$f->each_tag_value('product')):
> $s->display_id);
>
> is "product" not an actual tag
>
> regards
>
> Brian
>
> --
> Brian Forde
> Microbiology Dept.
> Bioscience Institute. Room 4.11
> University College Cork
> Cork
> Ireland
> tel:+353 21 4901306
> email: b.m.forde at umail.ucc.ie
>

--
View this message in context: http://old.nabble.com/Extracting-sequences-from-Genbank-files-tp33901023p34664632.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
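
On the quoted snippet: a chained ternary stops at the first branch that matches, so whenever a feature has a locus_tag the 'product' branch is never reached, whether or not the tag exists. A minimal sketch of the difference in plain Perl, where the %tags hash, its values and the 'unknown' fallback are made up to stand in for a feature's tags:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a CDS feature: tag name => list of values.
my %tags = (
    locus_tag => ['LB_0001'],
    product   => ['hypothetical protein'],
);

# Chained ternary, as in the quoted snippet: the first branch that
# matches wins, so 'product' is ignored whenever 'locus_tag' exists.
my $either = exists $tags{locus_tag} ? join(',', sort @{ $tags{locus_tag} })
           : exists $tags{product}   ? join(',', @{ $tags{product} })
           :                           'unknown';

# Collect every tag that is present instead, then join the pieces.
my @parts;
for my $tag (qw(locus_tag product)) {
    push @parts, join(',', @{ $tags{$tag} }) if exists $tags{$tag};
}
my $both = @parts ? join(' ', @parts) : 'unknown';

print "$either\n";   # LB_0001
print "$both\n";     # LB_0001 hypothetical protein
```

Whether the product belongs in the ID or in the description part of the FASTA header is a separate choice; the sketch only shows why the original expression can never emit both.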
From bosborne11 at verizon.net Tue Nov 20 18:50:00 2012 From: bosborne11 at verizon.net (Brian Osborne) Date: Tue, 20 Nov 2012 18:50:00 -0500 Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream In-Reply-To: References: Message-ID: <5F077DEA-DEBD-42BC-87E7-327697764CFE@verizon.net> Felix, I took a look at the Bio::DB::Query::GenBank documentation, it says this: If you provide an array reference of IDs in -ids, the query will be ignored and the list of IDs will be used when the query is passed to a Bio::DB::GenBank object's get_Stream_by_query() method. Bio::DB::Genbank queries "nucleotide", by default. You have GeneIDs. I see that you're setting "-id" to "gene" but note that you're passing that query to a plain Bio::DB::GenBank object. Not sure what the expected behavior is here. I would try using the NCBI Eutilities for that second query, rather than Bio::DB::Query::GenBank (http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook). Brian O. On Nov 1, 2012, at 8:01 PM, Felix Horns wrote: > Hello everyone. > > I am having trouble using the get_Stream_by_query() function > in Bio::DB::GenBank. It seems to return an empty stream, such that > $stream->next_seq never returns anything. > > However, $query->count is returning the expected value (139). Also, > get_Stream_by_query() seems to be querying the database, as when I pass it > an array of GeneIDs that have not been properly formatted, i.e. > GeneID:7816864, instead of simply 7816864, it returns warnings and errors: > "MSG: Warning(s) from GenBank: GeneID 7817709...; MSG: > Error from Genbank: No items found.". > > I have included my full code below. I have also included the output from > the code below that. The code is intended to find genes located within a > genomic region. I will later find the protein domains and pathways that > those genes are involved in. > > Any help would be greatly appreciated. 
> I realize that this is probably a
> very simple question, but I am relatively new to BioPerl and I've spent the
> better part of the day trying to figure out such issues, so I would be very
> thankful for help.
>
> Felix
>
>
> #!/usr/bin/perl
> use strict;
> use Bio::SeqIO;
> use Bio::DB::EntrezGene;
> use Bio::DB::GenBank;
>
> # Load reference sequence
> # Load from local .gb file
> # Note that .gb file does not include sequences
> # my $gbfile = "NC_012660.1.gb";
> # my $seqio = Bio::SeqIO->new(-file => $gbfile);
> # my $ref_seq = $seqio->next_seq;
>
> # To access reference sequence programmatically, uncomment this code
> my $gb = new Bio::DB::GenBank;
> my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");
>
> # Specify coordinates of gap
> my $gap_start = 2050506;
> my $gap_end = 2190530;
>
> my $gene_count = 0;
> my @features;
> my @starts;
> my @ends;
> my @db_xrefs;
>
> my @products;
> my @protein_ids;
>
> # Get gene features in gap
> for my $feat ($ref_seq->get_SeqFeatures) {
>     my $start = $feat->location->start;
>     my $end   = $feat->location->end;
>
>     # Logical && (not bitwise &) so the tests short-circuit
>     if (($feat->primary_tag eq 'gene') &&
>         ($gap_start < $start) && ($start < $gap_end) &&
>         ($gap_start < $end) && ($end < $gap_end)) {
>
>         $gene_count += 1;
>
>         # Get GeneID reference
>         my $db_xref = ($feat->get_tag_values('db_xref'))[0];
>         $db_xref =~ s/GeneID://; # Trim "GeneID:" from start of $db_xref
>
>         push @features, $feat;
>         push @starts, $start;
>         push @ends, $end;
>         push @db_xrefs, $db_xref;
>     }
> }
>
> # Get data about gene features from GeneID reference
> my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
>                                          -ids => [@db_xrefs]);
> my $stream = $gb->get_Stream_by_query($query);
>
> while (my $seq = $stream->next_seq) {
>     for my $feat ($seq->all_SeqFeatures) {
>         print "primary tag: ", $feat->primary_tag, "\n";
>         for my $tag ($feat->get_all_tags) {
>             print " tag: ", $tag, "\n";
>             for my $value ($feat->get_tag_values($tag)) {
>                 print " value: ", $value, "\n";
>             }
>         }
>     }
> }
>
> print $query->count,"\n";
> print $gene_count, "\n";
>
>
> OUTPUT
>> perl analyze_gap.pl
> 139
> 139
>
> Note that no "primary tag; tag; value" items are printed. Furthermore,
> when I put a print line immediately after the (while (my $seq =
> $stream->next_seq)) statement, it was never called, seemingly indicating
> that the stream is empty.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From bosborne11 at verizon.net Tue Nov 20 18:52:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:52:00 -0500
Subject: [Bioperl-l] Ask for help about Bioperl
In-Reply-To:
References:
Message-ID: <3B472C83-7F6C-41E2-A629-E6A2BDC6B075@verizon.net>

????,

You're going to have to show us your code; we can't help you just by seeing
the error messages. Show us the input file as well, or the beginning of it.

Brian O.

On Nov 6, 2012, at 2:38 AM, ???? wrote:

> hi,
> when I use bioperl, it reports errors like this:
> --------------------- WARNING ---------------------
> MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
> ---------------------------------------------------
> Error providing evidence type: GeneModel
> The error was:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Attempting to set the sequence '1' to
> [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
> STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
> STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
> STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383
>
>
> But I am sure that the input file only contains [ATGCN]. I also tried
> other sequences, but the errors are the same. My bioperl is Bioperl-live
> 1.006902;
>
> --
> ????
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From hlapp at drycafe.net Tue Nov 20 21:24:50 2012
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Tue, 20 Nov 2012 21:24:50 -0500
Subject: [Bioperl-l] handle with file in perl
In-Reply-To: <34626730.post@talk.nabble.com>
References: <34626730.post@talk.nabble.com>
Message-ID: <1DE09B34-5124-478C-8925-0045EC119CFC@drycafe.net>

This sounds like a homework assignment. We're not here to do your homework
or assignments for you. You can post if you run into a specific problem when
solving your assignment with Bioperl, and we'll help with that.

  -hilmar

Sent with a tap.

On Oct 31, 2012, at 7:45 PM, eyla4ever wrote:

>
> hi
>
> I want to write a function that gets as parameters: file_name, hsp, hit,
> and I want it to print all the BLAST fields that I need to this file.
>
> I am doing it because I have 2 files with the same fields.
>
>
> 10X
> --
> View this message in context: http://old.nabble.com/handle-with-file-in-perl-tp34626730p34626730.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From mahakadry at aucegypt.edu Fri Nov 23 20:33:59 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 03:33:59 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID:

Dear Bioperl list,
I have a folder that has 60,000 files (one file for each phylogenetic tree).
However, I only need to work with a subset of 1,000 files from that folder
(the files are not numbered in order so I can't use the i++ loop in my
bioperl script).
Is there a way to write a script that only moves files with the names given
in a list in a text file, i.e.
I have a file that has the names of the files I want to copy from the
folder, and I want to write a script that does this.
Thank you so much

From kellert at ohsu.edu Sat Nov 24 13:08:11 2012
From: kellert at ohsu.edu (Tom Keller)
Date: Sat, 24 Nov 2012 10:08:11 -0800
Subject: [Bioperl-l] use cookbook to work with a directory of files
In-Reply-To:
References:
Message-ID:

A search with the phrase "perl cookbook filenames from directory" should
help you find what you need.

On Nov 24, 2012, at 9:00 AM, bioperl-l-request at lists.open-bio.org wrote:

> Send Bioperl-l mailing list submissions to
> bioperl-l at lists.open-bio.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> or, via email, send a message with subject or body 'help' to
> bioperl-l-request at lists.open-bio.org
>
> You can reach the person managing the list at
> bioperl-l-owner at lists.open-bio.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Bioperl-l digest..."
>
>
> Today's Topics:
>
> 1. retrieving a subset of files from a folder (maha ahmed)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 24 Nov 2012 03:33:59 +0200
> From: maha ahmed
> Subject: [Bioperl-l] retrieving a subset of files from a folder
> To: Bioperl-l at lists.open-bio.org
> Message-ID:
>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Dear Bioperl list,
> I have a folder that has 60,000 files (one file for each phylogenetic tree)
> However I only need to work with a subset of 1,000 files from that folder
> (the files are not numbered in order so I can't use the i++ loop in my
> bioperl script)
> Is there a way to write a script that only moves files with the names given
> in a list in a text file
> i.e. I have a file that has the names of the files I want to copy from the
> folder and I want to write script that does this
> Thank you so much
>
>
> ------------------------------
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> End of Bioperl-l Digest, Vol 115, Issue 8
> *****************************************

From minou.nowrousian at rub.de Sat Nov 24 13:24:02 2012
From: minou.nowrousian at rub.de (Minou Nowrousian)
Date: 24 Nov 2012 19:24:02 +0100
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <000001cdca70$e1a97720$a4fc6560$@rub.de>

> Dear Bioperl list,
> I have a folder that has 60,000 files (one file for each phylogenetic tree)
> However I only need to work with a subset of 1,000 files from that folder
> (the files are not numbered in order so I can't use the i++ loop in my
> bioperl script) Is there a way to write a script that only moves files with
> the names given in a list in a text file i.e. I have a file that has the
> names of the files I want to copy from the folder and I want to write
> script that does this Thank you so much

I don't know if there is a BioPerl solution, but you could use the
File::Copy module (available from CPAN):

  use File::Copy;
  copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy failed: $!";

Regards,
Minou

From mahakadry at aucegypt.edu Sat Nov 24 14:04:09 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 21:04:09 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
In-Reply-To: <000001cdca70$e1a97720$a4fc6560$@rub.de>
References: <000001cdca70$e1a97720$a4fc6560$@rub.de>
Message-ID:

Thanks everyone,
I actually found a one-line command that I am going to try:

  xargs -a file_list.txt mv -t /path/to/des

Thanks for your help. I will have a look at the readings you suggested.
Thank you

On Sat, Nov 24, 2012 at 8:24 PM, Minou Nowrousian wrote:

>
> >Dear Bioperl list,
> >I have a folder that has 60,000 files (one file for each phylogenetic
> tree)
> However I only need to work with a subset of 1,000 files from that folder
> >(the files are not numbered in order so I can't use the i++ loop in my
> bioperl script) Is there a way to write a script that only moves files with
> the >names given in a list in a text file i.e. I have a file that has the
> names of the files I want to copy from the folder and I want to write
> script that does >this Thank you so much
>
> I don't know if there is a BioPerl solution, but you could use the
> File::Copy module (available from CPAN):
>
> use File::Copy;
> copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy
> failed: $!";
>
> Regards,
> Minou
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From maj at fortinbras.us Tue Nov 27 08:49:46 2012
From: maj at fortinbras.us (Mark A.
Jensen) Date: Tue, 27 Nov 2012 13:49:46 +0000
Subject: [Bioperl-l] Neo4j : applying user defined validation and constraints
Message-ID:

Hi Folks,

Since there was some enthusiasm about REST::Neo4p, my interface to Neo4j,
I thought I would let you know about

https://metacpan.org/module/REST::Neo4p::Constrain

This is a framework that lets you apply constraints on node and relationship
properties, relationships, and relationship types. You can specify your
constraints, and have REST::Neo4p throw exceptions when the constraints
aren't met, or you can do validation on existing database items. The pod has
a full explanation and examples aplenty. Please have a look and send bugs my
way via RT.

Cheers all,
MAJ

From francescomusacchia at gmail.com Wed Nov 28 05:27:16 2012
From: francescomusacchia at gmail.com (Francesco Musacchia)
Date: Wed, 28 Nov 2012 02:27:16 -0800 (PST)
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
Message-ID: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>

Hi all,
I have a big problem with using a GFF3 database with BioPerl. This is not a
question about how to write some BioPerl code. I'm finding that when I have
to do a lot of accesses on a GFF database (with Bio::DB::SeqFeature::Store),
things slow down to the point where my script can keep running for more than
a day.

How can I solve it? Or can it not be done?

Thanks a lot!

From shalabh.sharma7 at gmail.com Thu Nov 1 15:36:35 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Thu, 1 Nov 2012 15:36:35 -0400
Subject: [Bioperl-l] blast question
Message-ID:

Hi All,
First of all i am really very sorry for posting blast question in
this forum, I am not sure if this is the right place.
I will really appreciate if anyone can guide me to the right direction.

I am using blastall to get a top hit from a database so i am using -v 1 -b
1 (i hope this is right).
But the strange part is that i am getting wrong results.

for example: if i use -v 1 -b 1 then for one of the hit i am getting this:


Sequences producing significant alignments: (bits) Value

fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 4e-04


If i use -v 3 -b 3 then i am getting this for the same query:

Sequences producing significant alignments: (bits) Value

fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...
570 e-167 fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 9e-07 fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 1.0 As you can see the top hit in the first case is totally wrong. I would really appreciate if someone can help me out, or direct to in the right direction. Thanks Shalabh -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From cjfields at illinois.edu Thu Nov 1 17:41:43 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 1 Nov 2012 21:41:43 +0000 Subject: [Bioperl-l] blast question In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> That's a scary error, but the best place to submit this would be the BLAST help list at NCBI (cc'd) chris On Nov 1, 2012, at 2:36 PM, shalabh sharma wrote: > Hi All, > First of all i am really very sorry for posting blast question in > this forum, I am not sure if this is the right place. > I will really appreciate if anyone can guide me to the right direction. > > I am using blastall to get a top hit from a database so i am using -v 1 -b > 1 (i hope this is right). > But the strange part is that i am getting wrong results. > > for example: if i use -v 1 -b 1 then for one of the hit i am getting this: > > > Sequences producing significant alignments: (bits) > Value > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > 4e-04 > > > If i use -v 3 -b 3 then i am getting this for the same query: > > Sequences producing significant alignments: (bits) > Value > > fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 570 > e-167 > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > 9e-07 > fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 > 1.0 > > As you can see the top hit in the first case is totally wrong. 
> > I would really appreciate if someone can help me out, or direct to in the > right direction. > > Thanks > Shalabh > > > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences > University of Georgia > Athens, GA 30602-3636 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From shalabh.sharma7 at gmail.com Fri Nov 2 10:50:17 2012 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Fri, 2 Nov 2012 10:50:17 -0400 Subject: [Bioperl-l] blast question In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> Message-ID: I know, i am really worried about my past analysis now. Thanks a lot for cc'ing this mail Chris. -Shalabh On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J wrote: > That's a scary error, but the best place to submit this would be the BLAST > help list at NCBI (cc'd) > > chris > > On Nov 1, 2012, at 2:36 PM, shalabh sharma > wrote: > > > Hi All, > > First of all i am really very sorry for posting blast question > in > > this forum, I am not sure if this is the right place. > > I will really appreciate if anyone can guide me to the right direction. > > > > I am using blastall to get a top hit from a database so i am using -v 1 > -b > > 1 (i hope this is right). > > But the strange part is that i am getting wrong results. 
> > > > for example: if i use -v 1 -b 1 then for one of the hit i am getting > this: > > > > > > Sequences producing significant alignments: (bits) > > Value > > > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > > 4e-04 > > > > > > If i use -v 3 -b 3 then i am getting this for the same query: > > > > Sequences producing significant alignments: (bits) > > Value > > > > fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 570 > > e-167 > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > > 9e-07 > > fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 > > 1.0 > > > > As you can see the top hit in the first case is totally wrong. > > > > I would really appreciate if someone can help me out, or direct to in the > > right direction. > > > > Thanks > > Shalabh > > > > > > > > -- > > Shalabh Sharma > > Scientific Computing Professional Associate (Bioinformatics Specialist) > > Department of Marine Sciences > > University of Georgia > > Athens, GA 30602-3636 > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From Scott.Markel at accelrys.com Fri Nov 2 20:13:59 2012 From: Scott.Markel at accelrys.com (Scott Markel) Date: Fri, 2 Nov 2012 17:13:59 -0700 Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB file format specification change Message-ID: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net> In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html). PDB now writes out to column 79, while pdb.pm is still using the old line length of 71. 
Our regression failure was caused by a reformatting of 1CRN; journal titles
started getting truncated.

Some of the Perl lines are really simple, e.g.,

  $keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);

with others being just a little more detailed, e.g.,

  my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;

It doesn't look like pdb.pm has changed in about 1.5 years. Is there a
current module owner? Or someone else working on this?

If not, we're willing to either compile a list of needed changes (walking
through the PDB file format specification and comparing the corresponding
column indices in pdb.pm) or provide a new version of the entire file.
Please let us know which is preferred.

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
International Society for Computational Biology Chair: ISCB Publications and Communications Committee Associate Editor: PLoS Computational Biology Editorial Board: Briefings in Bioinformatics From cjfields at illinois.edu Fri Nov 2 22:08:52 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sat, 3 Nov 2012 02:08:52 +0000 Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB file format specification change In-Reply-To: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net> References: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC817F8@CHIMBX5.ad.uillinois.edu> On Nov 2, 2012, at 7:13 PM, Scott Markel wrote: > In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html). PDB now writes out to column 79, while pdb.pm is still using the old line length of 71. Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated. > > Some of the Perl lines are really simple, e.g., > > $keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer); > > with others being just a little more detailed, e.g., > > my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_; > > It doesn't look like pdb.pm has changed in about 1.5 years. Is there a current module owner? Or someone else working on this? No one has really taken ownership, so as far as I'm concerned it's open. Any objections? > If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file. Please let us know which is preferred. A new version of the file is fine if you have someone who can work on it. We would also like to change relevant tests and documentation if there is time. 
> Scott > > Scott Markel, Ph.D. > Principal Bioinformatics Architect email: smarkel at accelrys.com > Accelrys (Pipeline Pilot R&D) mobile: +1 858 205 3653 > 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 > San Diego, CA 92121 fax: +1 858 799 5222 > USA web: http://www.accelrys.com > > http://www.linkedin.com/in/smarkel > Secretary, Board of Directors: > International Society for Computational Biology > Chair: ISCB Publications and Communications Committee > Associate Editor: PLoS Computational Biology > Editorial Board: Briefings in Bioinformatics Thanks Scott! chris From Russell.Smithies at agresearch.co.nz Sun Nov 4 16:00:37 2012 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Mon, 5 Nov 2012 10:00:37 +1300 Subject: [Bioperl-l] blast question In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> What version of blast are you using? There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+ --Russell -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma Sent: Saturday, 3 November 2012 3:50 a.m. To: Fields, Christopher J Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov Subject: Re: [Bioperl-l] blast question I know, i am really worried about my past analysis now. Thanks a lot for cc'ing this mail Chris. -Shalabh On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J wrote: > That's a scary error, but the best place to submit this would be the > BLAST help list at NCBI (cc'd) > > chris > > On Nov 1, 2012, at 2:36 PM, shalabh sharma > wrote: > > > Hi All, > > First of all i am really very sorry for posting blast > > question > in > > this forum, I am not sure if this is the right place. 
> > I will really appreciate if anyone can guide me to the right direction. > > > > I am using blastall to get a top hit from a database so i am using > > -v 1 > -b > > 1 (i hope this is right). > > But the strange part is that i am getting wrong results. > > > > for example: if i use -v 1 -b 1 then for one of the hit i am getting > this: > > > > > > Sequences producing significant alignments: (bits) > > Value > > > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > > 4e-04 > > > > > > If i use -v 3 -b 3 then i am getting this for the same query: > > > > Sequences producing significant alignments: (bits) > > Value > > > > fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 570 > > e-167 > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > > 9e-07 > > fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 > > 1.0 > > > > As you can see the top hit in the first case is totally wrong. > > > > I would really appreciate if someone can help me out, or direct to > > in the right direction. 
> > > > Thanks > > Shalabh > > > > > > > > -- > > Shalabh Sharma > > Scientific Computing Professional Associate (Bioinformatics > > Specialist) Department of Marine Sciences University of Georgia > > Athens, GA 30602-3636 > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From cjfields at illinois.edu Sun Nov 4 17:13:37 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sun, 4 Nov 2012 22:13:37 +0000 Subject: [Bioperl-l] blast question In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu> That in fact is the recommendation (migrate to BLAST+). 

chris

On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" wrote:

> What version of blast are you using?
> There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+
>
> --Russell

From florent.angly at gmail.com Sun Nov 4 19:46:44 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 05 Nov 2012 10:46:44 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50920D59.4010307@gmail.com>
References: <50920D59.4010307@gmail.com>
Message-ID: <50970C74.7070605@gmail.com>

I am planning on merging the branch with master this week.

Best,

Florent

From cjfields at illinois.edu Sun Nov 4 21:43:28 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 5 Nov 2012 02:43:28 +0000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50970C74.7070605@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>

Florent,

Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):

[cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t
t/Seq/PrimarySeq.t .. 1/167
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
---------------------------------------------------
t/Seq/PrimarySeq.t .. ok
All tests successful.
Files=1, Tests=167, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.18 cusr 0.01 csys = 0.23 CPU)
Result: PASS

chris

On Nov 4, 2012, at 6:46 PM, Florent Angly wrote:

> I am planning on merging the branch with master this week.
> Best,
> Florent

From shalabh.sharma7 at gmail.com Mon Nov 5 12:03:38 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 12:03:38 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
 <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
Message-ID:

Hi All,
thanks for all your responses.

Currently I am using the old version of blastall, 2.2.22.

@Peter: I will update my blast and see if the problem still exists. But I can't restrict my blast with an e-value because I work on environmental samples; I have to reduce the size of my blast files, as I am only interested in the top hit and my data sets are really huge.

Thanks
Shalabh

On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J wrote:

> That in fact is the recommendation (migrate to BLAST+).
>
> chris

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From Russell.Smithies at agresearch.co.nz Mon Nov 5 16:04:07 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Tue, 6 Nov 2012 10:04:07 +1300
Subject: [Bioperl-l] blast question
In-Reply-To:
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
 <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>

If you're using an older version of blast there was a bug where not all results were returned - I think the limit was 10,000 hits?
Not usually a problem running basic queries but a big problem for environmental or metagenomic samples, or when aligning short reads.

--Russell

From shalabh.sharma7 at gmail.com Mon Nov 5 16:09:03 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 16:09:03 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
 <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
Message-ID:

Hi All,
Thanks for all the suggestions. The problem is fixed by using the latest blast+.

Thanks
Shalabh

On Mon, Nov 5, 2012 at 4:04 PM, Smithies, Russell <Russell.Smithies at agresearch.co.nz> wrote:

> If you're using an older version of blast there was a bug where not all
> results were returned - I think the limit was 10,000 hits?
>
> Not usually a problem running basic queries but a big problem for
> environmental or metagenomic samples, or when aligning short reads.
>
> --Russell

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From florent.angly at gmail.com Tue Nov 6 06:06:56 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Tue, 06 Nov 2012 21:06:56 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
 <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
Message-ID: <5098EF50.5040208@gmail.com>

Yes, good idea, Chris.

Actually, thinking about it, most of these warnings were redundant. So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that it issues exceptions if requested.
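
To make "issues exceptions if requested" concrete, here is a minimal sketch of that calling pattern in plain Perl. It is only an illustration under stated assumptions: the function name validate_seq_sketch, its accepted alphabet and its message text are hypothetical and are not taken from the Bio::PrimarySeq source. By default the check quietly returns true or false; it dies only when the caller passes a true second argument.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-alone validator. The name, the accepted alphabet and
# the message text are illustrative only, not the Bio::PrimarySeq code.
sub validate_seq_sketch {
    my ($seq_str, $throw) = @_;
    $seq_str = '' unless defined $seq_str;

    # Collect every character that falls outside the assumed alphabet.
    my @bad = $seq_str =~ /([^A-Za-z*.~-])/g;
    return 1 unless @bad;

    # Raise an exception only when the caller asked for one.
    die "sequence doesn't validate, mismatch is " . join(',', @bad) . "\n"
        if $throw;
    return 0;
}

# Quiet mode: the caller just gets a boolean back, no warning is printed.
for my $s ('ACGT', 'AC!GT') {
    my $verdict = validate_seq_sketch($s) ? 'valid' : 'invalid';
    print "$s: $verdict\n";
}

# Strict mode: the caller opts in to an exception and traps it with eval.
my $ok = eval { validate_seq_sketch('AC!GT', 1) };
print "caught: $@" unless $ok;
```

With this shape, a test script can probe invalid input without producing stray warning banners, while code that must not continue with a bad sequence can ask for the exception instead.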

Florent

On 05/11/12 12:43, Fields, Christopher J wrote:

> Florent,
>
> Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):
>
> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t

From shlomif at shlomifish.org Tue Nov 6 07:27:00 2012
From: shlomif at shlomifish.org (Shlomi Fish)
Date: Tue, 6 Nov 2012 14:27:00 +0200
Subject: [Bioperl-l] [Request] Please Help Add Some Information about Perl for Bio-Informatics to http://perl-begin.org/uses/bio-info/
In-Reply-To: <20121026192203.6d1e59c0@lap.shlomifish.org>
References: <20121026192203.6d1e59c0@lap.shlomifish.org>
Message-ID: <20121106142700.192f456e@lap.shlomifish.org>

Hi,

Can anyone help with that?
Regards, Shlomi Fish On Fri, 26 Oct 2012 19:22:03 +0200 Shlomi Fish wrote: > Hi all, > > I am the maintainer of http://perl-begin.org/ , the Perl Beginners' Site. I > had this page there for a long time, but it's empty: > > http://perl-begin.org/uses/bio-info/ > > Can someone help me add some information there? A short XHTML page will be OK. > For reference, see the other pages in the section > ( http://perl-begin.org/uses/ ) such as: > > * http://perl-begin.org/uses/web/ > > * http://perl-begin.org/uses/sys-admin/ > > * http://perl-begin.org/uses/qa/ > > Note that you agree that the content will be licensed under the Creative > Commons Attribution 3.0 Unported License (or higher versions) and so you > should make sure it is original. > > I shall be obliged for any help. > > Regards, > > Shlomi Fish > -- ----------------------------------------------------------------- Shlomi Fish http://www.shlomifish.org/ Perl Humour - http://perl-begin.org/humour/ A wiseman can learn from a fool much more than a fool can ever learn from a wiseman. ? http://en.wikiquote.org/wiki/Cato_the_Elder Please reply to list if it's a mailing list post - http://shlom.in/reply . From florent.angly at gmail.com Thu Nov 15 11:29:30 2012 From: florent.angly at gmail.com (Florent Angly) Date: Fri, 16 Nov 2012 02:29:30 +1000 Subject: [Bioperl-l] Bio::PrimarySeq speedup In-Reply-To: <5098EF50.5040208@gmail.com> References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com> <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu> <5098EF50.5040208@gmail.com> Message-ID: <50A5186A.4060304@gmail.com> I now merged the branch with master. Best, Florent On 06/11/12 21:06, Florent Angly wrote: > Yes, good idea, Chris. > > Actually, thinking about it, most of these warnings were redundant. > So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that > it issues exceptions if requested. 
> > Florent > > > On 05/11/12 12:43, Fields, Christopher J wrote: >> Florent, >> >> Ran tests on it, they pass but I am seeing this (if these are >> expected, you can catch the warnings using Test::Warn): >> >> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr >> t/Seq/PrimarySeq.t >> t/Seq/PrimarySeq.t .. 1/167 >> --------------------- WARNING --------------------- >> MSG: Got a sequence without letters. Could not guess alphabet >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is ! >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $ >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is & >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is >> \,$,+ >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/ >> --------------------------------------------------- >> t/Seq/PrimarySeq.t .. ok >> All tests successful. >> Files=1, Tests=167, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.18 >> cusr 0.01 csys = 0.23 CPU) >> Result: PASS >> >> chris >> >> On Nov 4, 2012, at 6:46 PM, Florent Angly >> wrote: >> >>> I am planning on merging the branch with master this week. >>> Best, >>> Florent >>> >>> >>> On 01/11/12 15:49, Florent Angly wrote: >>>> Hi all, >>>> >>>> I was working with Ben Woodcroft on identifying ways to speed up >>>> Grinder, which relies heavily on Bioperl. 
Ben did some profiling >>>> with NYTProf and we realized that a lot of computation time was >>>> spent in Bio::PrimarySeq, doing calls to subseq() and length(). The >>>> sequences we used for the profiling were microbial genomes, i.e. >>>> several Mbp long sequences, which is quite long. A lot of the >>>> performance cost was associated with passing full genomes between >>>> functions. For example, when doing a call to length(), length() >>>> requests the full sequence from seq(), which returns it back to >>>> length() (it makes a copy!). So, every call to length is very >>>> expensive for long sequences. And there is a lot of code that calls >>>> length(), for error checking. >>>> >>>> I know that there are a few Bioperl modules that are more adapted >>>> to handling very long sequences, e.g. Bio::DB::Fasta or >>>> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look >>>> at Bio::PrimarySeq with Ben and I released this commit: >>>> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. >>>> But in fact, there were more things that I wanted to try to >>>> improve, which led me to start this new branch: >>>> https://github.com/bioperl/bioperl-live/tree/seqlength >>>> >>>> I wrote quite a few tests for functionalities that were not >>>> previously covered by tests, and tried to improve the >>>> documentation. In addition, to address the speed issue, I did some >>>> changes to Bio::PrimarySeq and Bio::PrimarySeqI : >>>> ? The length of a sequence is now computed as soon as the sequence >>>> is set, not after. This way, there is no extra call to seq() (which >>>> would incur the cost of copying the entire sequence between >>>> functions). >>>> ? The length is saved as an object attribute. So, calling length() >>>> is very cheap since it only needs to retrieve the stored value for >>>> the length. >>>> ? There is a constructor called -direct, which skips sequence >>>> validation. 
However, it was only active in conjunction with the >>>> -ref_to_seq constructor. To make -direct conform better to its >>>> documented purpose, I made it -direct work when a sequence is set >>>> through -seq as well. >>>> ? This brings us to trunc(), revcom() and other methods of >>>> Bio::PrimarySeqI. Since all these methods create a new >>>> Bio::PrimarySeq object from an existing (already validated!) >>>> Bio::PrimarySeq object, the new object can be constructed with the >>>> -direct constructor, to save some time. >>>> ? Finally, I noticed that subseq() used calls to eval() to do its >>>> work. eval() is notoriously slow and these calls were easily >>>> replaced by simple calls to substr() to save some time. >>>> >>>> A real-world test I performed with Grinder took 3m28s before the >>>> changes (and ~1 min is spent doing something unrelated). After the >>>> changes, the same test took only 2min28s. So, it's quite a >>>> significant improvement and on more specific test cases, >>>> performance gains can obviously be much bigger. Also, I anticipate >>>> that the gains would be bigger for even longer sequences. >>>> >>>> All the changes I made are meant to be backward compatible and all >>>> the tests in the Bioperl test suite passed. So, there _should_ not >>>> be any issues. However, I know that Bio::PrimarySeq is a central >>>> module of Bioperl, so please, have a look at it and let me know if >>>> there are any glaring errors. 
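
The two cheapest ideas in the list quoted above (store the length when the sequence is set, and slice with substr() instead of eval()) can be sketched in plain Perl. This is only an illustration under stated assumptions: ToySeq is a made-up class, not Bio::PrimarySeq, and its method bodies are not taken from the seqlength branch.

```perl
use strict;
use warnings;

# Toy class, NOT Bio::PrimarySeq: it only illustrates caching the length
# when the sequence is set, and slicing with substr().
package ToySeq;

sub new {
    my ($class, %args) = @_;
    my $self = bless { seq => '', length => 0 }, $class;
    $self->seq($args{seq}) if defined $args{seq};
    return $self;
}

sub seq {
    my ($self, $value) = @_;
    if (defined $value) {
        $self->{seq}    = $value;
        $self->{length} = CORE::length($value);  # computed once, at set time
    }
    return $self->{seq};
}

# Returns the stored value; the sequence string is not touched here.
sub length { return $_[0]{length} }

# 1-based, inclusive coordinates.
sub subseq {
    my ($self, $start, $end) = @_;
    die "Bad coordinates\n"
        if $start < 1 || $end > $self->{length} || $start > $end;
    return substr($self->{seq}, $start - 1, $end - $start + 1);
}

package main;

my $s = ToySeq->new(seq => 'ACGTACGTAA');
print $s->length, "\n";
print $s->subseq(3, 6), "\n";
```

With this layout, repeated length() calls cost a hash lookup no matter how long the sequence is, which is the effect the message describes for multi-Mbp genomes.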
> >>>>
>>>> Thanks,
>>>>
>>>> Florent
>>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From mahakadry at aucegypt.edu Tue Nov 20 13:44:53 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Tue, 20 Nov 2012 20:44:53 +0200
Subject: [Bioperl-l] Parsing a blast report with multiple queries into separate one query files that only contain the fasta sequences
Message-ID:

Dear BioPerl list,
I blasted a file that has several fasta queries against nr, however I need to align each query with its hits for further computational analysis so I need to parse the produced blast report into several files that each has only the fasta query sequence and its hits in fasta format.
I found this script online,

    use Bio::Search::Result::BlastResult;
    use Bio::SearchIO;

    my $report = Bio::SearchIO->new( -file=>'full-report.bls', -format => blast);
    my $result = $report->next_result;
    my %hits_by_query;
    while (my $hit = $result->next_hit) {
        push @{$hits_by_query{$hit->name}}, $hit;
    }
    foreach my $qid ( keys %hits_by_query ) {
        my $result = Bio::Search::Result::BlastResult->new();
        $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
        my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format =>'blast' );
        $blio->write_result($result);
    }

however on using it this produced the following error message

    BlastResult::new(): Not adding iterations.

    ------------- EXCEPTION: Bio::Root::NoSuchThing -------------
    MSG: No such iteration number: 0. Valid range=1-0
    VALUE: The number zero (0)
    STACK: Error::throw
    STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
    STACK: Bio::Search::Result::BlastResult::iteration /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
    STACK: Bio::Search::Result::BlastResult::add_hit /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
    STACK: ./parsing.blast.results.into.per.query.files.pl:15

I tried to search for other scripts but I couldn't find any
I would really appreciate your comments to this
Thank you

From cjfields at illinois.edu Tue Nov 20 14:21:25 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Nov 2012 19:21:25 +0000
Subject: [Bioperl-l] Parsing a blast report with multiple queries into separate one query files that only contain the fasta sequences
In-Reply-To:
References:
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF22E15@CITESMBX5.ad.uillinois.edu>

Maha,

Do you need only the sequence reported in the report (e.g. the HSP alignments) or the original FASTA sequences? The former can be recovered from the Bio::Search::HSP::GenericHSP objects as an alignment, and this can be redirected to a FASTA file. The latter is a little trickier, as you will have to retrieve the sequences from their original source files.

chris

On Nov 20, 2012, at 12:44 PM, maha ahmed wrote:

> Dear BioPerl list,
> I blasted a file that has several fasta queries against nr, however I need
> to align each query with its hits for further computational analysis so I
> need to parse the produced blast report into several files that each has
> only the fasta query sequence and its hits in fasta format.
> I found this script online, > > use Bio::Search::Result::BlastResult;use Bio::SearchIO; > my $report = Bio::SearchIO->new( -file=>'full-report.bls', -format > => blast);my $result = > $report->next_result;my %hits_by_query;while (my $hit = > $result->next_hit) { > push > @{$hits_by_query{$hit->name}}, $hit;} > foreach my $qid ( keys > %hits_by_query ) { > my $result = Bio::Search::Result::BlastResult->new(); > $result->add_hit($_) for ( @{$hits_by_query{$qid}} ); > my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format > =>'blast' ); > $blio->write_result($result);} > > > > however on using it this produced the following error message > > > > BlastResult::new(): Not adding iterations. > > ------------- EXCEPTION: Bio::Root::NoSuchThing ------------- > MSG: No such iteration number: 0. Valid range=1-0 > VALUE: The number zero (0) > STACK: Error::throw > STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472 > STACK: Bio::Search::Result::BlastResult::iteration > /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327 > STACK: Bio::Search::Result::BlastResult::add_hit > /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257 > STACK: ./parsing.blast.results.into.per.query.files.pl:15 > > I tried to search for other scripts but I couldn't find any > I would really appreciate your comments to this > Thank you > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From rfhorns at gmail.com Thu Nov 1 20:01:34 2012 From: rfhorns at gmail.com (Felix Horns) Date: Fri, 02 Nov 2012 00:01:34 -0000 Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream Message-ID: Hello everyone. I am having trouble using the get_Stream_by_query() function in Bio::DB::GenBank. It seems to return an empty stream, such that $stream->next_seq never returns anything. 
However, $query->count is returning the expected value (139). Also, get_Stream_by_query() seems to be querying the database, as when I pass it an array of GeneIDs that have not been properly formatted, i.e. GeneID:7816864, instead of simply 7816864, it returns warnings and errors: "MSG: Warning(s) from GenBank: GeneID 7817709...; MSG: Error from Genbank: No items found.". I have included my full code below. I have also included the output from the code below that. The code is intended to find genes located within a genomic region. I will later find the protein domains and pathways that those genes are involved in. Any help would be greatly appreciated. I realize that this is probably a very simple question, but I am relatively new to BioPerl and I've spent the better part of the day trying to figure out such issues, so I would be very thankful for help. Felix #!/usr/bin/perl use strict; use Bio::SeqIO; use Bio::DB::EntrezGene; use Bio::DB::GenBank; # Load reference sequence # Load from local .gb file # Note that .gb file does not include sequences # my $gbfile = "NC_012660.1.gb"; # my $seqio = Bio::SeqIO->new(-file => $gbfile); # my $ref_seq = $seqio->next_seq; # To access reference sequence programatically, uncomment this code my $gb = new Bio::DB::GenBank; my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1"); # Specify coordinates of gap my $gap_start = 2050506; my $gap_end = 2190530; my $gene_count = 0; my @features; my @starts; my @ends; my @db_xrefs; my @products; my @protein_ids; # Get gene features in gap for my $feat ($ref_seq->get_SeqFeatures) { my $start=$feat->location->start; my $end=$feat->location->end; if (($feat->primary_tag eq 'gene') & ($gap_start < $start) & ($start < $gap_end) & ($gap_start < $end) & ($end < $gap_end)) { $gene_count += 1; # Get GeneID reference my $db_xref = ($feat->get_tag_values('db_xref'))[0]; $db_xref =~ s/GeneID://; # Trim "GeneID:" from start of $db_xref push @features, $feat; push @starts, $start; push @ends, $end; push 
@db_xrefs, $db_xref; } } # Get data about gene features from GeneID reference my $query = Bio::DB::Query::GenBank->new(-db => 'gene', -ids => [@db_xrefs]); my $stream = $gb->get_Stream_by_query($query); while (my $seq = $stream->next_seq) { for my $feat ($seq->all_SeqFeatures) { print "primary tag: ", $feat->primary_tag, "\n"; for my $tag ($feat->get_all_tags) { print " tag: ", $tag, "\n"; for my $value ($feat->get_tag_values($tag)) { print " value: ", $value, "\n"; } } } } print $query->count,"\n"; print $gene_count, "\n"; OUTPUT > perl analyze_gap.pl 139 139 Note that no "primary tag; tag; value" items are printed. Furthermore, when I put a print line immediately after the (while (my $seq = $stream->next_seq)) statement, it was never called, seemingly indicating that the stream is empty. From mooldhu at gmail.com Tue Nov 6 02:38:57 2012 From: mooldhu at gmail.com (=?GB2312?B?uvq9rQ==?=) Date: Tue, 6 Nov 2012 15:38:57 +0800 Subject: [Bioperl-l] Ask for help about Bioperl Message-ID: hi, when I use bioperl ,it report errors like this :--------------------- WARNING --------------------- MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,:: --------------------------------------------------- Error providing evidence type: GeneModel The error was: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Attempting to set the sequence '1' to [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy STACK: Error::throw STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486 STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285 STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239 STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383 but,I am sure that the input file only cotain [ATGCN],I also try to use another sequences ,but the errors are the same.my bioperl is Bioperl-live 1.006902; -- ???? 
From assayagy at gmail.com Sat Nov 10 13:27:03 2012
From: assayagy at gmail.com (eyla4ever)
Date: Sat, 10 Nov 2012 10:27:03 -0800 (PST)
Subject: [Bioperl-l] Extracting sequences from Genbank files
In-Reply-To:
References:
Message-ID: <34664632.post@talk.nabble.com>

Hello Brian,

I would like you to send me your script. I think it can help me solve a big problem and finish my final project. I hope that will be possible.

Regards,
Eyla

BForde wrote:
>
> Hello,
>
> I have been modifying a script which extracts all the protein sequences
> from a genbank file and saves them in a multi-fasta file.
>
> I wish the fasta header to have both the locus_tag of the protein and the
> product. However I cannot get the product tag to write to the fasta
> header
>
> this is the relevant section of the script
>
> $s->display_id($f->has_tag('locus_tag') ? join(',',sort
> $f->each_tag_value('locus_tag')) :
> $f->has_tag('product') ?
> join(',',$f->each_tag_value('product')):
> $s->display_id);
>
> is "product" not an actual tag
>
> regards
>
> Brian
>
>
>
> --
> Brian Forde
> Microbiology Dept.
> Bioscience Institute. Room 4.11
> University College Cork
> Cork
> Ireland
> tel:+353 21 4901306
> email: b.m.forde at umail.ucc.ie
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

--
View this message in context: http://old.nabble.com/Extracting-sequences-from-Genbank-files-tp33901023p34664632.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
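
On the question quoted in the message above: "product" is a real GenBank qualifier, but the nested ternary in the snippet stops at the first branch that matches, so whenever a feature has a locus_tag the product branch is never reached. A minimal sketch of one way to combine both values follows. It assumes the tag values have already been collected into a plain hash (in BioPerl they would presumably come from the feature's has_tag and get_tag_values methods). The helper name build_header, the '|' separator and the sample values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): build a FASTA ID from a hash
# of feature tags, e.g. { locus_tag => [...], product => [...] }.
sub build_header {
    my ($tags, $fallback) = @_;
    my @parts;
    for my $tag (qw(locus_tag product)) {
        # Skip tags that are absent or have no values.
        next unless exists $tags->{$tag} && @{ $tags->{$tag} };
        push @parts, join(',', sort @{ $tags->{$tag} });
    }
    # Fall back to the existing display ID when neither tag is present.
    return @parts ? join('|', @parts) : $fallback;
}

# Illustrative values only.
my %cds_tags = (
    locus_tag => ['ABC_0001'],
    product   => ['DNA polymerase III'],
);
print build_header(\%cds_tags, 'unknown'), "\n";
```

Product names usually contain spaces, and most FASTA readers treat everything after the first space as the description, so it may be cleaner to keep the locus_tag as the ID and put the product in the description field.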
From bosborne11 at verizon.net Tue Nov 20 18:50:00 2012 From: bosborne11 at verizon.net (Brian Osborne) Date: Tue, 20 Nov 2012 18:50:00 -0500 Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream In-Reply-To: References: Message-ID: <5F077DEA-DEBD-42BC-87E7-327697764CFE@verizon.net> Felix, I took a look at the Bio::DB::Query::GenBank documentation, it says this: If you provide an array reference of IDs in -ids, the query will be ignored and the list of IDs will be used when the query is passed to a Bio::DB::GenBank object's get_Stream_by_query() method. Bio::DB::Genbank queries "nucleotide", by default. You have GeneIDs. I see that you're setting "-id" to "gene" but note that you're passing that query to a plain Bio::DB::GenBank object. Not sure what the expected behavior is here. I would try using the NCBI Eutilities for that second query, rather than Bio::DB::Query::GenBank (http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook). Brian O. On Nov 1, 2012, at 8:01 PM, Felix Horns wrote: > Hello everyone. > > I am having trouble using the get_Stream_by_query() function > in Bio::DB::GenBank. It seems to return an empty stream, such that > $stream->next_seq never returns anything. > > However, $query->count is returning the expected value (139). Also, > get_Stream_by_query() seems to be querying the database, as when I pass it > an array of GeneIDs that have not been properly formatted, i.e. > GeneID:7816864, instead of simply 7816864, it returns warnings and errors: > "MSG: Warning(s) from GenBank: GeneID 7817709...; MSG: > Error from Genbank: No items found.". > > I have included my full code below. I have also included the output from > the code below that. The code is intended to find genes located within a > genomic region. I will later find the protein domains and pathways that > those genes are involved in. > > Any help would be greatly appreciated. 
I realize that this is probably a > very simple question, but I am relatively new to BioPerl and I've spent the > better part of the day trying to figure out such issues, so I would be very > thankful for help. > > Felix > > > #!/usr/bin/perl > use strict; > use Bio::SeqIO; > use Bio::DB::EntrezGene; > use Bio::DB::GenBank; > > # Load reference sequence > # Load from local .gb file > # Note that .gb file does not include sequences > # my $gbfile = "NC_012660.1.gb"; > # my $seqio = Bio::SeqIO->new(-file => $gbfile); > # my $ref_seq = $seqio->next_seq; > > # To access reference sequence programatically, uncomment this code > my $gb = new Bio::DB::GenBank; > my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1"); > > # Specify coordinates of gap > my $gap_start = 2050506; > my $gap_end = 2190530; > > my $gene_count = 0; > my @features; > my @starts; > my @ends; > my @db_xrefs; > > my @products; > my @protein_ids; > > # Get gene features in gap > for my $feat ($ref_seq->get_SeqFeatures) { > my $start=$feat->location->start; > my $end=$feat->location->end; > > if (($feat->primary_tag eq 'gene') & > ($gap_start < $start) & ($start < $gap_end) & > ($gap_start < $end) & ($end < $gap_end)) { > > $gene_count += 1; > > # Get GeneID reference > my $db_xref = ($feat->get_tag_values('db_xref'))[0]; > $db_xref =~ s/GeneID://; # Trim "GeneID:" from start of $db_xref > > push @features, $feat; > push @starts, $start; > push @ends, $end; > push @db_xrefs, $db_xref; > } > } > > # Get data about gene features from GeneID reference > my $query = Bio::DB::Query::GenBank->new(-db => 'gene', > -ids => [@db_xrefs]); > my $stream = $gb->get_Stream_by_query($query); > > while (my $seq = $stream->next_seq) { > for my $feat ($seq->all_SeqFeatures) { > print "primary tag: ", $feat->primary_tag, "\n"; > for my $tag ($feat->get_all_tags) { > print " tag: ", $tag, "\n"; > for my $value ($feat->get_tag_values($tag)) { > print " value: ", $value, "\n"; > } > } > } > } > > print $query->count,"\n"; > print 
$gene_count, "\n"; > > > OUTPUT >> perl analyze_gap.pl > 139 > 139 > > Note that no "primary tag; tag; value" items are printed. Furthermore, > when I put a print line immediately after the (while (my $seq = > $stream->next_seq)) statement, it was never called, seemingly indicating > that the stream is empty. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bosborne11 at verizon.net Tue Nov 20 18:52:00 2012 From: bosborne11 at verizon.net (Brian Osborne) Date: Tue, 20 Nov 2012 18:52:00 -0500 Subject: [Bioperl-l] Ask for help about Bioperl In-Reply-To: References: Message-ID: <3B472C83-7F6C-41E2-A629-E6A2BDC6B075@verizon.net> ????, You're going to have show us your code, we can't help you just by seeing the error messages. Show us the input file as well, or the beginning of it. Brian O. On Nov 6, 2012, at 2:38 AM, ???? wrote: > hi, > when I use bioperl ,it report errors like this :--------------------- > WARNING --------------------- > MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,:: > --------------------------------------------------- > Error providing evidence type: GeneModel > The error was: > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Attempting to set the sequence '1' to > [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy > STACK: Error::throw > STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486 > STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285 > STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239 > STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383 > > > but,I am sure that the input file only cotain [ATGCN],I also try to use > another sequences ,but the errors are the same.my bioperl is Bioperl-live > 1.006902; > > -- > ???? 
> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at drycafe.net Tue Nov 20 21:24:50 2012 From: hlapp at drycafe.net (Hilmar Lapp) Date: Tue, 20 Nov 2012 21:24:50 -0500 Subject: [Bioperl-l] handle with file in perl In-Reply-To: <34626730.post@talk.nabble.com> References: <34626730.post@talk.nabble.com> Message-ID: <1DE09B34-5124-478C-8925-0045EC119CFC@drycafe.net> This sounds like a homework assignment. We're not here to do your homework or assignments for you. You can post if you run into a specific problem when solving your assignment with Bioperl, and we'll help with that. -hilmar Sent with a tap. On Oct 31, 2012, at 7:45 PM, eyla4ever wrote: > > hi > > i want to write a function that get as parameters : file_name, hsp , hit. > and i want her to print all the blast Field that i need to this file. > > i do it because i have 2 files with the same Fields. > > > 10X > -- > View this message in context: http://old.nabble.com/handle-with-file-in-perl-tp34626730p34626730.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From mahakadry at aucegypt.edu Fri Nov 23 20:33:59 2012 From: mahakadry at aucegypt.edu (maha ahmed) Date: Sat, 24 Nov 2012 03:33:59 +0200 Subject: [Bioperl-l] retrieving a subset of files from a folder Message-ID: Dear Bioperl list, I have a folder that has 60,000 files (one file for each phylogenetic tree) However I only need to work with a subset of 1,000 files from that folder (the files are not numbered in order so I cant use the i++ loop in my bioperl script) Is there a way to write a script that only moves files with the names given in a list in a text file i.e. 
I have a file that has the names of the files I want to copy from the folder and I want to write a script that does this. Thank you so much

From kellert at ohsu.edu Sat Nov 24 13:08:11 2012
From: kellert at ohsu.edu (Tom Keller)
Date: Sat, 24 Nov 2012 10:08:11 -0800
Subject: [Bioperl-l] use cookbook to work with a directory of files
In-Reply-To:
References:
Message-ID:

A search with the phrase "perl cookbook filenames from directory" should help you find what you need.

On Nov 24, 2012, at 9:00 AM, bioperl-l-request at lists.open-bio.org wrote:

> Message: 1
> Date: Sat, 24 Nov 2012 03:33:59 +0200
> From: maha ahmed
> Subject: [Bioperl-l] retrieving a subset of files from a folder
> To: Bioperl-l at lists.open-bio.org
>
> Dear Bioperl list,
> I have a folder that has 60,000 files (one file for each phylogenetic tree)
> However I only need to work with a subset of 1,000 files from that folder
> (the files are not numbered in order so I can't use the i++ loop in my
> bioperl script)
> Is there a way to write a script that only moves files with the names given
> in a list in a text file
> i.e. I have a file that has the names of the files I want to copy from the
> folder and I want to write a script that does this
> Thank you so much

From minou.nowrousian at rub.de Sat Nov 24 13:24:02 2012
From: minou.nowrousian at rub.de (Minou Nowrousian)
Date: 24 Nov 2012 19:24:02 +0100
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <000001cdca70$e1a97720$a4fc6560$@rub.de>

>Dear Bioperl list,
>I have a folder that has 60,000 files (one file for each phylogenetic tree) However I only need to work with a subset of 1,000 files from that folder
>(the files are not numbered in order so I can't use the i++ loop in my bioperl script) Is there a way to write a script that only moves files with the
>names given in a list in a text file i.e.
I have a file that has the names of the files I want to copy fro m the folder and I want to write script that does >this Thank you so much I don't know if there is a BioPerl solution, but you could use the File::Copy module (available from CPAN): use File::Copy; copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy failed: $!"; Regards, Minou From mahakadry at aucegypt.edu Sat Nov 24 14:04:09 2012 From: mahakadry at aucegypt.edu (maha ahmed) Date: Sat, 24 Nov 2012 21:04:09 +0200 Subject: [Bioperl-l] retrieving a subset of files from a folder In-Reply-To: <000001cdca70$e1a97720$a4fc6560$@rub.de> References: <000001cdca70$e1a97720$a4fc6560$@rub.de> Message-ID: Thanks everyone , I actually found a one line command that I am going to try: xargs -a file_list.txt mv -t /path/to/des thanks for your help I will read have a look at the readings you suggested thank you On Sat, Nov 24, 2012 at 8:24 PM, Minou Nowrousian wrote: > > >Dear Bioperl list, > >I have a folder that has 60,000 files (one file for each phylogenetic > tree) > However I only need to work with a subset of 1,000 files from that folder > >(the files are not numbered in order so I cant use the i++ loop in my > bioperl script) Is there a way to write a script that only moves files with > the >names given in a list in a text file i.e. I have a file that has the > names of the files I want to copy fro m the folder and I want to write > script that does >this Thank you so much > > I don't know if there is a BioPerl solution, but you could use the > File::Copy module (available from CPAN): > > use File::Copy; > copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy > failed: $!"; > > Regards, > Minou > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From maj at fortinbras.us Tue Nov 27 08:49:46 2012 From: maj at fortinbras.us (Mark A. 
Jensen) Date: Tue, 27 Nov 2012 13:49:46 +0000 Subject: [Bioperl-l] Neo4j : applying user defined validation and constraints Message-ID: Hi Folks, Since there was some enthusiasm about REST::Neo4p, my interface to Neo4j, I thought I would let you know about https://metacpan.org/module/REST::Neo4p::Constrain This is a framework that lets you apply constraints on node and relationship properties, relationships, and relationship types. You can specify your constraints, and have REST::Neo4p throw exceptions when the constraints aren't met, or you can do validation on existing database items. The pod has a full explanation and examples aplenty. Please have a look and send bugs my way via RT. Cheers all, MAJ From francescomusacchia at gmail.com Wed Nov 28 05:27:16 2012 From: francescomusacchia at gmail.com (Francesco Musacchia) Date: Wed, 28 Nov 2012 02:27:16 -0800 (PST) Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access Message-ID: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> Hi all, I have a big problem with using a GFF3 database with BioPerl. This is not a question about how to write some BioPerl code. When I have to do a lot of accesses on a GFF database (with Bio::DB::SeqFeature::Store), it gets progressively slower, to the point that my script can run for more than a day. How can I solve it? Or can it not be done? Thanks a lot! From florent.angly at gmail.com Thu Nov 1 05:49:13 2012 From: florent.angly at gmail.com (Florent Angly) Date: Thu, 01 Nov 2012 15:49:13 +1000 Subject: [Bioperl-l] Bio::PrimarySeq speedup Message-ID: <50920D59.4010307@gmail.com> Hi all, I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on Bioperl. Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were microbial genomes, i.e.
several Mbp long sequences, which is quite long. A lot of the performance cost was associated with passing full genomes between functions. For example, when doing a call to length(), length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So, every call to length() is very expensive for long sequences. And there is a lot of code that calls length(), for error checking. I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g. Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to improve the documentation. In addition, to address the speed issue, I made some changes to Bio::PrimarySeq and Bio::PrimarySeqI: * The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no extra call to seq() (which would incur the cost of copying the entire sequence between functions). * The length is saved as an object attribute. So, calling length() is very cheap since it only needs to retrieve the stored value for the length. * There is a constructor argument called -direct, which skips sequence validation. However, it was only active in conjunction with the -ref_to_seq argument. To make -direct conform better to its documented purpose, I made -direct work when a sequence is set through -seq as well. * This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq object from an existing (already validated!)
Bio::PrimarySeq object, the new object can be constructed with the -direct argument, to save some time. * Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and these calls were easily replaced by simple calls to substr() to save some time. A real-world test I performed with Grinder took 3min28s before the changes (of which ~1 min is spent doing something unrelated). After the changes, the same test took only 2min28s. So, it's quite a significant improvement and on more specific test cases, performance gains can obviously be much bigger. Also, I anticipate that the gains would be bigger for even longer sequences. All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of Bioperl, so please, have a look at it and let me know if there are any glaring errors. Thanks, Florent From shalabh.sharma7 at gmail.com Thu Nov 1 19:36:35 2012 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Thu, 1 Nov 2012 15:36:35 -0400 Subject: [Bioperl-l] blast question Message-ID: Hi All, First of all, I am really sorry for posting a BLAST question in this forum; I am not sure if this is the right place. I would really appreciate it if anyone could point me in the right direction. I am using blastall to get the top hit from a database, so I am using -v 1 -b 1 (I hope this is right). But the strange part is that I am getting wrong results. For example: if I use -v 1 -b 1, then for one of the hits I am getting this: Sequences producing significant alignments: (bits) Value fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 4e-04 If I use -v 3 -b 3, then I am getting this for the same query: Sequences producing significant alignments: (bits) Value fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...
570 e-167 fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 9e-07 fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 1.0 As you can see the top hit in the first case is totally wrong. I would really appreciate if someone can help me out, or direct to in the right direction. Thanks Shalabh -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From cjfields at illinois.edu Thu Nov 1 21:41:43 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 1 Nov 2012 21:41:43 +0000 Subject: [Bioperl-l] blast question In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> That's a scary error, but the best place to submit this would be the BLAST help list at NCBI (cc'd) chris On Nov 1, 2012, at 2:36 PM, shalabh sharma wrote: > Hi All, > First of all i am really very sorry for posting blast question in > this forum, I am not sure if this is the right place. > I will really appreciate if anyone can guide me to the right direction. > > I am using blastall to get a top hit from a database so i am using -v 1 -b > 1 (i hope this is right). > But the strange part is that i am getting wrong results. > > for example: if i use -v 1 -b 1 then for one of the hit i am getting this: > > > Sequences producing significant alignments: (bits) > Value > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > 4e-04 > > > If i use -v 3 -b 3 then i am getting this for the same query: > > Sequences producing significant alignments: (bits) > Value > > fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 570 > e-167 > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > 9e-07 > fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 > 1.0 > > As you can see the top hit in the first case is totally wrong. 
> > I would really appreciate if someone can help me out, or direct to in the > right direction. > > Thanks > Shalabh > > > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences > University of Georgia > Athens, GA 30602-3636 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From shalabh.sharma7 at gmail.com Fri Nov 2 14:50:17 2012 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Fri, 2 Nov 2012 10:50:17 -0400 Subject: [Bioperl-l] blast question In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> Message-ID: I know, i am really worried about my past analysis now. Thanks a lot for cc'ing this mail Chris. -Shalabh On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J wrote: > That's a scary error, but the best place to submit this would be the BLAST > help list at NCBI (cc'd) > > chris > > On Nov 1, 2012, at 2:36 PM, shalabh sharma > wrote: > > > Hi All, > > First of all i am really very sorry for posting blast question > in > > this forum, I am not sure if this is the right place. > > I will really appreciate if anyone can guide me to the right direction. > > > > I am using blastall to get a top hit from a database so i am using -v 1 > -b > > 1 (i hope this is right). > > But the strange part is that i am getting wrong results. 
> > > > for example: if i use -v 1 -b 1 then for one of the hit i am getting > this: > > > > > > Sequences producing significant alignments: (bits) > > Value > > > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > > 4e-04 > > > > > > If i use -v 3 -b 3 then i am getting this for the same query: > > > > Sequences producing significant alignments: (bits) > > Value > > > > fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 570 > > e-167 > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > > 9e-07 > > fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 > > 1.0 > > > > As you can see the top hit in the first case is totally wrong. > > > > I would really appreciate if someone can help me out, or direct to in the > > right direction. > > > > Thanks > > Shalabh > > > > > > > > -- > > Shalabh Sharma > > Scientific Computing Professional Associate (Bioinformatics Specialist) > > Department of Marine Sciences > > University of Georgia > > Athens, GA 30602-3636 > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From Scott.Markel at accelrys.com Sat Nov 3 00:13:59 2012 From: Scott.Markel at accelrys.com (Scott Markel) Date: Fri, 2 Nov 2012 17:13:59 -0700 Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB file format specification change Message-ID: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net> In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html). PDB now writes out to column 79, while pdb.pm is still using the old line length of 71. 
Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated. Some of the Perl lines are really simple, e.g., $keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer); with others being just a little more detailed, e.g., my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_; It doesn't look like pdb.pm has changed in about 1.5 years. Is there a current module owner? Or someone else working on this? If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file. Please let us know which is preferred. Scott Scott Markel, Ph.D. Principal Bioinformatics Architect email: smarkel at accelrys.com Accelrys (Pipeline Pilot R&D) mobile: +1 858 205 3653 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 San Diego, CA 92121 fax: +1 858 799 5222 USA web: http://www.accelrys.com http://www.linkedin.com/in/smarkel Secretary, Board of Directors:
International Society for Computational Biology Chair: ISCB Publications and Communications Committee Associate Editor: PLoS Computational Biology Editorial Board: Briefings in Bioinformatics From cjfields at illinois.edu Sat Nov 3 02:08:52 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sat, 3 Nov 2012 02:08:52 +0000 Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB file format specification change In-Reply-To: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net> References: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC817F8@CHIMBX5.ad.uillinois.edu> On Nov 2, 2012, at 7:13 PM, Scott Markel wrote: > In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html). PDB now writes out to column 79, while pdb.pm is still using the old line length of 71. Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated. > > Some of the Perl lines are really simple, e.g., > > $keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer); > > with others being just a little more detailed, e.g., > > my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_; > > It doesn't look like pdb.pm has changed in about 1.5 years. Is there a current module owner? Or someone else working on this? No one has really taken ownership, so as far as I'm concerned it's open. Any objections? > If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file. Please let us know which is preferred. A new version of the file is fine if you have someone who can work on it. We would also like to change relevant tests and documentation if there is time. 
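For anyone following the column arithmetic, here is a minimal sketch of how the unpack template quoted above truncates a record. It is only an illustration under stated assumptions, not the actual pdb.pm fix: the JRNL TITL line is made up, and the wider A60 template is simply derived from the "column 79" figure mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical 60-character title text, enough to fill columns 20-79.
my $text = join '', map { chr( 65 + $_ % 26 ) } 0 .. 59;

# Layout assumed here: columns 1-6 record name, 7-12 blank, 13-16 sub-record,
# 17-18 continuation, 19 blank, 20-79 text.
my $line = sprintf '%-6s%6s%-4s%2s %s', 'JRNL', '', 'TITL', '', $text;

# Template quoted in the thread: A51 reads the text only through column 70.
my ( $rec, $subr, $cont, $rol_old ) = unpack 'A6 x6 A4 A2 x1 A51', $line;

# Assumed wider template: A60 reads the text through column 79.
my ( undef, undef, undef, $rol_new ) = unpack 'A6 x6 A4 A2 x1 A60', $line;

# prints: JRNL TITL old=51 new=60
printf "%s %s old=%d new=%d\n", $rec, $subr, length $rol_old, length $rol_new;
```

With the old template the last nine characters of the text are silently dropped, which matches the truncated journal titles described above.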
> Scott > > Scott Markel, Ph.D. > Principal Bioinformatics Architect email: smarkel at accelrys.com > Accelrys (Pipeline Pilot R&D) mobile: +1 858 205 3653 > 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 > San Diego, CA 92121 fax: +1 858 799 5222 > USA web: http://www.accelrys.com > > http://www.linkedin.com/in/smarkel > Secretary, Board of Directors: > International Society for Computational Biology > Chair: ISCB Publications and Communications Committee > Associate Editor: PLoS Computational Biology > Editorial Board: Briefings in Bioinformatics Thanks Scott! chris From Russell.Smithies at agresearch.co.nz Sun Nov 4 21:00:37 2012 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Mon, 5 Nov 2012 10:00:37 +1300 Subject: [Bioperl-l] blast question In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> What version of blast are you using? There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+ --Russell -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma Sent: Saturday, 3 November 2012 3:50 a.m. To: Fields, Christopher J Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov Subject: Re: [Bioperl-l] blast question I know, i am really worried about my past analysis now. Thanks a lot for cc'ing this mail Chris. -Shalabh On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J wrote: > That's a scary error, but the best place to submit this would be the > BLAST help list at NCBI (cc'd) > > chris > > On Nov 1, 2012, at 2:36 PM, shalabh sharma > wrote: > > > Hi All, > > First of all i am really very sorry for posting blast > > question > in > > this forum, I am not sure if this is the right place. 
> > I will really appreciate if anyone can guide me to the right direction. > > > > I am using blastall to get a top hit from a database so i am using > > -v 1 > -b > > 1 (i hope this is right). > > But the strange part is that i am getting wrong results. > > > > for example: if i use -v 1 -b 1 then for one of the hit i am getting > this: > > > > > > Sequences producing significant alignments: (bits) > > Value > > > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > > 4e-04 > > > > > > If i use -v 3 -b 3 then i am getting this for the same query: > > > > Sequences producing significant alignments: (bits) > > Value > > > > fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 570 > > e-167 > > fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 > > 9e-07 > > fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 18 > > 1.0 > > > > As you can see the top hit in the first case is totally wrong. > > > > I would really appreciate if someone can help me out, or direct to > > in the right direction. 
> > > > Thanks > > Shalabh > > > > > > > > -- > > Shalabh Sharma > > Scientific Computing Professional Associate (Bioinformatics > > Specialist) Department of Marine Sciences University of Georgia > > Athens, GA 30602-3636 > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From cjfields at illinois.edu Sun Nov 4 22:13:37 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sun, 4 Nov 2012 22:13:37 +0000 Subject: [Bioperl-l] blast question In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu> That in fact is the recommendation (migrate to BLAST+). 
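Independently of the BLAST version, one possible workaround is to keep a few hits per query and pick the best one afterwards, rather than relying on -v 1 -b 1. A minimal sketch, assuming tabular output (blastall -m 8, or -outfmt 6 in BLAST+) with the query ID in the first column and the bit score in the twelfth. The rows below are made up for illustration (only the subject IDs and bit scores are patterned after this thread), and best_hits is a hypothetical helper, not a BioPerl function.

```perl
use strict;
use warnings;

# Return a hash ref mapping each query ID to its tabular row (as an array
# ref of fields) with the highest bit score.
sub best_hits {
    my @rows = @_;
    my %best;
    for my $row (@rows) {
        my @f = split /\t/, $row;
        next unless @f >= 12;    # skip malformed lines
        my ( $query, $bits ) = @f[ 0, 11 ];
        if ( !exists $best{$query} || $bits > $best{$query}[11] ) {
            $best{$query} = \@f;
        }
    }
    return \%best;
}

# Made-up rows for a single hypothetical query, deliberately out of order.
my @rows = (
    join( "\t", 'query1', 'fig|6666666.11092.peg.487',  '30.0', 100, 70, 0, 1, 100, 1, 100, '9e-07',  38 ),
    join( "\t", 'query1', 'fig|6666666.11092.peg.1134', '95.0', 300, 15, 0, 1, 300, 1, 300, '1e-167', 570 ),
    join( "\t", 'query1', 'fig|6666666.11092.peg.1133', '25.0', 40,  30, 0, 1, 40,  1, 40,  '1.0',    18 ),
);

my $best = best_hits(@rows);
print "$_\t$best->{$_}[1]\t$best->{$_}[11]\n" for sort keys %$best;
```

Because only one row per query is held in memory, the same loop could read a large results file line by line instead of a list.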
chris On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" wrote: > What version of blast are you using? > There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+ > > > --Russell > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma > Sent: Saturday, 3 November 2012 3:50 a.m. > To: Fields, Christopher J > Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov > Subject: Re: [Bioperl-l] blast question > > I know, i am really worried about my past analysis now. > Thanks a lot for cc'ing this mail Chris. > > -Shalabh > > On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J > wrote: > >> That's a scary error, but the best place to submit this would be the >> BLAST help list at NCBI (cc'd) >> >> chris >> >> On Nov 1, 2012, at 2:36 PM, shalabh sharma >> wrote: >> >>> Hi All, >>> First of all i am really very sorry for posting blast >>> question >> in >>> this forum, I am not sure if this is the right place. >>> I will really appreciate if anyone can guide me to the right direction. >>> >>> I am using blastall to get a top hit from a database so i am using >>> -v 1 >> -b >>> 1 (i hope this is right). >>> But the strange part is that i am getting wrong results. >>> >>> for example: if i use -v 1 -b 1 then for one of the hit i am getting >> this: >>> >>> >>> Sequences producing significant alignments: (bits) >>> Value >>> >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 >>> 4e-04 >>> >>> >>> If i use -v 3 -b 3 then i am getting this for the same query: >>> >>> Sequences producing significant alignments: (bits) >>> Value >>> >>> fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin... 570 >>> e-167 >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA 38 >>> 9e-07 >>> fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2... 
18 >>> 1.0 >>> >>> As you can see the top hit in the first case is totally wrong. >>> >>> I would really appreciate if someone can help me out, or direct to >>> in the right direction. >>> >>> Thanks >>> Shalabh >>> >>> >>> >>> -- >>> Shalabh Sharma >>> Scientific Computing Professional Associate (Bioinformatics >>> Specialist) Department of Marine Sciences University of Georgia >>> Athens, GA 30602-3636 >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. 
> ======================================================================= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From florent.angly at gmail.com Mon Nov 5 00:46:44 2012 From: florent.angly at gmail.com (Florent Angly) Date: Mon, 05 Nov 2012 10:46:44 +1000 Subject: [Bioperl-l] Bio::PrimarySeq speedup In-Reply-To: <50920D59.4010307@gmail.com> References: <50920D59.4010307@gmail.com> Message-ID: <50970C74.7070605@gmail.com> I am planning on merging the branch with master this week. Best, Florent On 01/11/12 15:49, Florent Angly wrote: > Hi all, > > I was working with Ben Woodcroft on identifying ways to speed up > Grinder, which relies heavily on Bioperl. Ben did some profiling with > NYTProf and we realized that a lot of computation time was spent in > Bio::PrimarySeq, doing calls to subseq() and length(). The sequences > we used for the profiling were microbial genomes, i.e. several Mbp > long sequences, which is quite long. A lot of the performance cost was > associated with passing full genomes between functions. For example, > when doing a call to length(), length() requests the full sequence > from seq(), which returns it back to length() (it makes a copy!). So, > every call to length is very expensive for long sequences. And there > is a lot of code that calls length(), for error checking. > > I know that there are a few Bioperl modules that are more adapted to > handling very long sequences, e.g. Bio::DB::Fasta or > Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at > Bio::PrimarySeq with Ben and I released this commit: > https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
> But in fact, there were more things that I wanted to try to improve, > which led me to start this new branch: > https://github.com/bioperl/bioperl-live/tree/seqlength > > I wrote quite a few tests for functionalities that were not previously > covered by tests, and tried to improve the documentation. In addition, > to address the speed issue, I made some changes to Bio::PrimarySeq and > Bio::PrimarySeqI: > * The length of a sequence is now computed as soon as the sequence is > set, not after. This way, there is no extra call to seq() (which would > incur the cost of copying the entire sequence between functions). > * The length is saved as an object attribute. So, calling length() is > very cheap since it only needs to retrieve the stored value for the > length. > * There is a constructor argument called -direct, which skips sequence > validation. However, it was only active in conjunction with the > -ref_to_seq argument. To make -direct conform better to its > documented purpose, I made -direct work when a sequence is set > through -seq as well. > * This brings us to trunc(), revcom() and other methods of > Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq > object from an existing (already validated!) Bio::PrimarySeq object, > the new object can be constructed with the -direct argument, to > save some time. > * Finally, I noticed that subseq() used calls to eval() to do its > work. eval() is notoriously slow and these calls were easily replaced > by simple calls to substr() to save some time. > > A real-world test I performed with Grinder took 3min28s before the > changes (of which ~1 min is spent doing something unrelated). After the > changes, the same test took only 2min28s. So, it's quite a significant > improvement and on more specific test cases, performance gains can > obviously be much bigger. Also, I anticipate that the gains would be > bigger for even longer sequences.
> > All the changes I made are meant to be backward compatible and all the > tests in the Bioperl test suite passed. So, there _should_ not be any > issues. However, I know that Bio::PrimarySeq is a central module of > Bioperl, so please, have a look at it and let me know if there are any > glaring errors. > > Thanks, > > Florent > From cjfields at illinois.edu Mon Nov 5 02:43:28 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 5 Nov 2012 02:43:28 +0000 Subject: [Bioperl-l] Bio::PrimarySeq speedup In-Reply-To: <50970C74.7070605@gmail.com> References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu> Florent, Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn): [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t t/Seq/PrimarySeq.t .. 1/167 --------------------- WARNING --------------------- MSG: Got a sequence without letters. Could not guess alphabet --------------------------------------------------- --------------------- WARNING --------------------- MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is ! 
--------------------------------------------------- --------------------- WARNING --------------------- MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $ --------------------------------------------------- --------------------- WARNING --------------------- MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is & --------------------------------------------------- --------------------- WARNING --------------------- MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+ --------------------------------------------------- --------------------- WARNING --------------------- MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/ --------------------------------------------------- t/Seq/PrimarySeq.t .. ok All tests successful. Files=1, Tests=167, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.18 cusr 0.01 csys = 0.23 CPU) Result: PASS chris On Nov 4, 2012, at 6:46 PM, Florent Angly wrote: > I am planning on merging the branch with master this week. > Best, > Florent > > > On 01/11/12 15:49, Florent Angly wrote: >> Hi all, >> >> I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on Bioperl. Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were microbial genomes, i.e. several Mbp long sequences, which is quite long. A lot of the performance cost was associated with passing full genomes between functions. For example, when doing a call to length(), length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So, every call to length is very expensive for long sequences. And there is a lot of code that calls length(), for error checking. >> >> I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g. 
Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength >> >> I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to improve the documentation. In addition, to address the speed issue, I made some changes to Bio::PrimarySeq and Bio::PrimarySeqI: >> * The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no extra call to seq() (which would incur the cost of copying the entire sequence between functions). >> * The length is saved as an object attribute. So, calling length() is very cheap since it only needs to retrieve the stored value for the length. >> * There is a constructor argument called -direct, which skips sequence validation. However, it was only active in conjunction with the -ref_to_seq argument. To make -direct conform better to its documented purpose, I made -direct work when a sequence is set through -seq as well. >> * This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq object from an existing (already validated!) Bio::PrimarySeq object, the new object can be constructed with the -direct argument, to save some time. >> * Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and these calls were easily replaced by simple calls to substr() to save some time. >> >> A real-world test I performed with Grinder took 3min28s before the changes (of which ~1 min is spent doing something unrelated). After the changes, the same test took only 2min28s.
So, it's quite a significant improvement and on more specific test cases, performance gains can obviously be much bigger. Also, I anticipate that the gains would be bigger for even longer sequences.
>>
>> All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of Bioperl, so please, have a look at it and let me know if there are any glaring errors.
>>
>> Thanks,
>>
>> Florent
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From shalabh.sharma7 at gmail.com Mon Nov 5 17:03:38 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 12:03:38 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
Message-ID:

Hi All,
thanks for all your responses.

Currently i am using the old version of blastall 2.2.22.

@Peter: I will update my blast and will see if the problem still exist. But i can't restrict my blast with e value because i work on environmental samples , i have to reduce the size of my blast files as i am only interested in the top hit and my data sets are really huge.

Thanks
Shalabh

On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J wrote:
> That in fact is the recommendation (migrate to BLAST+).
>
> chris
>
> On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" <Russell.Smithies at agresearch.co.nz> wrote:
>
> > What version of blast are you using?
> > There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+
> >
> > --Russell
> >
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma
> > Sent: Saturday, 3 November 2012 3:50 a.m.
> > To: Fields, Christopher J
> > Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov
> > Subject: Re: [Bioperl-l] blast question
> >
> > I know, i am really worried about my past analysis now.
> > Thanks a lot for cc'ing this mail Chris.
> >
> > -Shalabh
> >
> > On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:
> >
> >> That's a scary error, but the best place to submit this would be the
> >> BLAST help list at NCBI (cc'd)
> >>
> >> chris

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From Russell.Smithies at agresearch.co.nz Mon Nov 5 21:04:07 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Tue, 6 Nov 2012 10:04:07 +1300
Subject: [Bioperl-l] blast question
In-Reply-To:
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>

If you're using an older version of blast there was a bug where not all results were returned - I think the limit was 10,000 hits?
Not usually a problem running basic queries but a big problem for environmental or metagenomic samples, or when aligning short reads.

--Russell

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

From shalabh.sharma7 at gmail.com Mon Nov 5 21:09:03 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 16:09:03 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
References: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz> <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
Message-ID:

Hi All,
Thanks for all the suggestion. The problem is fixed by using latest blast+ .

Thanks
Shalabh

On Mon, Nov 5, 2012 at 4:04 PM, Smithies, Russell <Russell.Smithies at agresearch.co.nz> wrote:

> If you're using an older version of blast there was a bug where not all
> results were returned -
I think the limit was 10,000 hits?
>
> Not usually a problem running basic queries but a big problem for
> environmental or metagenomic samples, or when aligning short reads.
>
> --Russell

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From florent.angly at gmail.com Tue Nov 6 11:06:56 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Tue, 06 Nov 2012 21:06:56 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com> <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
Message-ID: <5098EF50.5040208@gmail.com>

Yes, good idea, Chris.

Actually, thinking about it, most of these warnings were redundant. So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that it issues exceptions if requested.
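
A minimal sketch of what "issues exceptions if requested" could look like in plain Perl. This is an illustration only, not the code in Bio::PrimarySeq: the function name validate_seq_sketch, the $throw flag and the allowed-character pattern are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical validator for illustration; NOT the Bio::PrimarySeq implementation.
# Assumed alphabet: letters plus a few gap and stop symbols. Anything else is invalid.
my $INVALID = qr/[^A-Za-z\-\.\*\?=~]+/;

# Returns 1 for a valid string and 0 for an invalid one.
# If $throw is true, an invalid string raises an exception instead of returning 0.
sub validate_seq_sketch {
    my ($seq, $throw) = @_;
    $seq = '' unless defined $seq;
    return 1 unless $seq =~ $INVALID;

    # Collect every run of offending characters for the error message.
    my $bad = join ',', $seq =~ /($INVALID)/g;
    die "sequence doesn't validate, mismatch is $bad\n" if $throw;
    return 0;
}

# Quiet check: the caller decides what to do with a false return value.
print "valid\n"   if validate_seq_sketch('ACGTN');
print "invalid\n" if !validate_seq_sketch('AC!GT');

# Strict check: the caller asks for an exception and traps it with block eval.
my $ok = eval { validate_seq_sketch('AC$GT', 1); 1 };
print "caught: $@" unless $ok;
```

The point of the optional flag is that the caller picks the failure mode: a quiet boolean for routine checks, or an exception when an invalid sequence should stop the program, without a warning being printed in either case.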
Florent

On 05/11/12 12:43, Fields, Christopher J wrote:
> Florent,
>
> Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):
>
> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t
> t/Seq/PrimarySeq.t .. 1/167
> --------------------- WARNING ---------------------
> MSG: Got a sequence without letters. Could not guess alphabet
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
> ---------------------------------------------------
> t/Seq/PrimarySeq.t .. ok
> All tests successful.
> Files=1, Tests=167, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.18 cusr 0.01 csys = 0.23 CPU)
> Result: PASS
>
> chris
>
> On Nov 4, 2012, at 6:46 PM, Florent Angly wrote:
>
>> I am planning on merging the branch with master this week.
>> Best,
>> Florent

From shlomif at shlomifish.org Tue Nov 6 12:27:00 2012
From: shlomif at shlomifish.org (Shlomi Fish)
Date: Tue, 6 Nov 2012 14:27:00 +0200
Subject: [Bioperl-l] [Request] Please Help Add Some Information about Perl for Bio-Informatics to http://perl-begin.org/uses/bio-info/
In-Reply-To: <20121026192203.6d1e59c0@lap.shlomifish.org>
References: <20121026192203.6d1e59c0@lap.shlomifish.org>
Message-ID: <20121106142700.192f456e@lap.shlomifish.org>

Hi,

Can anyone help with that?
Regards,

Shlomi Fish

On Fri, 26 Oct 2012 19:22:03 +0200 Shlomi Fish wrote:

> Hi all,
>
> I am the maintainer of http://perl-begin.org/ , the Perl Beginners' Site. I
> had this page there for a long time, but it's empty:
>
> http://perl-begin.org/uses/bio-info/
>
> Can someone help me add some information there? A short XHTML page will be OK.
> For reference, see the other pages in the section
> ( http://perl-begin.org/uses/ ) such as:
>
> * http://perl-begin.org/uses/web/
>
> * http://perl-begin.org/uses/sys-admin/
>
> * http://perl-begin.org/uses/qa/
>
> Note that you agree that the content will be licensed under the Creative
> Commons Attribution 3.0 Unported License (or higher versions) and so you
> should make sure it is original.
>
> I shall be obliged for any help.
>
> Regards,
>
> Shlomi Fish
>

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
Perl Humour - http://perl-begin.org/humour/

A wiseman can learn from a fool much more than a fool can ever learn from a wiseman.
    - http://en.wikiquote.org/wiki/Cato_the_Elder

Please reply to list if it's a mailing list post - http://shlom.in/reply .

From florent.angly at gmail.com Thu Nov 15 16:29:30 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Fri, 16 Nov 2012 02:29:30 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <5098EF50.5040208@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com> <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu> <5098EF50.5040208@gmail.com>
Message-ID: <50A5186A.4060304@gmail.com>

I now merged the branch with master.
Best,
Florent

On 06/11/12 21:06, Florent Angly wrote:
> Yes, good idea, Chris.
>
> Actually, thinking about it, most of these warnings were redundant.
> So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that
> it issues exceptions if requested.
>
> Florent

From mahakadry at aucegypt.edu Tue Nov 20 18:44:53 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Tue, 20 Nov 2012 20:44:53 +0200
Subject: [Bioperl-l] Parsing a blast report with multiple queries into separate one query files that only contain the fasta sequences
Message-ID:

Dear BioPerl list,
I blasted a file that has several fasta queries against nr, however I need to align each query with its hits for further computational analysis so I need to parse the produced blast report into several files that each has only the fasta query sequence and its hits in fasta format.
I found this script online,

use Bio::Search::Result::BlastResult;
use Bio::SearchIO;
my $report = Bio::SearchIO->new( -file=>'full-report.bls', -format => blast);
my $result = $report->next_result;
my %hits_by_query;
while (my $hit = $result->next_hit) {
    push @{$hits_by_query{$hit->name}}, $hit;
}
foreach my $qid ( keys %hits_by_query ) {
    my $result = Bio::Search::Result::BlastResult->new();
    $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
    my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format =>'blast' );
    $blio->write_result($result);
}

however on using it this produced the following error message

BlastResult::new(): Not adding iterations.

------------- EXCEPTION: Bio::Root::NoSuchThing -------------
MSG: No such iteration number: 0.
Valid range=1-0 VALUE: The number zero (0) STACK: Error::throw STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472 STACK: Bio::Search::Result::BlastResult::iteration /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327 STACK: Bio::Search::Result::BlastResult::add_hit /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257 STACK: ./parsing.blast.results.into.per.query.files.pl:15 I tried to search for other scripts but I couldn't find any I would really appreciate your comments to this Thank you From cjfields at illinois.edu Tue Nov 20 19:21:25 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 20 Nov 2012 19:21:25 +0000 Subject: [Bioperl-l] Parsing a blast report with multiple queries into separate one query files that only contain the fasta sequences In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF22E15@CITESMBX5.ad.uillinois.edu> Maha, Do you need only the sequence reported in the report (e.g. the HSP alignments) or the original FASTA sequences? The former can be recovered from the Bio::Search::HSP::GenericHSP objects as an alignment, and this can be redirected to a FASTA file. The latter is a little trickier, as you will have to retrieve the sequences from their original source files. chris On Nov 20, 2012, at 12:44 PM, maha ahmed wrote: > Dear BioPerl list, > I blasted a file that has several fasta queries against nr, however I need > to align each query with its hits for further computational analysis so I > need to parse the produced blast report into several files that each has > only the fasta query sequence and its hits in fasta format. 
> I found this script online, > > use Bio::Search::Result::BlastResult;use Bio::SearchIO; > my $report = Bio::SearchIO->new( -file=>'full-report.bls', -format > => blast);my $result = > $report->next_result;my %hits_by_query;while (my $hit = > $result->next_hit) { > push > @{$hits_by_query{$hit->name}}, $hit;} > foreach my $qid ( keys > %hits_by_query ) { > my $result = Bio::Search::Result::BlastResult->new(); > $result->add_hit($_) for ( @{$hits_by_query{$qid}} ); > my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format > =>'blast' ); > $blio->write_result($result);} > > > > however on using it this produced the following error message > > > > BlastResult::new(): Not adding iterations. > > ------------- EXCEPTION: Bio::Root::NoSuchThing ------------- > MSG: No such iteration number: 0. Valid range=1-0 > VALUE: The number zero (0) > STACK: Error::throw > STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472 > STACK: Bio::Search::Result::BlastResult::iteration > /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327 > STACK: Bio::Search::Result::BlastResult::add_hit > /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257 > STACK: ./parsing.blast.results.into.per.query.files.pl:15 > > I tried to search for other scripts but I couldn't find any > I would really appreciate your comments to this > Thank you > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From assayagy at gmail.com Thu Nov 1 00:02:41 2012 From: assayagy at gmail.com (eyla4ever) Date: Thu, 01 Nov 2012 00:02:41 -0000 Subject: [Bioperl-l] handle with file in perl Message-ID: <34626730.post@talk.nabble.com> hi i want to write a function that get as parameters : file_name, hsp , hit. and i want her to print all the blast Field that i need to this file. i do it because i have 2 files with the same Fields. 

10X
--
View this message in context: http://old.nabble.com/handle-with-file-in-perl-tp34626730p34626730.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From rfhorns at gmail.com Fri Nov 2 00:01:34 2012
From: rfhorns at gmail.com (Felix Horns)
Date: Fri, 02 Nov 2012 00:01:34 -0000
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
Message-ID:

Hello everyone.

I am having trouble using the get_Stream_by_query() function
in Bio::DB::GenBank. It seems to return an empty stream, such that
$stream->next_seq never returns anything.

However, $query->count is returning the expected value (139). Also,
get_Stream_by_query() seems to be querying the database, as when I pass it
an array of GeneIDs that have not been properly formatted, i.e.
GeneID:7816864, instead of simply 7816864, it returns warnings and errors:
"MSG: Warning(s) from GenBank: GeneID 7817709...; MSG:
Error from Genbank: No items found.".

I have included my full code below. I have also included the output from
the code below that. The code is intended to find genes located within a
genomic region. I will later find the protein domains and pathways that
those genes are involved in.

Any help would be greatly appreciated. I realize that this is probably a
very simple question, but I am relatively new to BioPerl and I've spent the
better part of the day trying to figure out such issues, so I would be very
thankful for help.

Felix


#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::DB::EntrezGene;
use Bio::DB::GenBank;

# Load reference sequence
# Load from local .gb file
# Note that .gb file does not include sequences
# my $gbfile = "NC_012660.1.gb";
# my $seqio = Bio::SeqIO->new(-file => $gbfile);
# my $ref_seq = $seqio->next_seq;

# To access reference sequence programatically, uncomment this code
my $gb = new Bio::DB::GenBank;
my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");

# Specify coordinates of gap
my $gap_start = 2050506;
my $gap_end = 2190530;

my $gene_count = 0;
my @features;
my @starts;
my @ends;
my @db_xrefs;

my @products;
my @protein_ids;

# Get gene features in gap
for my $feat ($ref_seq->get_SeqFeatures) {
    my $start=$feat->location->start;
    my $end=$feat->location->end;

    if (($feat->primary_tag eq 'gene') &
        ($gap_start < $start) & ($start < $gap_end) &
        ($gap_start < $end) & ($end < $gap_end)) {

        $gene_count += 1;

        # Get GeneID reference
        my $db_xref = ($feat->get_tag_values('db_xref'))[0];
        $db_xref =~ s/GeneID://; # Trim "GeneID:" from start of $db_xref

        push @features, $feat;
        push @starts, $start;
        push @ends, $end;
        push @db_xrefs, $db_xref;
    }
}

# Get data about gene features from GeneID reference
my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
                                         -ids => [@db_xrefs]);
my $stream = $gb->get_Stream_by_query($query);

while (my $seq = $stream->next_seq) {
    for my $feat ($seq->all_SeqFeatures) {
        print "primary tag: ", $feat->primary_tag, "\n";
        for my $tag ($feat->get_all_tags) {
            print " tag: ", $tag, "\n";
            for my $value ($feat->get_tag_values($tag)) {
                print " value: ", $value, "\n";
            }
        }
    }
}

print $query->count,"\n";
print $gene_count, "\n";


OUTPUT
> perl analyze_gap.pl
139
139

Note that no "primary tag; tag; value" items are printed. Furthermore,
when I put a print line immediately after the (while (my $seq =
$stream->next_seq)) statement, it was never called, seemingly indicating
that the stream is empty.
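
The gene-selection step in the script above can be sketched without any database access. This is only a minimal sketch under stated assumptions: each feature is reduced to a plain hash, the keys (tag, start, end, db_xref) and the helpers in_gap() and gene_id() are hypothetical names and not BioPerl API, and the coordinates and the extra db_xref values are illustrative. It differs from the script in two small ways: it uses logical && rather than bitwise &, and it picks the db_xref that actually carries a GeneID instead of assuming it is the first value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for sequence features: plain hashes with the
# primary tag, coordinates and db_xref values (illustrative data only).
my @features = (
    { tag => 'gene', start => 2050600, end => 2051500,
      db_xref => ['GeneID:7816864'] },
    { tag => 'CDS',  start => 2050600, end => 2051500,
      db_xref => ['GI:1', 'GeneID:7816864'] },
    { tag => 'gene', start => 2100000, end => 2100900,
      db_xref => ['PlaceholderDB:1', 'GeneID:7817709'] },  # GeneID not first
    { tag => 'gene', start => 2190000, end => 2191000,     # runs past gap end
      db_xref => ['GeneID:1'] },
);

# True if the feature lies strictly inside the gap.
# Assumes start <= end, so two comparisons are enough.
sub in_gap {
    my ($feat, $gap_start, $gap_end) = @_;
    return $feat->{start} > $gap_start && $feat->{end} < $gap_end;
}

# Return the numeric part of the first "GeneID:<digits>" value,
# or undef (in scalar context) when no db_xref is a GeneID.
sub gene_id {
    my @xrefs = @_;
    for my $xref (@xrefs) {
        return $1 if $xref =~ /^GeneID:(\d+)$/;
    }
    return;
}

my ($gap_start, $gap_end) = (2050506, 2190530);

my @gene_ids;
for my $feat (@features) {
    next unless $feat->{tag} eq 'gene' && in_gap($feat, $gap_start, $gap_end);
    my $id = gene_id(@{ $feat->{db_xref} });
    push @gene_ids, $id if defined $id;
}

print join(',', @gene_ids), "\n";   # prints 7816864,7817709
```

With real features, the hash lookups would be replaced by the accessor calls already used in the script (primary_tag, location->start, location->end, get_tag_values('db_xref')); the resulting list of bare numeric IDs is what the follow-up query for gene records expects.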

From mooldhu at gmail.com Tue Nov 6 07:38:57 2012
From: mooldhu at gmail.com (胡江)
Date: Tue, 6 Nov 2012 15:38:57 +0800
Subject: [Bioperl-l] Ask for help about Bioperl
Message-ID:

hi,
when I use bioperl, it reports errors like this:

--------------------- WARNING ---------------------
MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
---------------------------------------------------
Error providing evidence type: GeneModel
The error was:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Attempting to set the sequence '1' to [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383

but I am sure that the input file only contains [ATGCN]. I also tried other sequences, but the errors are the same. My bioperl is Bioperl-live 1.006902.

--
胡江

From assayagy at gmail.com Sat Nov 10 18:27:03 2012
From: assayagy at gmail.com (eyla4ever)
Date: Sat, 10 Nov 2012 10:27:03 -0800 (PST)
Subject: [Bioperl-l] Extracting sequences from Genbank files
In-Reply-To:
References:
Message-ID: <34664632.post@talk.nabble.com>

hello Brian

i would like you to send me your script, i think it can help me to solve a big problem and help me to finish my final project.
i hope it will be possible

regards
Eyla


BForde wrote:
>
> Hello,
>
> I have been modifying a script which extracts all the protein sequences
> from a genbank file and saves them in a multi-fasta file.
>
> I wish the fasta header to have both the locus_tag of the protein and the
> product. However I cannot get the product tag to write to the fasta
> header
>
> this is the relevant section of the script
>
> $s->display_id($f->has_tag('locus_tag') ? join(',',sort
> $f->each_tag_value('locus_tag')) :
> $f->has_tag('product') ?
> join(',',$f->each_tag_value('product')):
> $s->display_id);
>
> is "product" not an actual tag
>
> regards
>
> Brian
>
>
>
> --
> Brian Forde
> Microbiology Dept.
> Bioscience Institute. Room 4.11
> University College Cork
> Cork
> Ireland
> tel:+353 21 4901306
> email: b.m.forde at umail.ucc.ie
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

--
View this message in context: http://old.nabble.com/Extracting-sequences-from-Genbank-files-tp33901023p34664632.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From bosborne11 at verizon.net Tue Nov 20 23:50:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:50:00 -0500
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
In-Reply-To:
References:
Message-ID: <5F077DEA-DEBD-42BC-87E7-327697764CFE@verizon.net>

Felix,

I took a look at the Bio::DB::Query::GenBank documentation, it says this:

If you provide an array reference of IDs in -ids, the query will be ignored and the list of IDs will be used when the query is passed to a Bio::DB::GenBank object's get_Stream_by_query() method.

Bio::DB::GenBank queries "nucleotide", by default. You have GeneIDs. I see that you're setting "-db" to "gene" but note that you're passing that query to a plain Bio::DB::GenBank object. Not sure what the expected behavior is here.

I would try using the NCBI Eutilities for that second query, rather than Bio::DB::Query::GenBank (http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook).

Brian O.

On Nov 1, 2012, at 8:01 PM, Felix Horns wrote:

> Hello everyone.
>
> I am having trouble using the get_Stream_by_query() function
> in Bio::DB::GenBank. It seems to return an empty stream, such that
> $stream->next_seq never returns anything.
>
> However, $query->count is returning the expected value (139). Also,
> get_Stream_by_query() seems to be querying the database, as when I pass it
> an array of GeneIDs that have not been properly formatted, i.e.
> GeneID:7816864, instead of simply 7816864, it returns warnings and errors:
> "MSG: Warning(s) from GenBank: GeneID 7817709...; MSG:
> Error from Genbank: No items found.".
>
> I have included my full code below. I have also included the output from
> the code below that. The code is intended to find genes located within a
> genomic region. I will later find the protein domains and pathways that
> those genes are involved in.
>
> Any help would be greatly appreciated. I realize that this is probably a
> very simple question, but I am relatively new to BioPerl and I've spent the
> better part of the day trying to figure out such issues, so I would be very
> thankful for help.
>
> Felix
>
>
> #!/usr/bin/perl
> use strict;
> use Bio::SeqIO;
> use Bio::DB::EntrezGene;
> use Bio::DB::GenBank;
>
> # Load reference sequence
> # Load from local .gb file
> # Note that .gb file does not include sequences
> # my $gbfile = "NC_012660.1.gb";
> # my $seqio = Bio::SeqIO->new(-file => $gbfile);
> # my $ref_seq = $seqio->next_seq;
>
> # To access reference sequence programatically, uncomment this code
> my $gb = new Bio::DB::GenBank;
> my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");
>
> # Specify coordinates of gap
> my $gap_start = 2050506;
> my $gap_end = 2190530;
>
> my $gene_count = 0;
> my @features;
> my @starts;
> my @ends;
> my @db_xrefs;
>
> my @products;
> my @protein_ids;
>
> # Get gene features in gap
> for my $feat ($ref_seq->get_SeqFeatures) {
>     my $start=$feat->location->start;
>     my $end=$feat->location->end;
>
>     if (($feat->primary_tag eq 'gene') &
>         ($gap_start < $start) & ($start < $gap_end) &
>         ($gap_start < $end) & ($end < $gap_end)) {
>
>         $gene_count += 1;
>
>         # Get GeneID reference
>         my $db_xref = ($feat->get_tag_values('db_xref'))[0];
>         $db_xref =~ s/GeneID://; # Trim "GeneID:" from start of $db_xref
>
>         push @features, $feat;
>         push @starts, $start;
>         push @ends, $end;
>         push @db_xrefs, $db_xref;
>     }
> }
>
> # Get data about gene features from GeneID reference
> my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
>                                          -ids => [@db_xrefs]);
> my $stream = $gb->get_Stream_by_query($query);
>
> while (my $seq = $stream->next_seq) {
>     for my $feat ($seq->all_SeqFeatures) {
>         print "primary tag: ", $feat->primary_tag, "\n";
>         for my $tag ($feat->get_all_tags) {
>             print " tag: ", $tag, "\n";
>             for my $value ($feat->get_tag_values($tag)) {
>                 print " value: ", $value, "\n";
>             }
>         }
>     }
> }
>
> print $query->count,"\n";
> print $gene_count, "\n";
>
>
> OUTPUT
>> perl analyze_gap.pl
> 139
> 139
>
> Note that no "primary tag; tag; value" items are printed. Furthermore,
> when I put a print line immediately after the (while (my $seq =
> $stream->next_seq)) statement, it was never called, seemingly indicating
> that the stream is empty.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From bosborne11 at verizon.net Tue Nov 20 23:52:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:52:00 -0500
Subject: [Bioperl-l] Ask for help about Bioperl
In-Reply-To:
References:
Message-ID: <3B472C83-7F6C-41E2-A629-E6A2BDC6B075@verizon.net>

胡江,

You're going to have to show us your code, we can't help you just by seeing the error messages. Show us the input file as well, or the beginning of it.

Brian O.

On Nov 6, 2012, at 2:38 AM, 胡江 wrote:

> hi,
> when I use bioperl, it reports errors like this:
>
> --------------------- WARNING ---------------------
> MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
> ---------------------------------------------------
> Error providing evidence type: GeneModel
> The error was:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Attempting to set the sequence '1' to
> [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
> STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
> STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
> STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383
>
>
> but I am sure that the input file only contains [ATGCN]. I also tried
> other sequences, but the errors are the same. My bioperl is Bioperl-live
> 1.006902.
>
> --
> 胡江
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From hlapp at drycafe.net Wed Nov 21 02:24:50 2012
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Tue, 20 Nov 2012 21:24:50 -0500
Subject: [Bioperl-l] handle with file in perl
In-Reply-To: <34626730.post@talk.nabble.com>
References: <34626730.post@talk.nabble.com>
Message-ID: <1DE09B34-5124-478C-8925-0045EC119CFC@drycafe.net>

This sounds like a homework assignment. We're not here to do your homework or assignments for you. You can post if you run into a specific problem when solving your assignment with Bioperl, and we'll help with that.

-hilmar

Sent with a tap.

On Oct 31, 2012, at 7:45 PM, eyla4ever wrote:

>
> hi
>
> i want to write a function that get as parameters : file_name, hsp , hit.
> and i want her to print all the blast Field that i need to this file.
>
> i do it because i have 2 files with the same Fields.
>
>
> 10X
> --
> View this message in context: http://old.nabble.com/handle-with-file-in-perl-tp34626730p34626730.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From mahakadry at aucegypt.edu Sat Nov 24 01:33:59 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 03:33:59 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID:

Dear Bioperl list,
I have a folder that has 60,000 files (one file for each phylogenetic tree)
However I only need to work with a subset of 1,000 files from that folder
(the files are not numbered in order so I can't use the i++ loop in my
bioperl script)
Is there a way to write a script that only moves files with the names given
in a list in a text file
i.e. I have a file that has the names of the files I want to copy from the
folder and I want to write a script that does this
Thank you so much

From kellert at ohsu.edu Sat Nov 24 18:08:11 2012
From: kellert at ohsu.edu (Tom Keller)
Date: Sat, 24 Nov 2012 10:08:11 -0800
Subject: [Bioperl-l] use cookbook to work with a directory of files
In-Reply-To:
References:
Message-ID:

A search with the phrase "perl cookbook filenames from directory" should help you find what you need.

On Nov 24, 2012, at 9:00 AM, bioperl-l-request at lists.open-bio.org wrote:

> Send Bioperl-l mailing list submissions to
> bioperl-l at lists.open-bio.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> or, via email, send a message with subject or body 'help' to
> bioperl-l-request at lists.open-bio.org
>
> You can reach the person managing the list at
> bioperl-l-owner at lists.open-bio.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Bioperl-l digest..."
>
>
> Today's Topics:
>
> 1. retrieving a subset of files from a folder (maha ahmed)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 24 Nov 2012 03:33:59 +0200
> From: maha ahmed
> Subject: [Bioperl-l] retrieving a subset of files from a folder
> To: Bioperl-l at lists.open-bio.org
> Message-ID:
>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Dear Bioperl list,
> I have a folder that has 60,000 files (one file for each phylogenetic tree)
> However I only need to work with a subset of 1,000 files from that folder
> (the files are not numbered in order so I cant use the i++ loop in my
> bioperl script)
> Is there a way to write a script that only moves files with the names given
> in a list in a text file
> i.e. I have a file that has the names of the files I want to copy from the
> folder and I want to write script that does this
> Thank you so much
>
>
> ------------------------------
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> End of Bioperl-l Digest, Vol 115, Issue 8
> *****************************************

From minou.nowrousian at rub.de Sat Nov 24 18:24:02 2012
From: minou.nowrousian at rub.de (Minou Nowrousian)
Date: 24 Nov 2012 19:24:02 +0100
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <000001cdca70$e1a97720$a4fc6560$@rub.de>

>Dear Bioperl list,
>I have a folder that has 60,000 files (one file for each phylogenetic tree) However I only need to work with a subset of 1,000 files from that folder
>(the files are not numbered in order so I cant use the i++ loop in my bioperl script) Is there a way to write a script that only moves files with the
>names given in a list in a text file i.e. I have a file that has the names of the files I want to copy from the folder and I want to write script that does
>this Thank you so much

I don't know if there is a BioPerl solution, but you could use the File::Copy module (available from CPAN):

use File::Copy;
copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy failed: $!";

Regards,
Minou

From mahakadry at aucegypt.edu Sat Nov 24 19:04:09 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 21:04:09 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
In-Reply-To: <000001cdca70$e1a97720$a4fc6560$@rub.de>
References: <000001cdca70$e1a97720$a4fc6560$@rub.de>
Message-ID:

Thanks everyone,
I actually found a one line command that I am going to try:

xargs -a file_list.txt mv -t /path/to/des

thanks for your help
I will have a look at the readings you suggested
thank you

On Sat, Nov 24, 2012 at 8:24 PM, Minou Nowrousian wrote:

>
> >Dear Bioperl list,
> >I have a folder that has 60,000 files (one file for each phylogenetic
> tree)
> However I only need to work with a subset of 1,000 files from that folder
> >(the files are not numbered in order so I cant use the i++ loop in my
> bioperl script) Is there a way to write a script that only moves files with
> the >names given in a list in a text file i.e. I have a file that has the
> names of the files I want to copy from the folder and I want to write
> script that does >this Thank you so much
>
> I don't know if there is a BioPerl solution, but you could use the
> File::Copy module (available from CPAN):
>
> use File::Copy;
> copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy
> failed: $!";
>
> Regards,
> Minou
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From maj at fortinbras.us Tue Nov 27 13:49:46 2012
From: maj at fortinbras.us (Mark A. Jensen)
Date: Tue, 27 Nov 2012 13:49:46 +0000
Subject: [Bioperl-l] Neo4j : applying user defined validation and constraints
Message-ID:

Hi Folks,

Since there was some enthusiasm about REST::Neo4p, my interface to Neo4j, I thought I would let you know about

https://metacpan.org/module/REST::Neo4p::Constrain

This is a framework that lets you apply constraints on node and relationship properties, relationships, and relationship types. You can specify your constraints, and have REST::Neo4p throw exceptions when the constraints aren't met, or you can do validation on existing database items. The pod has a full explanation and examples aplenty. Please have a look and send bugs my way via RT.

Cheers all,
MAJ

From francescomusacchia at gmail.com Wed Nov 28 10:27:16 2012
From: francescomusacchia at gmail.com (Francesco Musacchia)
Date: Wed, 28 Nov 2012 02:27:16 -0800 (PST)
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
Message-ID: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>

Hi all,

I have a big problem with using a GFF3 database with BioPerl. This is not a question about how to write some BioPerl code. I'm finding that when I have to do a lot of accesses on a GFF database (with Bio::DB::SeqFeature::Store), the slowness increases until my script can stay running for more than a day.

How can I solve it? Or can it not be done?

Thanks a lot!