From florent.angly at gmail.com  Thu Nov  1 01:49:13 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Thu, 01 Nov 2012 15:49:13 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
Message-ID: <50920D59.4010307@gmail.com>

Hi all,

I was working with Ben Woodcroft on identifying ways to speed up 
Grinder, which relies heavily on Bioperl. Ben did some profiling with 
NYTProf and we realized that a lot of computation time was spent in 
Bio::PrimarySeq, in calls to subseq() and length(). The sequences we 
used for the profiling were microbial genomes, i.e. sequences several 
Mbp long. A lot of the performance cost was associated with passing 
full genomes between functions. For example, length() requests the 
full sequence from seq(), which returns a copy of it to length(). So, 
every call to length() is very expensive for long sequences, and a lot 
of code calls length() for error checking.

I know that there are a few Bioperl modules that are more adapted to 
handling very long sequences, e.g. Bio::DB::Fasta or 
Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at 
Bio::PrimarySeq with Ben and I released this commit: 
https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
But in fact, there were more things that I wanted to try to improve, 
which led me to start this new branch: 
https://github.com/bioperl/bioperl-live/tree/seqlength

I wrote quite a few tests for functionality that was not previously 
covered by tests, and tried to improve the documentation. In addition, 
to address the speed issue, I made some changes to Bio::PrimarySeq and 
Bio::PrimarySeqI:
* The length of a sequence is now computed as soon as the sequence is 
set, not after. This way, there is no extra call to seq() (which would 
incur the cost of copying the entire sequence between functions).
* The length is saved as an object attribute. So, calling length() is 
very cheap since it only needs to retrieve the stored value for the length.
* There is a constructor argument called -direct, which skips sequence 
validation. However, it was only active in conjunction with the 
-ref_to_seq argument. To make -direct conform better to its 
documented purpose, I made -direct work when a sequence is set 
through -seq as well.
* This brings us to trunc(), revcom() and other methods of 
Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq 
object from an existing (already validated!) Bio::PrimarySeq object, the 
new object can be constructed with the -direct argument, to save some 
time.
* Finally, I noticed that subseq() used calls to eval() to do its work. 
eval() is notoriously slow and these calls were easily replaced by 
simple calls to substr() to save some time.
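
As a rough illustration of the length-caching and substr() points above, here is a minimal sketch. It is a toy class and not the actual Bio::PrimarySeq code: the package name ToySeq, its hash keys and its error message are all made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical toy class, NOT Bio::PrimarySeq. It only illustrates
# storing the length at the moment the sequence is set.
package ToySeq;

sub new {
    my ($class, %args) = @_;
    my $self = bless { seq => '', length => 0 }, $class;
    $self->seq($args{seq}) if defined $args{seq};
    return $self;
}

# Getter/setter: the length is computed once, when the sequence is stored.
sub seq {
    my ($self, $value) = @_;
    if (defined $value) {
        $self->{seq}    = $value;
        $self->{length} = CORE::length($value);
    }
    return $self->{seq};
}

# Cheap: returns the stored number and never touches the sequence string.
sub length {
    my ($self) = @_;
    return $self->{length};
}

# 1-based, inclusive coordinates, extracted with substr() directly on
# the stored string (no eval, no copy of the whole sequence).
sub subseq {
    my ($self, $start, $end) = @_;
    die "Bad coordinates $start..$end\n"
        if $start < 1 || $end > $self->{length} || $start > $end;
    return substr($self->{seq}, $start - 1, $end - $start + 1);
}

package main;

my $seq = ToySeq->new(seq => 'ACGTACGTAC');
print "length: ", $seq->length, "\n";
print "subseq: ", $seq->subseq(3, 6), "\n";
```

In this sketch the cost of measuring the string is paid once per call to the setter, so repeated length() calls (for example in coordinate checks) only read a stored number.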

A real-world test I performed with Grinder took 3min28s before the changes 
(of which ~1 min is spent doing something unrelated). After the changes, the 
same test took only 2min28s. So, it's quite a significant improvement, 
and on more specific test cases the performance gains can be much 
bigger. Also, I anticipate that the gains would be bigger for even 
longer sequences.

All the changes I made are meant to be backward compatible and all the 
tests in the Bioperl test suite passed. So, there _should_ not be any 
issues. However, I know that Bio::PrimarySeq is a central module of 
Bioperl, so please, have a look at it and let me know if there are any 
glaring errors.

Thanks,

Florent


From shalabh.sharma7 at gmail.com  Thu Nov  1 15:36:35 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Thu, 1 Nov 2012 15:36:35 -0400
Subject: [Bioperl-l] blast question
Message-ID: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>

Hi All,
First of all, I am really sorry for posting a BLAST question in
this forum; I am not sure if this is the right place.
I would really appreciate it if anyone could point me in the right direction.

I am using blastall to get the top hit from a database, so I am using -v 1 -b
1 (I hope this is right).
But the strange part is that I am getting wrong results.

For example, if I use -v 1 -b 1, then for one of the queries I am getting this:


Sequences producing significant alignments:                      (bits) Value

fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   4e-04


If I use -v 3 -b 3, then I am getting this for the same query:

Sequences producing significant alignments:                      (bits) Value

fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...   570   e-167
fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   9e-07
fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...    18   1.0

As you can see, the top hit in the first case is totally wrong.

I would really appreciate it if someone could help me out or point me in the
right direction.

Thanks
Shalabh


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From cjfields at illinois.edu  Thu Nov  1 17:41:43 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Nov 2012 21:41:43 +0000
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>

That's a scary error, but the best place to submit this would be the BLAST help list at NCBI (cc'd)

chris

> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From shalabh.sharma7 at gmail.com  Fri Nov  2 10:50:17 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Fri, 2 Nov 2012 10:50:17 -0400
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
Message-ID: <CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>

I know; I am really worried about my past analyses now.
Thanks a lot for cc'ing this mail, Chris.

-Shalabh



-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From Scott.Markel at accelrys.com  Fri Nov  2 20:13:59 2012
From: Scott.Markel at accelrys.com (Scott Markel)
Date: Fri, 2 Nov 2012 17:13:59 -0700
Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB
 file format specification change
Message-ID: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>

In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html).  PDB now writes out to column 79, while pdb.pm is still using the old line length of 71.  Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated.

Some of the Perl lines are really simple, e.g.,

	$keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);

with others being just a little more detailed, e.g.,

	my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;

It doesn't look like pdb.pm has changed in about 1.5 years.  Is there a current module owner?  Or someone else working on this?

If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file.  Please let us know which is preferred.
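
To make the truncation concrete, here is a small sketch of how the width in an unpack template limits what is read from a fixed-width record. The "A51" template is the one quoted above; the "A60" variant, the helper name parse_record and the sample line are assumptions for illustration only, based on a text field that starts at column 20 and should run to column 79 (79 - 20 + 1 = 60). This is not a patch for pdb.pm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a fixed-width record into record name,
# sub-record, continuation field and free text. $text_width is the
# number of columns read for the text field, starting at column 20.
sub parse_record {
    my ($line, $text_width) = @_;
    my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A$text_width", $line;
    return ($rec, $subr, $cont, $rol);
}

# Made-up record: 51 filler characters (columns 20-70) followed by
# 9 more characters (columns 71-79).
my $text = 'X' x 51 . 'TRUNCATED';
my $line = sprintf('%-6s%-6s%-4s%-2s %s', 'JRNL', '', 'TITL', '', $text);

my @old = parse_record($line, 51);   # stops at column 70
my @new = parse_record($line, 60);   # reads through column 79

printf "old template read %d characters of text\n", length $old[3];
printf "new template read %d characters of text\n", length $new[3];
```

With the narrower template, everything after column 70 is silently dropped, which matches the truncated journal titles described above.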

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications and Communications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics


From cjfields at illinois.edu  Fri Nov  2 22:08:52 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 3 Nov 2012 02:08:52 +0000
Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB
 file format specification change
In-Reply-To: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>
References: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC817F8@CHIMBX5.ad.uillinois.edu>

On Nov 2, 2012, at 7:13 PM, Scott Markel <Scott.Markel at accelrys.com> wrote:

> In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html).  PDB now writes out to column 79, while pdb.pm is still using the old line length of 71.  Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated.
> 
> Some of the Perl lines are really simple, e.g.,
> 
> 	$keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);
> 
> with others being just a little more detailed, e.g.,
> 
> 	my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;
> 
> It doesn't look like pdb.pm has changed in about 1.5 years.  Is there a current module owner?  Or someone else working on this?

No one has really taken ownership, so as far as I'm concerned it's open.  Any objections?

> If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file.  Please let us know which is preferred.

A new version of the file is fine if you have someone who can work on it.  We would also like to change relevant tests and documentation if there is time.

> Scott
> 
> Scott Markel, Ph.D.
> Principal Bioinformatics Architect  email:  smarkel at accelrys.com
> Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
> 10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
> San Diego, CA 92121                 fax:    +1 858 799 5222
> USA                                 web:    http://www.accelrys.com
> 
> http://www.linkedin.com/in/smarkel
> Secretary, Board of Directors:
>     International Society for Computational Biology
> Chair: ISCB Publications and Communications Committee
> Associate Editor: PLoS Computational Biology
> Editorial Board: Briefings in Bioinformatics

Thanks Scott!

chris


From Russell.Smithies at agresearch.co.nz  Sun Nov  4 16:00:37 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 5 Nov 2012 10:00:37 +1300
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>

What version of blast are you using?
There have been quite a few bug fixes, and I suspect any response from NCBI will suggest upgrading to the current version of BLAST+.


--Russell


=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================


From cjfields at illinois.edu  Sun Nov  4 17:13:37 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 4 Nov 2012 22:13:37 +0000
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>

That in fact is the recommendation (migrate to BLAST+).

chris



From florent.angly at gmail.com  Sun Nov  4 19:46:44 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 05 Nov 2012 10:46:44 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50920D59.4010307@gmail.com>
References: <50920D59.4010307@gmail.com>
Message-ID: <50970C74.7070605@gmail.com>

I am planning on merging the branch with master this week.
Best,
Florent




From cjfields at illinois.edu  Sun Nov  4 21:43:28 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 5 Nov 2012 02:43:28 +0000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50970C74.7070605@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>

Florent,

I ran the tests on it; they pass, but I am seeing these warnings (if they are expected, you can catch them using Test::Warn):

[cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t 
t/Seq/PrimarySeq.t .. 1/167 
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
---------------------------------------------------
t/Seq/PrimarySeq.t .. ok       
All tests successful.
Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
Result: PASS
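
For reference, the general idea of catching expected warnings in a test can be sketched with core Perl only, using a local $SIG{__WARN__} handler. Test::Warn offers a packaged way to write this kind of check. The names capture_warnings and toy_validate are hypothetical, and toy_validate is only a stand-in that warns on non-letter characters. It is not the Bio::PrimarySeq validation code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;

# Run a code block and return the warning messages it emitted,
# instead of letting them reach STDERR.
sub capture_warnings {
    my ($code) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    $code->();
    return @warnings;
}

# Hypothetical stand-in for a validating setter: warns and returns
# false when the string contains anything other than letters.
sub toy_validate {
    my ($seq) = @_;
    if ($seq =~ /([^A-Za-z]+)/) {
        warn "sequence doesn't validate, mismatch is $1\n";
        return 0;
    }
    return 1;
}

my $ok;
my @w = capture_warnings(sub { $ok = toy_validate('ACGT!') });

ok(!$ok, 'invalid sequence is rejected');
is(scalar @w, 1, 'exactly one warning was emitted');
like($w[0], qr/mismatch is !/, 'warning names the offending character');

done_testing();
```

Checked this way, an expected warning becomes an assertion in the test rather than noise in the prove output.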

chris

On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
 wrote:

> I am planning on merging the branch with master this week.
> Best,
> Florent
> 
> 
> On 01/11/12 15:49, Florent Angly wrote:
>> Hi all,
>> 
>> I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on Bioperl. Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were microbial genomes, i.e. several Mbp long sequences, which is quite long. A lot of the performance cost was associated with passing full genomes between functions. For example, when doing a call to length(), length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So, every call to length is very expensive for long sequences. And there is a lot of code that calls length(), for error checking.
>> 
>> I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g. Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength
>> 
>> I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to improve the documentation. In addition, to address the speed issue, I did some changes to Bio::PrimarySeq and Bio::PrimarySeqI :
>> ? The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no extra call to seq() (which would incur the cost of copying the entire sequence between functions).
>> ? The length is saved as an object attribute. So, calling length() is very cheap since it only needs to retrieve the stored value for the length.
>> ? There is a constructor called -direct, which skips sequence validation. However, it was only active in conjunction with the -ref_to_seq constructor. To make -direct conform better to its documented purpose, I made it -direct work when a sequence is set through -seq as well.
>> ? This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq object from an existing (already validated!) Bio::PrimarySeq object, the new object can be constructed with the -direct constructor, to save some time.
>> ? Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and these calls were easily replaced by simple calls to substr() to save some time.
>> 
>> A real-world test I performed with Grinder took 3m28s before the changes (and ~1 min is spent doing something unrelated). After the changes, the same test took only 2min28s. So, it's quite a significant improvement and on more specific test cases, performance gains can obviously be much bigger. Also, I anticipate that the gains would be bigger for even longer sequences.
>> 
>> All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of Bioperl, so please, have a look at it and let me know if there are any glaring errors.
>> 
>> Thanks,
>> 
>> Florent
>> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
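
For readers following the thread, the length caching described in the quoted list above can be sketched in plain Perl. This is only an illustration under stated assumptions: ToySeq, length_via_seq() and the hash keys are hypothetical names, and the code is not the actual Bio::PrimarySeq implementation.

```perl
use strict;
use warnings;

# Hypothetical toy class (an assumption for illustration),
# not the real Bio::PrimarySeq code.
package ToySeq;

sub new {
    my ($class, %args) = @_;
    my $self = bless { seq => '', length => 0 }, $class;
    $self->seq($args{seq}) if defined $args{seq};
    return $self;
}

# Getter/setter: the length is computed once, at the moment the
# sequence is set, and stored as an object attribute.
sub seq {
    my ($self, $value) = @_;
    if (defined $value) {
        $self->{seq}    = $value;
        $self->{length} = CORE::length($value);
    }
    return $self->{seq};    # returning the string hands back a copy
}

# Slow variant: fetches the whole string from seq() just to measure it.
sub length_via_seq {
    my ($self) = @_;
    return CORE::length($self->seq);
}

# Fast variant: only reads the stored attribute, no string copy.
sub length {
    my ($self) = @_;
    return $self->{length};
}

package main;

my $genome = ToySeq->new(seq => 'ACGT' x 1_000_000);    # 4 Mbp toy genome
print $genome->length,         "\n";    # 4000000
print $genome->length_via_seq, "\n";    # 4000000
```

Both methods return the same number; the difference is that length() never touches the sequence string, which is what matters when it is called repeatedly for error checking on Mbp-sized sequences.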


From shalabh.sharma7 at gmail.com  Mon Nov  5 12:03:38 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 12:03:38 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
Message-ID: <CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>

Hi All,
         thanks for all your responses.

Currently I am using an old version of blastall, 2.2.22.

@Peter: I will update my BLAST and see if the problem still exists. But
I can't restrict my BLAST search by E-value because I work on environmental
samples; I have to reduce the size of my BLAST output files, as I am only
interested in the top hit and my data sets are really huge.

Thanks
Shalabh

On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> That in fact is the recommendation (migrate to BLAST+).
>
> chris
>
> On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" <Russell.Smithies at agresearch.co.nz> wrote:
>
> > What version of blast are you using?
> > There have been quite a few bug fixes and I suspect any responses from
> > NCBI will suggest upgrading to the current version of blast+
> >
> >
> > --Russell
> >
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma
> > Sent: Saturday, 3 November 2012 3:50 a.m.
> > To: Fields, Christopher J
> > Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov
> > Subject: Re: [Bioperl-l] blast question
> >
> > I know, i am really worried about my past analysis now.
> > Thanks a lot for cc'ing this mail Chris.
> >
> > -Shalabh
> >
> > On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:
> >
> >> That's a scary error, but the best place to submit this would be the
> >> BLAST help list at NCBI (cc'd)
> >>
> >> chris
> >>
> >> On Nov 1, 2012, at 2:36 PM, shalabh sharma <shalabh.sharma7 at gmail.com>
> >> wrote:
> >>
> > >>> Hi All,
> > >>>         First of all i am really very sorry for posting blast question in
> > >>> this forum, I am not sure if this is the right place.
> > >>> I will really appreciate if anyone can guide me to the right direction.
> > >>>
> > >>> I am using blastall to get a top hit from a database so i am using -v 1 -b 1
> > >>> (i hope this is right).
> > >>> But the strange part is that i am getting wrong results.
> > >>>
> > >>> for example: if i use -v 1 -b 1 then for one of the hit i am getting this:
> > >>>
> > >>> Sequences producing significant alignments:                      (bits) Value
> > >>>
> > >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   4e-04
> > >>>
> > >>> If i use -v 3 -b 3 then i am getting this for the same query:
> > >>>
> > >>> Sequences producing significant alignments:                      (bits) Value
> > >>>
> > >>> fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...   570   e-167
> > >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   9e-07
> > >>> fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...    18   1.0
> > >>>
> > >>> As you can see the top hit in the first case is totally wrong.
> > >>>
> > >>> I would really appreciate if someone can help me out, or direct to
> > >>> in the right direction.
> >>>
> >>> Thanks
> >>> Shalabh
> >>>
> >>>
> >>>
> >>> --
> >>> Shalabh Sharma
> >>> Scientific Computing Professional Associate (Bioinformatics
> >>> Specialist) Department of Marine Sciences University of Georgia
> >>> Athens, GA 30602-3636
> >>
> >>
> >
> >
> > --
> > Shalabh Sharma
> > Scientific Computing Professional Associate (Bioinformatics Specialist)
> Department of Marine Sciences University of Georgia Athens, GA 30602-3636
> >
> > =======================================================================
> > Attention: The information contained in this message and/or attachments
> > from AgResearch Limited is intended only for the persons or entities
> > to which it is addressed and may contain confidential and/or privileged
> > material. Any review, retransmission, dissemination or other use of, or
> > taking of any action in reliance upon, this information by persons or
> > entities other than the intended recipients is prohibited by AgResearch
> > Limited. If you have received this message in error, please notify the
> > sender immediately.
> > =======================================================================
> >
>
>


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From Russell.Smithies at agresearch.co.nz  Mon Nov  5 16:04:07 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Tue, 6 Nov 2012 10:04:07 +1300
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>

If you're using an older version of blast there was a bug where not all results were returned - I think the limit was 10,000 hits?
Not usually a problem running basic queries but a big problem for environmental or metagenomic samples, or when aligning short reads.

--Russell




From shalabh.sharma7 at gmail.com  Mon Nov  5 16:09:03 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 16:09:03 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
Message-ID: <CAA7rn9ech--TYQdezH6fLLArjsTdypkNjSkQeO1LiaLTR1zoHQ@mail.gmail.com>

Hi All,
           Thanks for all the suggestions. The problem is fixed by using
the latest BLAST+.
Thanks
Shalabh

On Mon, Nov 5, 2012 at 4:04 PM, Smithies, Russell <Russell.Smithies at agresearch.co.nz> wrote:

> If you're using an older version of blast there was a bug where not all
> results were returned - I think the limit was 10,000 hits?
>
> Not usually a problem running basic queries but a big problem for
> environmental or metagenomic samples, or when aligning short reads.
>
> --Russell


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From florent.angly at gmail.com  Tue Nov  6 06:06:56 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Tue, 06 Nov 2012 21:06:56 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
Message-ID: <5098EF50.5040208@gmail.com>

Yes, good idea, Chris.

Actually, thinking about it, most of these warnings were redundant. So, 
I changed the behaviour of Bio::PrimarySeq::validate_seq() so that it 
issues exceptions if requested.
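
As a rough sketch of that behaviour (return a boolean by default, raise an exception only when asked), assuming a DNA-only alphabet: validate_dna(), its $throw flag and the error text are hypothetical and only mirror the wording of the warnings quoted below; this is not the code in Bio::PrimarySeq.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): returns 1 for a
# valid sequence, 0 for an invalid one, or dies if $throw is true.
sub validate_dna {
    my ($seq, $throw) = @_;
    return 1 if $seq =~ /^[ACGTN]*$/i;
    if ($throw) {
        # Collect every run of characters outside the assumed alphabet.
        my @bad = $seq =~ /([^ACGTN]+)/ig;
        die "sequence doesn't validate, mismatch is " . join(',', @bad) . "\n";
    }
    return 0;
}

print validate_dna('ACGTN'), "\n";    # valid sequence
print validate_dna('AC!GT'), "\n";    # invalid, but no exception requested

# With the flag set, the caller decides how to handle the failure.
my $ok = eval { validate_dna('AC!GT$', 1) };
print "caught: $@" unless $ok;
```

Keeping the quiet boolean as the default means callers that only want a yes/no answer produce no output, while callers that need a hard failure opt in explicitly.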

Florent


On 05/11/12 12:43, Fields, Christopher J wrote:
> Florent,
>
> Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):
>
> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t
> t/Seq/PrimarySeq.t .. 1/167
> --------------------- WARNING ---------------------
> MSG: Got a sequence without letters. Could not guess alphabet
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
> ---------------------------------------------------
> t/Seq/PrimarySeq.t .. ok
> All tests successful.
> Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
> Result: PASS
>
> chris
>
> On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
>   wrote:
>
>> I am planning on merging the branch with master this week.
>> Best,
>> Florent
>>
>>
>> On 01/11/12 15:49, Florent Angly wrote:
>>> Hi all,
>>>
>>> I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on Bioperl. Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were microbial genomes, i.e. several Mbp long sequences, which is quite long. A lot of the performance cost was associated with passing full genomes between functions. For example, when doing a call to length(), length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So, every call to length is very expensive for long sequences. And there is a lot of code that calls length(), for error checking.
>>>
>>> I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g. Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength
>>>
>>> I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to improve the documentation. In addition, to address the speed issue, I did some changes to Bio::PrimarySeq and Bio::PrimarySeqI :
>>> * The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no extra call to seq() (which would incur the cost of copying the entire sequence between functions).
>>> * The length is saved as an object attribute. So, calling length() is very cheap since it only needs to retrieve the stored value for the length.
>>> * There is a constructor called -direct, which skips sequence validation. However, it was only active in conjunction with the -ref_to_seq constructor. To make -direct conform better to its documented purpose, I made -direct work when a sequence is set through -seq as well.
>>> * This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq object from an existing (already validated!) Bio::PrimarySeq object, the new object can be constructed with the -direct constructor, to save some time.
>>> * Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and these calls were easily replaced by simple calls to substr() to save some time.
>>>
>>> A real-world test I performed with Grinder took 3 min 28 s before the changes (of which ~1 min was spent doing something unrelated). After the changes, the same test took only 2 min 28 s. So, it's quite a significant improvement and on more specific test cases, performance gains can obviously be much bigger. Also, I anticipate that the gains would be bigger for even longer sequences.
>>>
>>> All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of Bioperl, so please, have a look at it and let me know if there are any glaring errors.
>>>
>>> Thanks,
>>>
>>> Florent
>>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
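
The substr()-based extraction mentioned in the quoted list can be illustrated with a few lines of core Perl. This is a minimal sketch, assuming 1-based inclusive coordinates as used by subseq(); subseq_1based() and its error message are hypothetical, and the real method handles more cases than this.

```perl
use strict;
use warnings;

# Hypothetical stand-alone function (an assumption for illustration),
# not the actual subseq() method.
sub subseq_1based {
    my ($seq, $start, $end) = @_;
    my $len = length $seq;
    die "Bad start/end ($start, $end) for a sequence of length $len\n"
        if $start < 1 || $end > $len || $start > $end;
    # Coordinates are 1-based and inclusive, whereas substr() takes a
    # 0-based offset and a length. No string eval is needed.
    return substr($seq, $start - 1, $end - $start + 1);
}

print subseq_1based('ACGTACGT', 2, 5), "\n";    # CGTA
```

A direct substr() call avoids compiling code at run time, which is the cost the quoted message attributes to eval().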


From shlomif at shlomifish.org  Tue Nov  6 07:27:00 2012
From: shlomif at shlomifish.org (Shlomi Fish)
Date: Tue, 6 Nov 2012 14:27:00 +0200
Subject: [Bioperl-l] [Request] Please Help Add Some Information about
 Perl for Bio-Informatics to http://perl-begin.org/uses/bio-info/
In-Reply-To: <20121026192203.6d1e59c0@lap.shlomifish.org>
References: <20121026192203.6d1e59c0@lap.shlomifish.org>
Message-ID: <20121106142700.192f456e@lap.shlomifish.org>

Hi,

Can anyone help with that?

Regards,

	Shlomi Fish

On Fri, 26 Oct 2012 19:22:03 +0200
Shlomi Fish <shlomif at shlomifish.org> wrote:

> Hi all,
> 
> I am the maintainer of http://perl-begin.org/ , the Perl Beginners' Site. I
> had this page there for a long time, but it's empty:
> 
> http://perl-begin.org/uses/bio-info/
> 
> Can someone help me add some information there? A short XHTML page will be OK.
> For reference, see the other pages in the section
> ( http://perl-begin.org/uses/ ) such as:
> 
> * http://perl-begin.org/uses/web/
> 
> * http://perl-begin.org/uses/sys-admin/
> 
> * http://perl-begin.org/uses/qa/
> 
> Note that you agree that the content will be licensed under the Creative
> Commons Attribution 3.0 Unported License (or higher versions) and so you
> should make sure it is original.
> 
> I shall be obliged for any help.
> 
> Regards,
> 
> 	Shlomi Fish
> 


-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
Perl Humour - http://perl-begin.org/humour/

A wiseman can learn from a fool much more than a fool can ever learn from a
wiseman.               - http://en.wikiquote.org/wiki/Cato_the_Elder

Please reply to list if it's a mailing list post - http://shlom.in/reply .


From florent.angly at gmail.com  Thu Nov 15 11:29:30 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Fri, 16 Nov 2012 02:29:30 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <5098EF50.5040208@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
	<5098EF50.5040208@gmail.com>
Message-ID: <50A5186A.4060304@gmail.com>

I now merged the branch with master.
Best,
Florent

On 06/11/12 21:06, Florent Angly wrote:
> Yes, good idea, Chris.
>
> Actually, thinking about it, most of these warnings were redundant. 
> So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that 
> it issues exceptions if requested.
>
> Florent
>
>
> On 05/11/12 12:43, Fields, Christopher J wrote:
>> Florent,
>>
>> Ran tests on it, they pass but I am seeing this (if these are 
>> expected, you can catch the warnings using Test::Warn):
>>
>> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr 
>> t/Seq/PrimarySeq.t
>> t/Seq/PrimarySeq.t .. 1/167
>> --------------------- WARNING ---------------------
>> MSG: Got a sequence without letters. Could not guess alphabet
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is 
>> \,$,+
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
>> ---------------------------------------------------
>> t/Seq/PrimarySeq.t .. ok
>> All tests successful.
>> Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys + 0.18 
>> cusr  0.01 csys =  0.23 CPU)
>> Result: PASS
>>
>> chris
>>
>> On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
>>   wrote:
>>
>>> I am planning on merging the branch with master this week.
>>> Best,
>>> Florent
>>>
>>>
>>> On 01/11/12 15:49, Florent Angly wrote:
>>>> Hi all,
>>>>
>>>> I was working with Ben Woodcroft on identifying ways to speed up 
>>>> Grinder, which relies heavily on Bioperl. Ben did some profiling 
>>>> with NYTProf and we realized that a lot of computation time was 
>>>> spent in Bio::PrimarySeq, doing calls to subseq() and length(). The 
>>>> sequences we used for the profiling were microbial genomes, i.e. 
>>>> several Mbp long sequences, which is quite long. A lot of the 
>>>> performance cost was associated with passing full genomes between 
>>>> functions. For example, when doing a call to length(), length() 
>>>> requests the full sequence from seq(), which returns it back to 
>>>> length() (it makes a copy!). So, every call to length is very 
>>>> expensive for long sequences. And there is a lot of code that calls 
>>>> length(), for error checking.
>>>>
>>>> I know that there are a few Bioperl modules that are more adapted 
>>>> to handling very long sequences, e.g. Bio::DB::Fasta or 
>>>> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look 
>>>> at Bio::PrimarySeq with Ben and I released this commit: 
>>>> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
>>>> But in fact, there were more things that I wanted to try to 
>>>> improve, which led me to start this new branch: 
>>>> https://github.com/bioperl/bioperl-live/tree/seqlength
>>>>
>>>> I wrote quite a few tests for functionalities that were not 
>>>> previously covered by tests, and tried to improve the 
>>>> documentation. In addition, to address the speed issue, I did some 
>>>> changes to Bio::PrimarySeq and Bio::PrimarySeqI :
>>>> * The length of a sequence is now computed as soon as the sequence
>>>> is set, not after. This way, there is no extra call to seq() (which
>>>> would incur the cost of copying the entire sequence between
>>>> functions).
>>>> * The length is saved as an object attribute. So, calling length()
>>>> is very cheap since it only needs to retrieve the stored value for
>>>> the length.
>>>> * There is a constructor argument called -direct, which skips
>>>> sequence validation. However, it was only active in conjunction
>>>> with the -ref_to_seq argument. To make -direct conform better to
>>>> its documented purpose, I made -direct work when a sequence is set
>>>> through -seq as well.
>>>> * This brings us to trunc(), revcom() and other methods of
>>>> Bio::PrimarySeqI. Since all these methods create a new
>>>> Bio::PrimarySeq object from an existing (already validated!)
>>>> Bio::PrimarySeq object, the new object can be constructed with
>>>> -direct, to save some time.
>>>> * Finally, I noticed that subseq() used calls to eval() to do its
>>>> work. eval() is notoriously slow and these calls were easily
>>>> replaced by simple calls to substr() to save some time.
>>>>
>>>> A real-world test I performed with Grinder took 3m28s before the 
>>>> changes (and ~1 min is spent doing something unrelated). After the 
>>>> changes, the same test took only 2min28s. So, it's quite a 
>>>> significant improvement and on more specific test cases, 
>>>> performance gains can obviously be much bigger. Also, I anticipate 
>>>> that the gains would be bigger for even longer sequences.
>>>>
>>>> All the changes I made are meant to be backward compatible and all 
>>>> the tests in the Bioperl test suite passed. So, there _should_ not 
>>>> be any issues. However, I know that Bio::PrimarySeq is a central 
>>>> module of Bioperl, so please, have a look at it and let me know if 
>>>> there are any glaring errors.
>>>>
>>>> Thanks,
>>>>
>>>> Florent
>>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>


From mahakadry at aucegypt.edu  Tue Nov 20 13:44:53 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Tue, 20 Nov 2012 20:44:53 +0200
Subject: [Bioperl-l] Parsing a blast report with multiple queries into
 separate one query files that only contain the fasta sequences
Message-ID: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>

Dear BioPerl list,
I blasted a file that has several fasta queries against nr, however I need
to align each query with its hits for further computational analysis so I
need to parse the produced blast report into several files that each has
only the fasta query sequence and its hits in fasta format.
I found this script online,

use Bio::Search::Result::BlastResult;
use Bio::SearchIO;

my $report = Bio::SearchIO->new( -file => 'full-report.bls', -format => blast );
my $result = $report->next_result;
my %hits_by_query;
while (my $hit = $result->next_hit) {
  push @{$hits_by_query{$hit->name}}, $hit;
}

foreach my $qid ( keys %hits_by_query ) {
  my $result = Bio::Search::Result::BlastResult->new();
  $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
  my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format => 'blast' );
  $blio->write_result($result);
}


however on using it this produced the following error message


BlastResult::new(): Not adding iterations.

------------- EXCEPTION: Bio::Root::NoSuchThing -------------
MSG: No such iteration number: 0. Valid range=1-0
VALUE: The number zero (0)
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
STACK: Bio::Search::Result::BlastResult::iteration
/usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
STACK: Bio::Search::Result::BlastResult::add_hit
/usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
STACK: ./parsing.blast.results.into.per.query.files.pl:15

I tried to search for other scripts but I couldn't find any
I would really appreciate your comments to this
Thank you

From cjfields at illinois.edu  Tue Nov 20 14:21:25 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Nov 2012 19:21:25 +0000
Subject: [Bioperl-l] Parsing a blast report with multiple queries into
 separate one query files that only contain the fasta sequences
In-Reply-To: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>
References: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF22E15@CITESMBX5.ad.uillinois.edu>

Maha,

Do you need only the sequence reported in the report (e.g. the HSP alignments) or the original FASTA sequences?  

The former can be recovered from the Bio::Search::HSP::GenericHSP objects as an alignment, and this can be redirected to a FASTA file.  The latter is a little trickier, as you will have to retrieve the sequences from their original source files.  
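
A minimal sketch of the first option, assuming a standard text BLAST report: it loops over every result (one per query) rather than calling next_result only once, and writes each HSP alignment to a per-query FASTA file through Bio::AlignIO. The sub names split_report_by_query and safe_filename, and the output naming scheme, are made up for illustration; the sequences written are the aligned (gapped) regions from the report, not the full original sequences.

```perl
use strict;
use warnings;
use File::Spec;

# Turn a query name into something safe to use as a file name.
sub safe_filename {
    my ($name) = @_;
    (my $safe = $name) =~ s/[^\w.-]+/_/g;
    return $safe;
}

# Write one FASTA file per query; each file holds every HSP alignment
# (query and hit strings as reported by BLAST) for that query.
# Returns the number of queries seen in the report.
sub split_report_by_query {
    my ($report_file, $out_dir) = @_;

    # BioPerl is loaded at call time, so safe_filename() works without it.
    require Bio::SearchIO;
    require Bio::AlignIO;

    my $report = Bio::SearchIO->new(-file => $report_file, -format => 'blast');
    my $n_queries = 0;
    while (my $result = $report->next_result) {        # one result per query
        my $path = File::Spec->catfile($out_dir,
            safe_filename($result->query_name) . '.fasta');
        my $out = Bio::AlignIO->new(-file => ">$path", -format => 'fasta');
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                $out->write_aln($hsp->get_aln);         # HSP as an alignment
            }
        }
        $n_queries++;
    }
    return $n_queries;
}

# Example call (needs BioPerl and a report file):
# split_report_by_query('full-report.bls', '.');
```

If the full-length hit sequences are needed instead, the hit names collected in the inner loop would have to be looked up in the original database, as described above.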

chris

On Nov 20, 2012, at 12:44 PM, maha ahmed <mahakadry at aucegypt.edu> wrote:

> Dear BioPerl list,
> I blasted a file that has several fasta queries against nr, however I need
> to align each query with its hits for further computational analysis so I
> need to parse the produced blast report into several files that each has
> only the fasta query sequence and its hits in fasta format.
> I found this script online,
> 
> use Bio::Search::Result::BlastResult;
> use Bio::SearchIO;
>
> my $report = Bio::SearchIO->new( -file => 'full-report.bls', -format => blast );
> my $result = $report->next_result;
> my %hits_by_query;
> while (my $hit = $result->next_hit) {
>   push @{$hits_by_query{$hit->name}}, $hit;
> }
>
> foreach my $qid ( keys %hits_by_query ) {
>   my $result = Bio::Search::Result::BlastResult->new();
>   $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
>   my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format => 'blast' );
>   $blio->write_result($result);
> }
> 
> 
> 
> however on using it this produced the following error message
> 
> 
> 
> BlastResult::new(): Not adding iterations.
> 
> ------------- EXCEPTION: Bio::Root::NoSuchThing -------------
> MSG: No such iteration number: 0. Valid range=1-0
> VALUE: The number zero (0)
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
> STACK: Bio::Search::Result::BlastResult::iteration
> /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
> STACK: Bio::Search::Result::BlastResult::add_hit
> /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
> STACK: ./parsing.blast.results.into.per.query.files.pl:15
> 
> I tried to search for other scripts but I couldn't find any
> I would really appreciate your comments to this
> Thank you
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From rfhorns at gmail.com  Thu Nov  1 20:01:34 2012
From: rfhorns at gmail.com (Felix Horns)
Date: Fri, 02 Nov 2012 00:01:34 -0000
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
Message-ID: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>

Hello everyone.

I am having trouble using the get_Stream_by_query() function
in Bio::DB::GenBank.  It seems to return an empty stream, such that
$stream->next_seq never returns anything.

However, $query->count is returning the expected value (139).  Also,
get_Stream_by_query() seems to be querying the database, as when I pass it
an array of GeneIDs that have not been properly formatted, i.e.
GeneID:7816864, instead of simply 7816864, it returns warnings and errors:
"MSG: Warning(s) from GenBank: <PhraseNotFound>GeneID 7817709...; MSG:
Error from Genbank: No items found.".

I have included my full code below. I have also included the output from
the code below that.  The code is intended to find genes located within a
genomic region. I will later find the protein domains and pathways that
those genes are involved in.

Any help would be greatly appreciated.  I realize that this is probably a
very simple question, but I am relatively new to BioPerl and I've spent the
better part of the day trying to figure out such issues, so I would be very
thankful for help.

Felix


#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::DB::EntrezGene;
use Bio::DB::GenBank;

# Load reference sequence
# Load from local .gb file
# Note that .gb file does not include sequences
# my $gbfile = "NC_012660.1.gb";
# my $seqio = Bio::SeqIO->new(-file => $gbfile);
# my $ref_seq = $seqio->next_seq;

# Access the reference sequence programmatically
# (comment this out to use the local .gb file above instead)
my $gb = Bio::DB::GenBank->new;
my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");

# Specify coordinates of gap
my $gap_start = 2050506;
my $gap_end = 2190530;

my $gene_count = 0;
my @features;
my @starts;
my @ends;
my @db_xrefs;

my @products;
my @protein_ids;

# Get gene features in gap
for my $feat ($ref_seq->get_SeqFeatures) {
  my $start=$feat->location->start;
  my $end=$feat->location->end;

  if (($feat->primary_tag eq 'gene') &&
      ($gap_start < $start) && ($start < $gap_end) &&
      ($gap_start < $end) && ($end < $gap_end)) {

    $gene_count += 1;

    # Get GeneID reference
    my $db_xref = ($feat->get_tag_values('db_xref'))[0];
    $db_xref =~ s/GeneID://;    # Trim "GeneID:" from start of $db_xref

    push @features, $feat;
    push @starts, $start;
    push @ends, $end;
    push @db_xrefs, $db_xref;
  }
}

# Get data about gene features from GeneID reference
my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
 -ids => [@db_xrefs]);
my $stream = $gb->get_Stream_by_query($query);

while (my $seq = $stream->next_seq) {
  for my $feat ($seq->all_SeqFeatures) {
    print "primary tag: ", $feat->primary_tag, "\n";
    for my $tag ($feat->get_all_tags) {
      print "  tag: ", $tag, "\n";
      for my $value ($feat->get_tag_values($tag)) {
        print "    value: ", $value, "\n";
      }
    }
  }
}

print $query->count,"\n";
print $gene_count, "\n";


OUTPUT
> perl analyze_gap.pl
139
139

Note that no "primary tag; tag; value" items are printed.  Furthermore,
when I put a print line immediately after the (while (my $seq =
$stream->next_seq)) statement, it was never called, seemingly indicating
that the stream is empty.

From mooldhu at gmail.com  Tue Nov  6 02:38:57 2012
From: mooldhu at gmail.com (=?GB2312?B?uvq9rQ==?=)
Date: Tue, 6 Nov 2012 15:38:57 +0800
Subject: [Bioperl-l] Ask for help about Bioperl
Message-ID: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>

Hi,
When I use BioPerl, it reports errors like this:

--------------------- WARNING ---------------------
MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
---------------------------------------------------
Error providing evidence type: GeneModel
The error was:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Attempting to set the sequence '1' to
[)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383


But I am sure that the input file only contains [ATGCN]. I also tried
other sequences, but the errors are the same. My BioPerl version is
bioperl-live 1.006902.

-- 
????


From assayagy at gmail.com  Sat Nov 10 13:27:03 2012
From: assayagy at gmail.com (eyla4ever)
Date: Sat, 10 Nov 2012 10:27:03 -0800 (PST)
Subject: [Bioperl-l] Extracting sequences from Genbank files
In-Reply-To: <CAJLmuDKPBA_DUtnfQjcAXr+JU=nL+orBP4cCgwBi=4BWQiYmpw@mail.gmail.com>
References: <CAJLmuDKPBA_DUtnfQjcAXr+JU=nL+orBP4cCgwBi=4BWQiYmpw@mail.gmail.com>
Message-ID: <34664632.post@talk.nabble.com>


Hello Brian,

I would like you to send me your script. I think it can help me solve a
big problem and finish my final project.
I hope that will be possible.

Regards, Eyla


BForde wrote:
> 
> Hello,
> 
> I have been modifying a script which extracts all the protein sequences
> from a genbank file and saves them in a multi-fasta file.
> 
> I wish the fasta header to have both the locus_tag of the protein and the
> product. However I cannot get the  product tag to write to the fasta
> header
> 
> this is the relevant section of the script
> 
>  $s->display_id($f->has_tag('locus_tag') ? join(',',sort
> $f->each_tag_value('locus_tag')) :
>                            $f->has_tag('product') ?
> join(',',$f->each_tag_value('product')):
>                            $s->display_id);
> 
> is "product" not an actual tag
> 
> regards
> 
> Brian
> 
> 
> 
> -- 
> Brian Forde
> Microbiology Dept.
> Bioscience Institute. Room 4.11
> University College Cork
> Cork
> Ireland
> tel:+353 21 4901306
> email: b.m.forde at umail.ucc.ie
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 

-- 
View this message in context: http://old.nabble.com/Extracting-sequences-from-Genbank-files-tp33901023p34664632.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From bosborne11 at verizon.net  Tue Nov 20 18:50:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:50:00 -0500
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
In-Reply-To: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>
References: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>
Message-ID: <5F077DEA-DEBD-42BC-87E7-327697764CFE@verizon.net>

Felix,

I took a look at the Bio::DB::Query::GenBank documentation, it says this:

If you provide an array reference of IDs in -ids, the query will be ignored and the list of IDs will be used when the query is passed to a Bio::DB::GenBank object's get_Stream_by_query() method. 

Bio::DB::GenBank queries "nucleotide" by default. You have GeneIDs. I see that you're setting "-db" to "gene", but note that you're passing that query to a plain Bio::DB::GenBank object. Not sure what the expected behavior is here.

I would try using the NCBI Eutilities for that second query, rather than Bio::DB::Query::GenBank (http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook).
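
A rough sketch of what that could look like, following the esummary pattern from the EUtilities Cookbook. The sub names gene_id_from_xref and print_gene_summaries are hypothetical, the e-mail address is whatever you supply, and the fields NCBI returns may differ from what your BioPerl version expects, so treat this as a starting point rather than a tested recipe.

```perl
use strict;
use warnings;

# Extract the numeric ID from a db_xref value such as "GeneID:7816864".
# Returns undef for any other kind of cross-reference.
sub gene_id_from_xref {
    my ($xref) = @_;
    return $xref =~ /^GeneID:(\d+)$/ ? $1 : undef;
}

# Print the summary fields NCBI returns for each Gene ID.
sub print_gene_summaries {
    my ($email, @gene_ids) = @_;

    # Loaded at call time; this part needs BioPerl and network access.
    require Bio::DB::EUtilities;

    my $factory = Bio::DB::EUtilities->new(
        -eutil => 'esummary',
        -db    => 'gene',
        -id    => \@gene_ids,
        -email => $email,
    );
    while (my $docsum = $factory->next_DocSum) {
        print "GeneID: ", $docsum->get_id, "\n";
        while (my $item = $docsum->next_Item('flattened')) {
            my $content = $item->get_content;
            next unless defined $content && length $content;   # skip empty items
            printf "  %-20s %s\n", $item->get_name, $content;
        }
    }
    return;
}

# Example call (address is a placeholder):
# print_gene_summaries('you@example.org', 7816864, 7817709);
```

Checking each db_xref with gene_id_from_xref() also guards against features whose first db_xref is not a GeneID.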

Brian O.

On Nov 1, 2012, at 8:01 PM, Felix Horns <rfhorns at gmail.com> wrote:

> Hello everyone.
> 
> I am having trouble using the get_Stream_by_query() function
> in Bio::DB::GenBank.  It seems to return an empty stream, such that
> $stream->next_seq never returns anything.
> 
> However, $query->count is returning the expected value (139).  Also,
> get_Stream_by_query() seems to be querying the database, as when I pass it
> an array of GeneIDs that have not been properly formatted, i.e.
> GeneID:7816864, instead of simply 7816864, it returns warnings and errors:
> "MSG: Warning(s) from GenBank: <PhraseNotFound>GeneID 7817709...; MSG:
> Error from Genbank: No items found.".
> 
> I have included my full code below. I have also included the output from
> the code below that.  The code is intended to find genes located within a
> genomic region. I will later find the protein domains and pathways that
> those genes are involved in.
> 
> Any help would be greatly appreciated.  I realize that this is probably a
> very simple question, but I am relatively new to BioPerl and I've spent the
> better part of the day trying to figure out such issues, so I would be very
> thankful for help.
> 
> Felix
> 
> 
> #!/usr/bin/perl
> use strict;
> use Bio::SeqIO;
> use Bio::DB::EntrezGene;
> use Bio::DB::GenBank;
> 
> # Load reference sequence
> # Load from local .gb file
> # Note that .gb file does not include sequences
> # my $gbfile = "NC_012660.1.gb";
> # my $seqio = Bio::SeqIO->new(-file => $gbfile);
> # my $ref_seq = $seqio->next_seq;
> 
> # To access reference sequence programatically, uncomment this code
> my $gb = new Bio::DB::GenBank;
> my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");
> 
> # Specify coordinates of gap
> my $gap_start = 2050506;
> my $gap_end = 2190530;
> 
> my $gene_count = 0;
> my @features;
> my @starts;
> my @ends;
> my @db_xrefs;
> 
> my @products;
> my @protein_ids;
> 
> # Get gene features in gap
> for my $feat ($ref_seq->get_SeqFeatures) {
>  my $start=$feat->location->start;
>  my $end=$feat->location->end;
> 
>  if (($feat->primary_tag eq 'gene') &
>      ($gap_start < $start) & ($start < $gap_end) &
>      ($gap_start < $end) & ($end < $gap_end)) {
> 
>    $gene_count += 1;
> 
>    # Get GeneID reference
>    my $db_xref = ($feat->get_tag_values('db_xref'))[0];
>    $db_xref =~ s/GeneID://;    # Trim "GeneID:" from start of $db_xref
> 
>    push @features, $feat;
>    push @starts, $start;
>    push @ends, $end;
>    push @db_xrefs, $db_xref;
>  }
> }
> 
> # Get data about gene features from GeneID reference
> my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
> -ids => [@db_xrefs]);
> my $stream = $gb->get_Stream_by_query($query);
> 
> while (my $seq = $stream->next_seq) {
>  for my $feat ($seq->all_SeqFeatures) {
>    print "primary tag: ", $feat->primary_tag, "\n";
>    for my $tag ($feat->get_all_tags) {
>      print "  tag: ", $tag, "\n";
>      for my $value ($feat->get_tag_values($tag)) {
> print "    value: ", $value, "\n";
>      }
>    }
>  }
> }
> 
> print $query->count,"\n";
> print $gene_count, "\n";
> 
> 
> OUTPUT
>> perl analyze_gap.pl
> 139
> 139
> 
> Note that no "primary tag; tag; value" items are printed.  Furthermore,
> when I put a print line immediately after the (while (my $seq =
> $stream->next_seq)) statement, it was never called, seemingly indicating
> that the stream is empty.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From bosborne11 at verizon.net  Tue Nov 20 18:52:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:52:00 -0500
Subject: [Bioperl-l] Ask for help about Bioperl
In-Reply-To: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>
References: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>
Message-ID: <3B472C83-7F6C-41E2-A629-E6A2BDC6B075@verizon.net>

????,

You're going to have to show us your code; we can't help you just by seeing the error messages. Show us the input file as well, or the beginning of it.

Brian O.


On Nov 6, 2012, at 2:38 AM, ???? <mooldhu at gmail.com> wrote:

> hi,
> when I use bioperl ,it report errors like this :---------------------
> WARNING ---------------------
> MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
> ---------------------------------------------------
> Error providing evidence type: GeneModel
> The error was:
> 
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Attempting to set the sequence '1' to
> [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
> STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
> STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
> STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383
> 
> 
> but,I am sure that the input file only cotain [ATGCN],I also try to use
> another sequences ,but the errors are the same.my bioperl is Bioperl-live
> 1.006902;
> 
> -- 
> ????
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From hlapp at drycafe.net  Tue Nov 20 21:24:50 2012
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Tue, 20 Nov 2012 21:24:50 -0500
Subject: [Bioperl-l] handle with file in perl
In-Reply-To: <34626730.post@talk.nabble.com>
References: <34626730.post@talk.nabble.com>
Message-ID: <1DE09B34-5124-478C-8925-0045EC119CFC@drycafe.net>

This sounds like a homework assignment. We're not here to do your homework or assignments for you. You can post if you run into a specific problem when solving your assignment with Bioperl, and we'll help with that. 

-hilmar

Sent with a tap.

On Oct 31, 2012, at 7:45 PM, eyla4ever <assayagy at gmail.com> wrote:

> 
> hi 
> 
> i want to write a function that get as parameters : file_name, hsp , hit.
> and i want her to print all the blast Field that i need to this file.
> 
> i do it because i have 2 files with the same Fields.
>        
> 
> 10X
> -- 
> View this message in context: http://old.nabble.com/handle-with-file-in-perl-tp34626730p34626730.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From mahakadry at aucegypt.edu  Fri Nov 23 20:33:59 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 03:33:59 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <CAE=MQgzV7Z-bQxiQFDuHHaEr=XL5=cTc6Mhq1Su7VPByZNEP6Q@mail.gmail.com>

Dear Bioperl list,
I have a folder that has 60,000 files (one file for each phylogenetic tree).
However, I only need to work with a subset of 1,000 files from that folder
(the files are not numbered in order, so I can't use an i++ loop in my
BioPerl script).
Is there a way to write a script that only moves the files whose names are
given in a list in a text file?
That is, I have a file with the names of the files I want to copy from the
folder, and I want to write a script that does this.
Thank you so much

From kellert at ohsu.edu  Sat Nov 24 13:08:11 2012
From: kellert at ohsu.edu (Tom Keller)
Date: Sat, 24 Nov 2012 10:08:11 -0800
Subject: [Bioperl-l] use cookbook to work with a directory of files
In-Reply-To: <mailman.7.1353776405.32614.bioperl-l@lists.open-bio.org>
References: <mailman.7.1353776405.32614.bioperl-l@lists.open-bio.org>
Message-ID: <C969FE0E-18FE-4771-B031-22EEA42AEA77@ohsu.edu>

A search with the phrase "perl cookbook filenames from directory" should help you find what you need.

On Nov 24, 2012, at 9:00 AM, bioperl-l-request at lists.open-bio.org wrote:

> Send Bioperl-l mailing list submissions to
> 	bioperl-l at lists.open-bio.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://lists.open-bio.org/mailman/listinfo/bioperl-l
> or, via email, send a message with subject or body 'help' to
> 	bioperl-l-request at lists.open-bio.org
> 
> You can reach the person managing the list at
> 	bioperl-l-owner at lists.open-bio.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Bioperl-l digest..."
> 
> 
> Today's Topics:
> 
>   1.  retrieving a subset of files from a folder (maha ahmed)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Sat, 24 Nov 2012 03:33:59 +0200
> From: maha ahmed <mahakadry at aucegypt.edu>
> Subject: [Bioperl-l] retrieving a subset of files from a folder
> To: Bioperl-l at lists.open-bio.org
> Message-ID:
> 	<CAE=MQgzV7Z-bQxiQFDuHHaEr=XL5=cTc6Mhq1Su7VPByZNEP6Q at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Dear Bioperl list,
> I have a folder that has 60,000 files (one file for each phylogenetic tree)
> However I only need to work with a subset of 1,000 files from that folder
> (the files are not numbered in order so I cant use the i++ loop in my
> bioperl script)
> Is there a way to write a script that only moves files with the names given
> in a list in a text file
> i.e. I have a file that has the names of the files I want to copy fro m the
> folder and I want to write script that does this
> Thank you so much
> 
> 
> ------------------------------
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> End of Bioperl-l Digest, Vol 115, Issue 8
> *****************************************


From minou.nowrousian at rub.de  Sat Nov 24 13:24:02 2012
From: minou.nowrousian at rub.de (Minou Nowrousian)
Date: 24 Nov 2012 19:24:02 +0100
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <000001cdca70$e1a97720$a4fc6560$@rub.de>


>Dear Bioperl list,
>I have a folder that has 60,000 files (one file for each phylogenetic tree)
>However I only need to work with a subset of 1,000 files from that folder
>(the files are not numbered in order so I cant use the i++ loop in my
>bioperl script)
>Is there a way to write a script that only moves files with the names given
>in a list in a text file
>i.e. I have a file that has the names of the files I want to copy from the
>folder and I want to write script that does this
>Thank you so much

I don't know if there is a BioPerl solution, but you could use the
File::Copy module (it ships with Perl):

use File::Copy;
copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy failed: $!";
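
To apply this to a whole list, one possible sketch reads the file names from a text file (one per line) and copies each one. The sub name copy_listed_files and the directory arguments are placeholders for illustration; File::Copy also exports move() if the files should be moved rather than copied.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Spec;

# Copy every file named in $list_file (one name per line) from $src_dir
# to $dst_dir. Returns the names that could not be copied.
sub copy_listed_files {
    my ($list_file, $src_dir, $dst_dir) = @_;
    open my $fh, '<', $list_file or die "Cannot open $list_file: $!";
    my @failed;
    while (my $name = <$fh>) {
        $name =~ s/^\s+|\s+$//g;    # trim whitespace, including the newline
        next if $name eq '';        # skip blank lines
        my $from = File::Spec->catfile($src_dir, $name);
        my $to   = File::Spec->catfile($dst_dir, $name);
        copy($from, $to) or push @failed, $name;
    }
    close $fh;
    return @failed;
}

# Example call (paths are placeholders):
# my @missing = copy_listed_files('file_list.txt', 'all_trees', 'subset');
# warn "Not copied: @missing\n" if @missing;
```

Collecting the failures instead of dying on the first one means a single misspelled name in the list does not stop the other files from being copied.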

Regards,
Minou


From mahakadry at aucegypt.edu  Sat Nov 24 14:04:09 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 21:04:09 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
In-Reply-To: <000001cdca70$e1a97720$a4fc6560$@rub.de>
References: <000001cdca70$e1a97720$a4fc6560$@rub.de>
Message-ID: <CAE=MQgztf_isVyt=WPF9LMXCtX4Q2U9vHL1AV+TwpueUjKayuw@mail.gmail.com>

Thanks everyone. I actually found a one-line command that I am going to
try:
xargs -a file_list.txt mv -t /path/to/des
Thanks for your help; I will have a look at the readings you suggested.
Thank you

On Sat, Nov 24, 2012 at 8:24 PM, Minou Nowrousian
<minou.nowrousian at rub.de>wrote:

>
> >Dear Bioperl list,
> >I have a folder that has 60,000 files (one file for each phylogenetic
> tree)
> However I only need to work with a subset of 1,000 files from that folder
> >(the files are not numbered in order so I cant use the i++ loop in my
> bioperl script) Is there a way to write a script that only moves files with
> the >names given in a list in a text file i.e. I have a file that has the
> names of the files I want to copy fro m the folder and I want to write
> script that does >this Thank you so much
>
> I don't know if there is a BioPerl solution, but you could use the
> File::Copy module (available from CPAN):
>
> use File::Copy;
>  copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy
> failed: $!";
>
> Regards,
> Minou
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From maj at fortinbras.us  Tue Nov 27 08:49:46 2012
From: maj at fortinbras.us (Mark A. Jensen)
Date: Tue, 27 Nov 2012 13:49:46 +0000
Subject: [Bioperl-l] Neo4j : applying user defined validation and constraints
Message-ID: <W2391426705276111354024186@webmail57>

Hi Folks,
Since there was some enthusiasm about REST::Neo4p, my interface to Neo4j, I thought I would let you know about
https://metacpan.org/module/REST::Neo4p::Constrain
This is a framework that lets you apply constraints on node and relationship properties, relationships, and relationship types. You can specify your constraints, and have REST::Neo4p throw exceptions when the constraints aren't met, or you can do validation on existing database items. The pod has a full explanation and examples aplenty.

Please have a look and send bugs my way via RT.
Cheers all,
MAJ


From francescomusacchia at gmail.com  Wed Nov 28 05:27:16 2012
From: francescomusacchia at gmail.com (Francesco Musacchia)
Date: Wed, 28 Nov 2012 02:27:16 -0800 (PST)
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
Message-ID: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>

Hi all,
I have a big problem using a GFF3 database with BioPerl. This is not a
question about how to write some BioPerl code. I'm finding that when I
have to do a lot of accesses on a GFF database (with
Bio::DB::SeqFeature::Store), it gets slower and slower, to the point
that my script can run for more than a day.

How can I solve this? Or can it not be done?

Thanks a lot!

From florent.angly at gmail.com  Thu Nov  1 01:49:13 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Thu, 01 Nov 2012 15:49:13 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
Message-ID: <50920D59.4010307@gmail.com>

Hi all,

I was working with Ben Woodcroft on identifying ways to speed up 
Grinder, which relies heavily on Bioperl. Ben did some profiling with 
NYTProf and we realized that a lot of computation time was spent in 
Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we 
used for the profiling were microbial genomes, i.e. several Mbp long 
sequences, which is quite long. A lot of the performance cost was 
associated with passing full genomes between functions. For example, 
when doing a call to length(), length() requests the full sequence from 
seq(), which returns it back to length() (it makes a copy!). So, every 
call to length is very expensive for long sequences. And there is a lot 
of code that calls length(), for error checking.

I know that there are a few Bioperl modules that are more adapted to 
handling very long sequences, e.g. Bio::DB::Fasta or 
Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at 
Bio::PrimarySeq with Ben and I released this commit: 
https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
But in fact, there were more things that I wanted to try to improve, 
which led me to start this new branch: 
https://github.com/bioperl/bioperl-live/tree/seqlength

I wrote quite a few tests for functionalities that were not previously 
covered by tests, and tried to improve the documentation. In addition, 
to address the speed issue, I did some changes to Bio::PrimarySeq and 
Bio::PrimarySeqI :
* The length of a sequence is now computed as soon as the sequence is 
  set, not after. This way, there is no extra call to seq() (which would 
  incur the cost of copying the entire sequence between functions).
* The length is saved as an object attribute. So, calling length() is 
  very cheap since it only needs to retrieve the stored value for the 
  length.
* There is a constructor argument called -direct, which skips sequence 
  validation. However, it was only active in conjunction with the 
  -ref_to_seq argument. To make -direct conform better to its 
  documented purpose, I made it work when a sequence is set through 
  -seq as well.
* This brings us to trunc(), revcom() and other methods of 
  Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq 
  object from an existing (already validated!) Bio::PrimarySeq object, 
  the new object can be constructed with -direct, to save some time.
* Finally, I noticed that subseq() used calls to eval() to do its work. 
  eval() is notoriously slow and these calls were easily replaced by 
  simple calls to substr() to save some time.
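
To illustrate the first two points and the substr() change, here is a 
minimal sketch. ToySeq is a hypothetical stand-in class written for this 
example; it is not the Bio::PrimarySeq code from the branch, and the 
1-based inclusive coordinates in subseq() are assumed to mirror the 
BioPerl convention.

```perl
use strict;
use warnings;

package ToySeq;    # hypothetical stand-in, not Bio::PrimarySeq

sub new {
    my ($class, %args) = @_;
    my $self = bless { seq => '', length => 0 }, $class;
    $self->seq($args{seq}) if defined $args{seq};
    return $self;
}

# Getter/setter: the length is computed once, when the sequence is set.
sub seq {
    my ($self, $value) = @_;
    if (defined $value) {
        $self->{seq}    = $value;
        $self->{length} = CORE::length($value);
    }
    return $self->{seq};
}

# Cheap: returns the stored value, never touches the sequence string.
sub length {
    my ($self) = @_;
    return $self->{length};
}

# Plain substr() on the stored string, no eval(); 1-based, inclusive.
sub subseq {
    my ($self, $start, $end) = @_;
    die "bad coordinates\n"
        if $start < 1 || $end > $self->{length} || $start > $end;
    return substr($self->{seq}, $start - 1, $end - $start + 1);
}

package main;

my $s = ToySeq->new(seq => 'ACGTACGTAC');
print $s->length, "\n";          # 10
print $s->subseq(3, 6), "\n";    # GTAC
```

The point of the design is that length() and the bounds check in 
subseq() read an integer attribute instead of asking seq() for a copy 
of a multi-Mbp string.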

A real-world test I performed with Grinder took 3m28s before the changes 
(of which ~1 min is spent doing something unrelated). After the changes, 
the same test took only 2m28s. So it's quite a significant improvement, 
and on more specific test cases the performance gains can be much 
bigger. Also, I anticipate that the gains would be bigger for even 
longer sequences.
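
The arithmetic behind these figures, as a small sketch. The 60 seconds 
used for the unrelated work is only the approximate "~1 min" from the 
paragraph above, so the second percentage is a rough estimate.

```perl
use strict;
use warnings;

my $before    = 3 * 60 + 28;    # 3m28s in seconds
my $after     = 2 * 60 + 28;    # 2m28s in seconds
my $unrelated = 60;             # "~1 min" of unrelated work (approximate)

my $saved = $before - $after;

# Reduction relative to the whole run
printf "saved %d s, %.1f%% of the total run\n",
    $saved, 100 * $saved / $before;

# Reduction relative to the part of the run that the changes can affect
printf "%.1f%% of the time excluding the unrelated work\n",
    100 * $saved / ($before - $unrelated);
```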

All the changes I made are meant to be backward compatible and all the 
tests in the Bioperl test suite passed. So, there _should_ not be any 
issues. However, I know that Bio::PrimarySeq is a central module of 
Bioperl, so please, have a look at it and let me know if there are any 
glaring errors.

Thanks,

Florent


From shalabh.sharma7 at gmail.com  Thu Nov  1 15:36:35 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Thu, 1 Nov 2012 15:36:35 -0400
Subject: [Bioperl-l] blast question
Message-ID: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>

Hi All,
First of all, I am very sorry for posting a BLAST question in this
forum; I am not sure if this is the right place.
I would really appreciate it if anyone could point me in the right
direction.

I am using blastall to get the top hit from a database, so I am using
-v 1 -b 1 (I hope this is right).
But the strange part is that I am getting wrong results.

For example, if I use -v 1 -b 1, then for one of the hits I get this:


Sequences producing significant alignments:                      (bits) Value

fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   4e-04


If I use -v 3 -b 3, then I get this for the same query:

Sequences producing significant alignments:                      (bits) Value

fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...   570   e-167
fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   9e-07
fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...    18   1.0

As you can see, the top hit in the first case is totally wrong.
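
One possible workaround, sketched here under assumptions: request a few 
hits (as in the -v 3 -b 3 run) and pick the best one by bit score in 
Perl. The regular expression below is an ad hoc guess at the layout of 
the summary lines shown in this message, not a general BLAST parser; 
for real reports, Bio::SearchIO is the more robust route.

```perl
use strict;
use warnings;

# Summary lines copied from the -v 3 -b 3 output in this message.
my $report = <<'END';
fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...   570 e-167
fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38 9e-07
fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...    18 1.0
END

my @hits;
for my $line (split /\n/, $report) {
    # Hit ID, free-text description, bit score, E-value
    next unless $line =~ /^(\S+)\s+.*?\s+(\d+)\s+(\S+)\s*$/;
    push @hits, { id => $1, bits => $2, evalue => $3 };
}

# Highest bit score first; the first element is the top hit.
my ($top) = sort { $b->{bits} <=> $a->{bits} } @hits;
print "top hit: $top->{id} ($top->{bits} bits, E = $top->{evalue})\n";
```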

I would really appreciate it if someone could help me out or point me
in the right direction.

Thanks
Shalabh


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From cjfields at illinois.edu  Thu Nov  1 17:41:43 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Nov 2012 21:41:43 +0000
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>

That's a scary error, but the best place to submit this would be the BLAST help list at NCBI (cc'd)

chris



From shalabh.sharma7 at gmail.com  Fri Nov  2 10:50:17 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Fri, 2 Nov 2012 10:50:17 -0400
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
Message-ID: <CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>

I know; I am really worried about my past analyses now.
Thanks a lot for cc'ing this mail, Chris.

-Shalabh

On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J <cjfields at illinois.edu
> wrote:

> That's a scary error, but the best place to submit this would be the BLAST
> help list at NCBI (cc'd)
>
> chris


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From Scott.Markel at accelrys.com  Fri Nov  2 20:13:59 2012
From: Scott.Markel at accelrys.com (Scott Markel)
Date: Fri, 2 Nov 2012 17:13:59 -0700
Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB
 file format specification change
Message-ID: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>

In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html).  PDB now writes out to column 79, while pdb.pm is still using the old line length of 71.  Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated.

Some of the Perl lines are really simple, e.g.,

	$keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);

with others being just a little more detailed, e.g.,

	my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;

It doesn't look like pdb.pm has changed in about 1.5 years.  Is there a current module owner?  Or someone else working on this?

If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file.  Please let us know which is preferred.
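
As a rough sketch of the kind of change involved: the unpack template 
quoted above covers 6 + 6 + 4 + 2 + 1 + 51 = 70 columns, so anything 
written beyond column 70 is dropped. The record below is synthetic, and 
the widened template (A60, reaching column 79) is an assumption made 
for illustration; it is not a tested patch for pdb.pm.

```perl
use strict;
use warnings;

# Synthetic 79-column record, built with sprintf so the columns are exact:
# cols 1-6 record name, 7-12 blank, 13-16 sub-record, 17-18 continuation,
# col 19 blank, cols 20-79 text.
my $text = ('A' x 51) . ('B' x 9);    # 60 characters of text
my $line = sprintf '%-6s%6s%-4s%2s %-60s', 'JRNL', '', 'TITL', '', $text;

# Template quoted in the message: the text field stops at column 70.
my ($rec, $subr, $cont, $rol) = unpack 'A6 x6 A4 A2 x1 A51', $line;

# Hypothetical widened template: the text field runs to column 79.
my (undef, undef, undef, $rol_wide) = unpack 'A6 x6 A4 A2 x1 A60', $line;

printf "line length: %d\n", length $line;
printf "old template reads %d characters of text\n", length $rol;
printf "widened template reads %d characters of text\n", length $rol_wide;
```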

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications and Communications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics


From cjfields at illinois.edu  Fri Nov  2 22:08:52 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 3 Nov 2012 02:08:52 +0000
Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB
 file format specification change
In-Reply-To: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>
References: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC817F8@CHIMBX5.ad.uillinois.edu>

On Nov 2, 2012, at 7:13 PM, Scott Markel <Scott.Markel at accelrys.com> wrote:

> In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html).  PDB now writes out to column 79, while pdb.pm is still using the old line length of 71.  Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated.
> 
> Some of the Perl lines are really simple, e.g.,
> 
> 	$keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);
> 
> with others being just a little more detailed, e.g.,
> 
> 	my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;
> 
> It doesn't look like pdb.pm has changed in about 1.5 years.  Is there a current module owner?  Or someone else working on this?

No one has really taken ownership, so as far as I'm concerned it's open.  Any objections?

> If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file.  Please let us know which is preferred.

A new version of the file is fine if you have someone who can work on it.  We would also like to change relevant tests and documentation if there is time.


Thanks Scott!

chris


From Russell.Smithies at agresearch.co.nz  Sun Nov  4 16:00:37 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 5 Nov 2012 10:00:37 +1300
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>

What version of BLAST are you using?
There have been quite a few bug fixes, and I suspect any response from NCBI will suggest upgrading to the current version of BLAST+.


--Russell




From cjfields at illinois.edu  Sun Nov  4 17:13:37 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 4 Nov 2012 22:13:37 +0000
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>

That in fact is the recommendation (migrate to BLAST+).

chris

On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" <Russell.Smithies at agresearch.co.nz> wrote:

> What version of blast are you using?
> There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+


From florent.angly at gmail.com  Sun Nov  4 19:46:44 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 05 Nov 2012 10:46:44 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50920D59.4010307@gmail.com>
References: <50920D59.4010307@gmail.com>
Message-ID: <50970C74.7070605@gmail.com>

I am planning on merging the branch with master this week.
Best,
Florent




From cjfields at illinois.edu  Sun Nov  4 21:43:28 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 5 Nov 2012 02:43:28 +0000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50970C74.7070605@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>

Florent,

Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):

[cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t 
t/Seq/PrimarySeq.t .. 1/167 
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
---------------------------------------------------
t/Seq/PrimarySeq.t .. ok       
All tests successful.
Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
Result: PASS
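
A minimal sketch of catching such expected warnings inside a test so 
that they do not leak into the prove output. Test::Warn's warning_like 
would be the shorter route; this version uses only core modules (a 
local $SIG{__WARN__} handler plus Test::More). validate_seq() is a 
hypothetical stand-in for the real validation code, and its warning 
text only imitates the messages above.

```perl
use strict;
use warnings;
use Test::More tests => 2;

# Hypothetical stand-in for a validation routine that warns on bad input.
sub validate_seq {
    my ($seq) = @_;
    if ($seq =~ /([^A-Za-z]+)/) {
        warn "sequence doesn't validate, mismatch is $1\n";
        return 0;
    }
    return 1;
}

my @warnings;
my $ok;
{
    # Collect warnings instead of letting them print to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    $ok = validate_seq('ACGT!');
}

is($ok, 0, 'invalid sequence is rejected');
like($warnings[0], qr/mismatch is !/, 'expected warning was captured');
```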

chris

On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
 wrote:

> I am planning on merging the branch with master this week.
> Best,
> Florent


From shalabh.sharma7 at gmail.com  Mon Nov  5 12:03:38 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 12:03:38 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
Message-ID: <CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>

Hi All,
thanks for all your responses.

Currently I am using an old version of blastall, 2.2.22.

@Peter: I will update my BLAST installation and see if the problem
still exists. But I can't restrict my BLAST searches by E-value because
I work on environmental samples. I have to reduce the size of my BLAST
output files, as I am only interested in the top hit and my data sets
are really huge.

Thanks
Shalabh

On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J <cjfields at illinois.edu
> wrote:

> That in fact is the recommendation (migrate to BLAST+).
>
> chris
>
> On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" <
> Russell.Smithies at agresearch.co.nz> wrote:
>
> > What version of blast are you using?
> > There have been quite a few bug fixes and I suspect any responses from
> NCBI will suggest upgrading to the current version of blast+
> >
> >
> > --Russell
> >
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:
> bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma
> > Sent: Saturday, 3 November 2012 3:50 a.m.
> > To: Fields, Christopher J
> > Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov
> > Subject: Re: [Bioperl-l] blast question
> >
> > I know, i am really worried about my past analysis now.
> > Thanks a lot for cc'ing this mail Chris.
> >
> > -Shalabh
> >
> > On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:
> >
> >> That's a scary error, but the best place to submit this would be the
> >> BLAST help list at NCBI (cc'd)
> >>
> >> chris
> >>
> >> On Nov 1, 2012, at 2:36 PM, shalabh sharma <shalabh.sharma7 at gmail.com>
> >> wrote:
> >>
> >>> Hi All,
> >>> First of all, I am very sorry for posting a BLAST question in this
> >>> forum; I am not sure if this is the right place.
> >>> I would really appreciate it if anyone could point me in the right
> >>> direction.
> >>>
> >>> I am using blastall to get the top hit from a database, so I am using
> >>> -v 1 -b 1 (I hope this is right).
> >>> But the strange part is that I am getting wrong results.
> >>>
> >>> For example, if I use -v 1 -b 1, then for one of the hits I am getting
> >>> this:
> >>>
> >>> Sequences producing significant alignments:                      (bits) Value
> >>>
> >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38 4e-04
> >>>
> >>> If I use -v 3 -b 3, then I am getting this for the same query:
> >>>
> >>> Sequences producing significant alignments:                      (bits) Value
> >>>
> >>> fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...   570 e-167
> >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38 9e-07
> >>> fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...    18 1.0
> >>>
> >>> As you can see, the top hit in the first case is totally wrong.
> >>>
> >>> I would really appreciate it if someone could help me out or point me
> >>> in the right direction.
> >>>
> >>> Thanks
> >>> Shalabh
> >>>
> >>>
> >>>
> >>> --
> >>> Shalabh Sharma
> >>> Scientific Computing Professional Associate (Bioinformatics
> >>> Specialist) Department of Marine Sciences University of Georgia
> >>> Athens, GA 30602-3636
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >>
> >
> >
> > --
> > Shalabh Sharma
> > Scientific Computing Professional Associate (Bioinformatics Specialist)
> Department of Marine Sciences University of Georgia Athens, GA 30602-3636
> >
> > =======================================================================
> > Attention: The information contained in this message and/or attachments
> > from AgResearch Limited is intended only for the persons or entities
> > to which it is addressed and may contain confidential and/or privileged
> > material. Any review, retransmission, dissemination or other use of, or
> > taking of any action in reliance upon, this information by persons or
> > entities other than the intended recipients is prohibited by AgResearch
> > Limited. If you have received this message in error, please notify the
> > sender immediately.
> > =======================================================================
> >
>
>


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From Russell.Smithies at agresearch.co.nz  Mon Nov  5 16:04:07 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Tue, 6 Nov 2012 10:04:07 +1300
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>

If you're using an older version of blast there was a bug where not all results were returned - I think the limit was 10,000 hits?
Not usually a problem running basic queries but a big problem for environmental or metagenomic samples, or when aligning short reads.

--Russell



From shalabh.sharma7 at gmail.com  Mon Nov  5 16:09:03 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 16:09:03 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
Message-ID: <CAA7rn9ech--TYQdezH6fLLArjsTdypkNjSkQeO1LiaLTR1zoHQ@mail.gmail.com>

Hi All,
Thanks for all the suggestions. The problem was fixed by switching to the
latest BLAST+.
Thanks
Shalabh

On Mon, Nov 5, 2012 at 4:04 PM, Smithies, Russell <
Russell.Smithies at agresearch.co.nz> wrote:

> If you're using an older version of blast there was a bug where not all
> results were returned - I think the limit was 10,000 hits?
>
> Not usually a problem running basic queries but a big problem for
> environmental or metagenomic samples, or when aligning short reads.
>
> --Russell


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From florent.angly at gmail.com  Tue Nov  6 06:06:56 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Tue, 06 Nov 2012 21:06:56 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
Message-ID: <5098EF50.5040208@gmail.com>

Yes, good idea, Chris.

Actually, thinking about it, most of these warnings were redundant. So, 
I changed the behaviour of Bio::PrimarySeq::validate_seq() so that it 
issues exceptions if requested.

Florent


On 05/11/12 12:43, Fields, Christopher J wrote:
> Florent,
>
> Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):
>
> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t
> t/Seq/PrimarySeq.t .. 1/167
> --------------------- WARNING ---------------------
> MSG: Got a sequence without letters. Could not guess alphabet
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
> ---------------------------------------------------
> t/Seq/PrimarySeq.t .. ok
> All tests successful.
> Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
> Result: PASS
>
> chris
>
> On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
>   wrote:
>
>> I am planning on merging the branch with master this week.
>> Best,
>> Florent
>>
>>
>> On 01/11/12 15:49, Florent Angly wrote:
>>> Hi all,
>>>
>>> I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on Bioperl. Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were microbial genomes, i.e. several Mbp long sequences, which is quite long. A lot of the performance cost was associated with passing full genomes between functions. For example, when doing a call to length(), length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So, every call to length is very expensive for long sequences. And there is a lot of code that calls length(), for error checking.
>>>
>>> I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g. Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength
>>>
>>> I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to improve the documentation. In addition, to address the speed issue, I did some changes to Bio::PrimarySeq and Bio::PrimarySeqI :
>>> * The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no extra call to seq() (which would incur the cost of copying the entire sequence between functions).
>>> * The length is saved as an object attribute. So, calling length() is very cheap since it only needs to retrieve the stored value for the length.
>>> * There is a constructor option called -direct, which skips sequence validation. However, it was only active in conjunction with the -ref_to_seq option. To make -direct conform better to its documented purpose, I made -direct work when a sequence is set through -seq as well.
>>> * This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq object from an existing (already validated!) Bio::PrimarySeq object, the new object can be constructed with the -direct option, to save some time.
>>> * Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and these calls were easily replaced by simple calls to substr() to save some time.
>>>
>>> A real-world test I performed with Grinder took 3 min 28 s before the changes (of which ~1 min is spent doing something unrelated). After the changes, the same test took only 2 min 28 s. So, it's quite a significant improvement, and on more specific test cases performance gains can obviously be much bigger. Also, I anticipate that the gains would be bigger for even longer sequences.
>>>
>>> All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of Bioperl, so please, have a look at it and let me know if there are any glaring errors.
>>>
>>> Thanks,
>>>
>>> Florent
>>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
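
The length caching and the substr()-based subseq() described in the quoted
message above can be illustrated with a toy class. This is only a sketch of
the idea, not the actual Bio::PrimarySeq code: the package name CachedSeq and
its hash keys are hypothetical, and validation, alphabets and the -direct
option are left out entirely.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

package CachedSeq;    # hypothetical toy class, not Bio::PrimarySeq itself

sub new {
    my ( $class, %args ) = @_;
    my $self = bless { seq => '', length => 0 }, $class;
    $self->seq( $args{seq} ) if defined $args{seq};
    return $self;
}

# Getter/setter: the length is computed once, at the time the sequence is set.
sub seq {
    my ( $self, $value ) = @_;
    if ( defined $value ) {
        $self->{seq}    = $value;
        $self->{length} = CORE::length($value);
    }
    return $self->{seq};
}

# Returns the stored integer; the sequence string is never fetched or copied.
sub length {
    my ($self) = @_;
    return $self->{length};
}

# 1-based, inclusive coordinates, extracted with substr() rather than eval().
sub subseq {
    my ( $self, $start, $end ) = @_;
    die "Bad range $start..$end\n"
        if $start < 1 or $end > $self->{length} or $start > $end;
    return substr( $self->{seq}, $start - 1, $end - $start + 1 );
}

package main;

my $s = CachedSeq->new( seq => 'ACGTACGTAC' );
print $s->length, "\n";            # stored length, no call to seq()
print $s->subseq( 3, 6 ), "\n";    # bases 3 to 6
```

The design point is that length() only reads a stored number, so its cost no
longer grows with the size of the sequence.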


From shlomif at shlomifish.org  Tue Nov  6 07:27:00 2012
From: shlomif at shlomifish.org (Shlomi Fish)
Date: Tue, 6 Nov 2012 14:27:00 +0200
Subject: [Bioperl-l] [Request] Please Help Add Some Information about
 Perl for Bio-Informatics to http://perl-begin.org/uses/bio-info/
In-Reply-To: <20121026192203.6d1e59c0@lap.shlomifish.org>
References: <20121026192203.6d1e59c0@lap.shlomifish.org>
Message-ID: <20121106142700.192f456e@lap.shlomifish.org>

Hi,

Can anyone help with that?

Regards,

	Shlomi Fish

On Fri, 26 Oct 2012 19:22:03 +0200
Shlomi Fish <shlomif at shlomifish.org> wrote:

> Hi all,
> 
> I am the maintainer of http://perl-begin.org/ , the Perl Beginners' Site. I
> had this page there for a long time, but it's empty:
> 
> http://perl-begin.org/uses/bio-info/
> 
> Can someone help me add some information there? A short XHTML page will be OK.
> For reference, see the other pages in the section
> ( http://perl-begin.org/uses/ ) such as:
> 
> * http://perl-begin.org/uses/web/
> 
> * http://perl-begin.org/uses/sys-admin/
> 
> * http://perl-begin.org/uses/qa/
> 
> Note that you agree that the content will be licensed under the Creative
> Commons Attribution 3.0 Unported License (or higher versions) and so you
> should make sure it is original.
> 
> I shall be obliged for any help.
> 
> Regards,
> 
> 	Shlomi Fish
> 


-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
Perl Humour - http://perl-begin.org/humour/

A wiseman can learn from a fool much more than a fool can ever learn from a
wiseman.               - http://en.wikiquote.org/wiki/Cato_the_Elder

Please reply to list if it's a mailing list post - http://shlom.in/reply .


From florent.angly at gmail.com  Thu Nov 15 11:29:30 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Fri, 16 Nov 2012 02:29:30 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <5098EF50.5040208@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
	<5098EF50.5040208@gmail.com>
Message-ID: <50A5186A.4060304@gmail.com>

I have now merged the branch into master.
Best,
Florent



From mahakadry at aucegypt.edu  Tue Nov 20 13:44:53 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Tue, 20 Nov 2012 20:44:53 +0200
Subject: [Bioperl-l] Parsing a blast report with multiple queries into
 separate one query files that only contain the fasta sequences
Message-ID: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>

Dear BioPerl list,
I ran BLAST on a file containing several FASTA queries against nr. I need
to align each query with its hits for further computational analysis, so I
need to split the resulting BLAST report into several files, each containing
only one query sequence and its hits in FASTA format.
I found this script online:

use Bio::Search::Result::BlastResult;
use Bio::SearchIO;

my $report = Bio::SearchIO->new( -file => 'full-report.bls', -format => 'blast' );
my $result = $report->next_result;
my %hits_by_query;
while (my $hit = $result->next_hit) {
  push @{$hits_by_query{$hit->name}}, $hit;
}

foreach my $qid ( keys %hits_by_query ) {
  my $result = Bio::Search::Result::BlastResult->new();
  $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
  my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format => 'blast' );
  $blio->write_result($result);
}


However, running it produced the following error message:


BlastResult::new(): Not adding iterations.

------------- EXCEPTION: Bio::Root::NoSuchThing -------------
MSG: No such iteration number: 0. Valid range=1-0
VALUE: The number zero (0)
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
STACK: Bio::Search::Result::BlastResult::iteration
/usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
STACK: Bio::Search::Result::BlastResult::add_hit
/usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
STACK: ./parsing.blast.results.into.per.query.files.pl:15

I tried to search for other scripts but I couldn't find any.
I would really appreciate your comments on this.
Thank you


From cjfields at illinois.edu  Tue Nov 20 14:21:25 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Nov 2012 19:21:25 +0000
Subject: [Bioperl-l] Parsing a blast report with multiple queries into
 separate one query files that only contain the fasta sequences
In-Reply-To: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>
References: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF22E15@CITESMBX5.ad.uillinois.edu>

Maha,

Do you need only the sequence reported in the report (e.g. the HSP alignments) or the original FASTA sequences?  

The former can be recovered from the Bio::Search::HSP::GenericHSP objects as an alignment, and this can be redirected to a FASTA file.  The latter is a little trickier, as you will have to retrieve the sequences from their original source files.  
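
A minimal sketch of the first option, assuming a plain-text BLAST report that holds several queries. It loops over every result (one per query) and writes the aligned query segment plus the aligned segment of each hit to a per-query FASTA file. The fasta_record helper, the output file naming, the use of only the first HSP per hit and the stripping of gap characters are all assumptions made for illustration. Bio::SearchIO is loaded only when a report is actually given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: format one FASTA record, dropping alignment gaps
# and wrapping the sequence at 60 columns.
sub fasta_record {
    my ($id, $seq) = @_;
    $seq =~ tr/-//d;                      # remove gap characters
    my $out = ">$id\n";
    $out .= "$1\n" while $seq =~ /(.{1,60})/g;
    return $out;
}

# Write one FASTA file per query: the aligned part of the query,
# followed by the aligned part of each hit (first HSP of each hit only).
sub split_report {
    my ($report_file) = @_;
    require Bio::SearchIO;                # loaded lazily, needs BioPerl
    my $report = Bio::SearchIO->new(-file => $report_file, -format => 'blast');
    while (my $result = $report->next_result) {   # one result per query
        my $qid = $result->query_name;
        (my $fname = $qid) =~ s/[^\w.-]/_/g;      # make a safe file name
        open my $out, '>', "$fname.fasta"
            or die "Cannot write $fname.fasta: $!";
        my $query_written = 0;
        while (my $hit = $result->next_hit) {
            my $hsp = $hit->next_hsp or next;
            print {$out} fasta_record($qid, $hsp->query_string)
                unless $query_written++;
            print {$out} fasta_record($hit->name, $hsp->hit_string);
        }
        close $out;
    }
}

# Usage: perl split_report.pl full-report.bls
split_report($ARGV[0]) if @ARGV;
```

Note that this only recovers the aligned segments shown in the report. Full-length sequences would still have to come from the original source files.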

chris

On Nov 20, 2012, at 12:44 PM, maha ahmed <mahakadry at aucegypt.edu> wrote:

> Dear BioPerl list,
> I blasted a file that has several fasta queries against nr, however I need
> to align each query with its hits for further computational analysis so I
> need to parse the produced blast report into several files that each has
> only the fasta query sequence and its hits in fasta format.
> I found this script online,
> 
> use Bio::Search::Result::BlastResult;
> use Bio::SearchIO;
>
> my $report = Bio::SearchIO->new( -file => 'full-report.bls', -format => 'blast' );
> my $result = $report->next_result;
> my %hits_by_query;
> while (my $hit = $result->next_hit) {
>   push @{$hits_by_query{$hit->name}}, $hit;
> }
>
> foreach my $qid ( keys %hits_by_query ) {
>   my $result = Bio::Search::Result::BlastResult->new();
>   $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
>   my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format => 'blast' );
>   $blio->write_result($result);
> }
> 
> 
> 
> however on using it this produced the following error message
> 
> 
> 
> BlastResult::new(): Not adding iterations.
> 
> ------------- EXCEPTION: Bio::Root::NoSuchThing -------------
> MSG: No such iteration number: 0. Valid range=1-0
> VALUE: The number zero (0)
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
> STACK: Bio::Search::Result::BlastResult::iteration
> /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
> STACK: Bio::Search::Result::BlastResult::add_hit
> /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
> STACK: ./parsing.blast.results.into.per.query.files.pl:15
> 
> I tried to search for other scripts but I couldn't find any
> I would really appreciate your comments to this
> Thank you


From rfhorns at gmail.com  Thu Nov  1 20:01:34 2012
From: rfhorns at gmail.com (Felix Horns)
Date: Fri, 02 Nov 2012 00:01:34 -0000
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
Message-ID: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>

Hello everyone.

I am having trouble using the get_Stream_by_query() function
in Bio::DB::GenBank.  It seems to return an empty stream, such that
$stream->next_seq never returns anything.

However, $query->count is returning the expected value (139).  Also,
get_Stream_by_query() seems to be querying the database, as when I pass it
an array of GeneIDs that have not been properly formatted, i.e.
GeneID:7816864, instead of simply 7816864, it returns warnings and errors:
"MSG: Warning(s) from GenBank: <PhraseNotFound>GeneID 7817709...; MSG:
Error from Genbank: No items found.".

I have included my full code below. I have also included the output from
the code below that.  The code is intended to find genes located within a
genomic region. I will later find the protein domains and pathways that
those genes are involved in.

Any help would be greatly appreciated.  I realize that this is probably a
very simple question, but I am relatively new to BioPerl and I've spent the
better part of the day trying to figure out such issues, so I would be very
thankful for help.

Felix


#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::DB::EntrezGene;
use Bio::DB::GenBank;

# Load reference sequence
# Load from local .gb file
# Note that .gb file does not include sequences
# my $gbfile = "NC_012660.1.gb";
# my $seqio = Bio::SeqIO->new(-file => $gbfile);
# my $ref_seq = $seqio->next_seq;

# Access the reference sequence programmatically (alternative to the local file above)
my $gb = new Bio::DB::GenBank;
my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");

# Specify coordinates of gap
my $gap_start = 2050506;
my $gap_end = 2190530;

my $gene_count = 0;
my @features;
my @starts;
my @ends;
my @db_xrefs;

my @products;
my @protein_ids;

# Get gene features in gap
for my $feat ($ref_seq->get_SeqFeatures) {
  my $start=$feat->location->start;
  my $end=$feat->location->end;

  if (($feat->primary_tag eq 'gene') &&
      ($gap_start < $start) && ($start < $gap_end) &&
      ($gap_start < $end) && ($end < $gap_end)) {

    $gene_count += 1;

    # Get GeneID reference
    my $db_xref = ($feat->get_tag_values('db_xref'))[0];
    $db_xref =~ s/GeneID://;    # Trim "GeneID:" from start of $db_xref

    push @features, $feat;
    push @starts, $start;
    push @ends, $end;
    push @db_xrefs, $db_xref;
  }
}

# Get data about gene features from GeneID reference
my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
 -ids => [@db_xrefs]);
my $stream = $gb->get_Stream_by_query($query);

while (my $seq = $stream->next_seq) {
  for my $feat ($seq->all_SeqFeatures) {
    print "primary tag: ", $feat->primary_tag, "\n";
    for my $tag ($feat->get_all_tags) {
      print "  tag: ", $tag, "\n";
      for my $value ($feat->get_tag_values($tag)) {
        print "    value: ", $value, "\n";
      }
    }
  }
}

print $query->count,"\n";
print $gene_count, "\n";


OUTPUT
> perl analyze_gap.pl
139
139

Note that no "primary tag; tag; value" items are printed.  Furthermore,
when I put a print line immediately after the (while (my $seq =
$stream->next_seq)) statement, it was never called, seemingly indicating
that the stream is empty.


From mooldhu at gmail.com  Tue Nov  6 02:38:57 2012
From: mooldhu at gmail.com (胡江)
Date: Tue, 6 Nov 2012 15:38:57 +0800
Subject: [Bioperl-l] Ask for help about Bioperl
Message-ID: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>

Hi,
when I use BioPerl, it reports errors like this:

--------------------- WARNING ---------------------
MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
---------------------------------------------------
Error providing evidence type: GeneModel
The error was:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Attempting to set the sequence '1' to
[)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383


But I am sure that the input file only contains [ATGCN]. I also tried
other sequences, but the errors are the same. My BioPerl is bioperl-live
1.006902.

-- 
胡江


From assayagy at gmail.com  Sat Nov 10 13:27:03 2012
From: assayagy at gmail.com (eyla4ever)
Date: Sat, 10 Nov 2012 10:27:03 -0800 (PST)
Subject: [Bioperl-l] Extracting sequences from Genbank files
In-Reply-To: <CAJLmuDKPBA_DUtnfQjcAXr+JU=nL+orBP4cCgwBi=4BWQiYmpw@mail.gmail.com>
References: <CAJLmuDKPBA_DUtnfQjcAXr+JU=nL+orBP4cCgwBi=4BWQiYmpw@mail.gmail.com>
Message-ID: <34664632.post@talk.nabble.com>


Hello Brian,

I would like you to send me your script. I think it can help me solve a
big problem and finish my final project.
I hope that will be possible.

Regards, Eyla


BForde wrote:
> 
> Hello,
> 
> I have been modifying a script which extracts all the protein sequences
> from a genbank file and saves them in a multi-fasta file.
> 
> I wish the fasta header to have both the locus_tag of the protein and the
> product. However I cannot get the  product tag to write to the fasta
> header
> 
> this is the relevant section of the script
> 
>  $s->display_id($f->has_tag('locus_tag') ? join(',',sort
> $f->each_tag_value('locus_tag')) :
>                            $f->has_tag('product') ?
> join(',',$f->each_tag_value('product')):
>                            $s->display_id);
> 
> is "product" not an actual tag
> 
> regards
> 
> Brian
> 
> 
> 
> -- 
> Brian Forde
> Microbiology Dept.
> Bioscience Institute. Room 4.11
> University College Cork
> Cork
> Ireland
> tel:+353 21 4901306
> email: b.m.forde at umail.ucc.ie
> 
> 

-- 
View this message in context: http://old.nabble.com/Extracting-sequences-from-Genbank-files-tp33901023p34664632.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From bosborne11 at verizon.net  Tue Nov 20 18:50:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:50:00 -0500
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
In-Reply-To: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>
References: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>
Message-ID: <5F077DEA-DEBD-42BC-87E7-327697764CFE@verizon.net>

Felix,

I took a look at the Bio::DB::Query::GenBank documentation, it says this:

If you provide an array reference of IDs in -ids, the query will be ignored and the list of IDs will be used when the query is passed to a Bio::DB::GenBank object's get_Stream_by_query() method. 

Bio::DB::GenBank queries "nucleotide" by default. You have GeneIDs. I see that you're setting "-db" to "gene", but note that you're passing that query to a plain Bio::DB::GenBank object. Not sure what the expected behavior is here.

I would try using the NCBI Eutilities for that second query, rather than Bio::DB::Query::GenBank (http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook).
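
As a rough sketch of that suggestion, assuming Bio::DB::EUtilities is installed (it may ship separately from the core BioPerl distribution) and that document summaries are what is wanted for each Gene ID. The clean_gene_ids helper, the script name and the e-mail address on the command line are hypothetical; the esummary calls follow the pattern shown in the EUtilities Cookbook, and I have not run this against the live NCBI service.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn db_xref values such as "GeneID:7816864"
# into bare numeric Gene IDs, skipping anything else.
sub clean_gene_ids {
    my @ids;
    for my $xref (@_) {
        push @ids, $1 if $xref =~ /^(?:GeneID:)?(\d+)$/;
    }
    return @ids;
}

# Fetch and print the document summary for each Gene ID.
sub print_gene_summaries {
    my ($email, @ids) = @_;
    require Bio::DB::EUtilities;          # loaded lazily, needs network access
    my $factory = Bio::DB::EUtilities->new(
        -eutil => 'esummary',
        -email => $email,
        -db    => 'gene',
        -id    => \@ids,
    );
    while (my $ds = $factory->next_DocSum) {
        print "ID: ", $ds->get_id, "\n";
        while (my $item = $ds->next_Item('flattened')) {
            my $content = $item->get_content;
            next unless defined $content && length $content;
            printf "  %-20s: %s\n", $item->get_name, $content;
        }
    }
}

# Usage: perl gene_summaries.pl EMAIL GeneID:7816864 7817709 ...
if (@ARGV > 1) {
    my ($email, @xrefs) = @ARGV;
    print_gene_summaries($email, clean_gene_ids(@xrefs));
}
```

In the script above, the values collected in @db_xrefs could be passed through clean_gene_ids() instead of the single s/GeneID:// substitution, which would also drop any db_xref that is not a Gene ID.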

Brian O.

On Nov 1, 2012, at 8:01 PM, Felix Horns <rfhorns at gmail.com> wrote:

> Hello everyone.
> 
> I am having trouble using the get_Stream_by_query() function
> in Bio::DB::GenBank.  It seems to return an empty stream, such that
> $stream->next_seq never returns anything.
> 
> However, $query->count is returning the expected value (139).  Also,
> get_Stream_by_query() seems to be querying the database, as when I pass it
> an array of GeneIDs that have not been properly formatted, i.e.
> GeneID:7816864, instead of simply 7816864, it returns warnings and errors:
> "MSG: Warning(s) from GenBank: <PhraseNotFound>GeneID 7817709...; MSG:
> Error from Genbank: No items found.".
> 
> I have included my full code below. I have also included the output from
> the code below that.  The code is intended to find genes located within a
> genomic region. I will later find the protein domains and pathways that
> those genes are involved in.
> 
> Any help would be greatly appreciated.  I realize that this is probably a
> very simple question, but I am relatively new to BioPerl and I've spent the
> better part of the day trying to figure out such issues, so I would be very
> thankful for help.
> 
> Felix
> 
> 
> #!/usr/bin/perl
> use strict;
> use Bio::SeqIO;
> use Bio::DB::EntrezGene;
> use Bio::DB::GenBank;
> 
> # Load reference sequence
> # Load from local .gb file
> # Note that .gb file does not include sequences
> # my $gbfile = "NC_012660.1.gb";
> # my $seqio = Bio::SeqIO->new(-file => $gbfile);
> # my $ref_seq = $seqio->next_seq;
> 
> # To access reference sequence programatically, uncomment this code
> my $gb = new Bio::DB::GenBank;
> my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");
> 
> # Specify coordinates of gap
> my $gap_start = 2050506;
> my $gap_end = 2190530;
> 
> my $gene_count = 0;
> my @features;
> my @starts;
> my @ends;
> my @db_xrefs;
> 
> my @products;
> my @protein_ids;
> 
> # Get gene features in gap
> for my $feat ($ref_seq->get_SeqFeatures) {
>  my $start=$feat->location->start;
>  my $end=$feat->location->end;
> 
>  if (($feat->primary_tag eq 'gene') &
>      ($gap_start < $start) & ($start < $gap_end) &
>      ($gap_start < $end) & ($end < $gap_end)) {
> 
>    $gene_count += 1;
> 
>    # Get GeneID reference
>    my $db_xref = ($feat->get_tag_values('db_xref'))[0];
>    $db_xref =~ s/GeneID://;    # Trim "GeneID:" from start of $db_xref
> 
>    push @features, $feat;
>    push @starts, $start;
>    push @ends, $end;
>    push @db_xrefs, $db_xref;
>  }
> }
> 
> # Get data about gene features from GeneID reference
> my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
> -ids => [@db_xrefs]);
> my $stream = $gb->get_Stream_by_query($query);
> 
> while (my $seq = $stream->next_seq) {
>  for my $feat ($seq->all_SeqFeatures) {
>    print "primary tag: ", $feat->primary_tag, "\n";
>    for my $tag ($feat->get_all_tags) {
>      print "  tag: ", $tag, "\n";
>      for my $value ($feat->get_tag_values($tag)) {
> print "    value: ", $value, "\n";
>      }
>    }
>  }
> }
> 
> print $query->count,"\n";
> print $gene_count, "\n";
> 
> 
> OUTPUT
>> perl analyze_gap.pl
> 139
> 139
> 
> Note that no "primary tag; tag; value" items are printed.  Furthermore,
> when I put a print line immediately after the (while (my $seq =
> $stream->next_seq)) statement, it was never called, seemingly indicating
> that the stream is empty.


From bosborne11 at verizon.net  Tue Nov 20 18:52:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:52:00 -0500
Subject: [Bioperl-l] Ask for help about Bioperl
In-Reply-To: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>
References: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>
Message-ID: <3B472C83-7F6C-41E2-A629-E6A2BDC6B075@verizon.net>

胡江,

You're going to have to show us your code; we can't help you just by seeing the error messages. Show us the input file as well, or the beginning of it.

Brian O.


On Nov 6, 2012, at 2:38 AM, 胡江 <mooldhu at gmail.com> wrote:

> hi,
> when I use bioperl ,it report errors like this :---------------------
> WARNING ---------------------
> MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
> ---------------------------------------------------
> Error providing evidence type: GeneModel
> The error was:
> 
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Attempting to set the sequence '1' to
> [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
> STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
> STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
> STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383
> 
> 
> but,I am sure that the input file only cotain [ATGCN],I also try to use
> another sequences ,but the errors are the same.my bioperl is Bioperl-live
> 1.006902;
> 
> -- 
> 胡江
> 


From hlapp at drycafe.net  Tue Nov 20 21:24:50 2012
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Tue, 20 Nov 2012 21:24:50 -0500
Subject: [Bioperl-l] handle with file in perl
In-Reply-To: <34626730.post@talk.nabble.com>
References: <34626730.post@talk.nabble.com>
Message-ID: <1DE09B34-5124-478C-8925-0045EC119CFC@drycafe.net>

This sounds like a homework assignment. We're not here to do your homework or assignments for you. You can post if you run into a specific problem when solving your assignment with Bioperl, and we'll help with that. 

-hilmar

Sent with a tap.

On Oct 31, 2012, at 7:45 PM, eyla4ever <assayagy at gmail.com> wrote:

> 
> hi 
> 
> i want to write a function that get as parameters : file_name, hsp , hit.
> and i want her to print all the blast Field that i need to this file.
> 
> i do it because i have 2 files with the same Fields.
>        
> 
> 10X
> -- 
> View this message in context: http://old.nabble.com/handle-with-file-in-perl-tp34626730p34626730.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 


From mahakadry at aucegypt.edu  Fri Nov 23 20:33:59 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 03:33:59 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <CAE=MQgzV7Z-bQxiQFDuHHaEr=XL5=cTc6Mhq1Su7VPByZNEP6Q@mail.gmail.com>

Dear Bioperl list,
I have a folder that has 60,000 files (one file for each phylogenetic tree).
However, I only need to work with a subset of 1,000 files from that folder
(the files are not numbered in order, so I can't use an i++ loop in my
BioPerl script).
Is there a way to write a script that only moves the files whose names are
given in a list in a text file?
I.e., I have a file that lists the names of the files I want to copy from the
folder, and I want to write a script that does this.
Thank you so much


From kellert at ohsu.edu  Sat Nov 24 13:08:11 2012
From: kellert at ohsu.edu (Tom Keller)
Date: Sat, 24 Nov 2012 10:08:11 -0800
Subject: [Bioperl-l] use cookbook to work with a directory of files
In-Reply-To: <mailman.7.1353776405.32614.bioperl-l@lists.open-bio.org>
References: <mailman.7.1353776405.32614.bioperl-l@lists.open-bio.org>
Message-ID: <C969FE0E-18FE-4771-B031-22EEA42AEA77@ohsu.edu>

A search with the phrase "perl cookbook filenames from directory" should help you find what you need.

On Nov 24, 2012, at 9:00 AM, bioperl-l-request at lists.open-bio.org wrote:

> 
> Message: 1
> Date: Sat, 24 Nov 2012 03:33:59 +0200
> From: maha ahmed <mahakadry at aucegypt.edu>
> Subject: [Bioperl-l] retrieving a subset of files from a folder
> To: Bioperl-l at lists.open-bio.org
> Message-ID:
> 	<CAE=MQgzV7Z-bQxiQFDuHHaEr=XL5=cTc6Mhq1Su7VPByZNEP6Q at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Dear Bioperl list,
> I have a folder that has 60,000 files (one file for each phylogenetic tree)
> However I only need to work with a subset of 1,000 files from that folder
> (the files are not numbered in order so I cant use the i++ loop in my
> bioperl script)
> Is there a way to write a script that only moves files with the names given
> in a list in a text file
> i.e. I have a file that has the names of the files I want to copy fro m the
> folder and I want to write script that does this
> Thank you so much
> 
> 


From minou.nowrousian at rub.de  Sat Nov 24 13:24:02 2012
From: minou.nowrousian at rub.de (Minou Nowrousian)
Date: 24 Nov 2012 19:24:02 +0100
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <000001cdca70$e1a97720$a4fc6560$@rub.de>


>Dear Bioperl list,
>I have a folder that has 60,000 files (one file for each phylogenetic tree)
>However I only need to work with a subset of 1,000 files from that folder
>(the files are not numbered in order so I cant use the i++ loop in my
>bioperl script)
>Is there a way to write a script that only moves files with the names given
>in a list in a text file
>i.e. I have a file that has the names of the files I want to copy from the
>folder and I want to write script that does this
>Thank you so much

I don't know if there is a BioPerl solution, but you could use the
File::Copy module (part of the core Perl distribution):

use File::Copy;
copy("path_to_file_you_want_to_copy","path_to_target_file")
    or die "Copy failed: $!";
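
To apply that to a list of names, something along these lines might work. It is only a sketch: the copy_listed_files function, the script name and the three command-line arguments are made up for illustration, and it assumes the list file holds one file name per line, relative to the source folder.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Spec;

# Copy every file named in $list_file (one name per line) from
# $src_dir to $dest_dir. Returns the names that were copied.
sub copy_listed_files {
    my ($list_file, $src_dir, $dest_dir) = @_;
    open my $list, '<', $list_file or die "Cannot read $list_file: $!";
    my @copied;
    while (my $name = <$list>) {
        $name =~ s/^\s+|\s+$//g;          # trim whitespace and the newline
        next unless length $name;         # skip blank lines
        my $from = File::Spec->catfile($src_dir, $name);
        unless (-f $from) {
            warn "Skipping $name: not found in $src_dir\n";
            next;
        }
        copy($from, File::Spec->catfile($dest_dir, $name))
            or die "Copy failed for $name: $!";
        push @copied, $name;
    }
    close $list;
    return @copied;
}

# Usage: perl copy_listed.pl file_list.txt /path/to/trees /path/to/subset
if (@ARGV == 3) {
    my @done = copy_listed_files(@ARGV);
    print scalar(@done), " files copied\n";
}
```

File::Copy also exports move(), which could replace copy() here if the files should be moved rather than copied.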

Regards,
Minou


From mahakadry at aucegypt.edu  Sat Nov 24 14:04:09 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 21:04:09 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
In-Reply-To: <000001cdca70$e1a97720$a4fc6560$@rub.de>
References: <000001cdca70$e1a97720$a4fc6560$@rub.de>
Message-ID: <CAE=MQgztf_isVyt=WPF9LMXCtX4Q2U9vHL1AV+TwpueUjKayuw@mail.gmail.com>

Thanks everyone, I actually found a one-line command that I am going to
try:
xargs -a file_list.txt mv -t /path/to/des
Thanks for your help. I will have a look at the readings you suggested.
Thank you

On Sat, Nov 24, 2012 at 8:24 PM, Minou Nowrousian
<minou.nowrousian at rub.de>wrote:

>
> >Dear Bioperl list,
> >I have a folder that has 60,000 files (one file for each phylogenetic
> tree)
> However I only need to work with a subset of 1,000 files from that folder
> >(the files are not numbered in order so I cant use the i++ loop in my
> bioperl script) Is there a way to write a script that only moves files with
> the >names given in a list in a text file i.e. I have a file that has the
> names of the files I want to copy fro m the folder and I want to write
> script that does >this Thank you so much
>
> I don't know if there is a BioPerl solution, but you could use the
> File::Copy module (available from CPAN):
>
> use File::Copy;
>  copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy
> failed: $!";
>
> Regards,
> Minou
>
>
>
>


From maj at fortinbras.us  Tue Nov 27 08:49:46 2012
From: maj at fortinbras.us (Mark A. Jensen)
Date: Tue, 27 Nov 2012 13:49:46 +0000
Subject: [Bioperl-l] Neo4j : applying user defined validation and constraints
Message-ID: <W2391426705276111354024186@webmail57>

Hi Folks,
Since there was some enthusiasm about REST::Neo4p, my interface to Neo4j, I thought I would let you know about
https://metacpan.org/module/REST::Neo4p::Constrain
This is a framework that lets you apply constraints on node and relationship properties, relationships, and relationship types. You can specify your constraints, and have REST::Neo4p throw exceptions when the constraints aren't met, or you can do validation on existing database items. The pod has a full explanation and examples aplenty.

Please have a look and send bugs my way via RT.
Cheers all,
MAJ


From francescomusacchia at gmail.com  Wed Nov 28 05:27:16 2012
From: francescomusacchia at gmail.com (Francesco Musacchia)
Date: Wed, 28 Nov 2012 02:27:16 -0800 (PST)
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
Message-ID: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>

Hi all,
I have a big problem using a GFF3 database with BioPerl. This is not a 
question about how to write some BioPerl code. When I have to do a lot of 
accesses on a GFF database (with Bio::DB::SeqFeature::Store), things get 
slower and slower, to the point where my script can run for more than a day.

How can I solve this? Or can it not be done?

Thanks a lot!


From florent.angly at gmail.com  Thu Nov  1 01:49:13 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Thu, 01 Nov 2012 15:49:13 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
Message-ID: <50920D59.4010307@gmail.com>

Hi all,

I was working with Ben Woodcroft on identifying ways to speed up 
Grinder, which relies heavily on Bioperl. Ben did some profiling with 
NYTProf and we realized that a lot of computation time was spent in 
Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we 
used for the profiling were microbial genomes, i.e. several Mbp long 
sequences, which is quite long. A lot of the performance cost was 
associated with passing full genomes between functions. For example, 
when doing a call to length(), length() requests the full sequence from 
seq(), which returns it back to length() (it makes a copy!). So, every 
call to length is very expensive for long sequences. And there is a lot 
of code that calls length(), for error checking.

I know that there are a few Bioperl modules that are more adapted to 
handling very long sequences, e.g. Bio::DB::Fasta or 
Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at 
Bio::PrimarySeq with Ben and I released this commit: 
https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
But in fact, there were more things that I wanted to try to improve, 
which led me to start this new branch: 
https://github.com/bioperl/bioperl-live/tree/seqlength

I wrote quite a few tests for functionalities that were not previously 
covered by tests, and tried to improve the documentation. In addition, 
to address the speed issue, I did some changes to Bio::PrimarySeq and 
Bio::PrimarySeqI :
* The length of a sequence is now computed as soon as the sequence is 
set, not after. This way, there is no extra call to seq() (which would 
incur the cost of copying the entire sequence between functions).
* The length is saved as an object attribute. So, calling length() is 
very cheap since it only needs to retrieve the stored value for the length.
* There is a constructor argument called -direct, which skips sequence 
validation. However, it was only active in conjunction with the 
-ref_to_seq argument. To make -direct conform better to its 
documented purpose, I made -direct work when a sequence is set 
through -seq as well.
* This brings us to trunc(), revcom() and other methods of 
Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq 
object from an existing (already validated!) Bio::PrimarySeq object, the 
new object can be constructed with the -direct argument, to save some 
time.
* Finally, I noticed that subseq() used calls to eval() to do its work. 
eval() is notoriously slow and these calls were easily replaced by 
simple calls to substr() to save some time.
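
To illustrate the first, second and last points, here is a toy sketch. TinySeq is a hypothetical stand-in class, not the actual Bio::PrimarySeq code: it only shows the length being stored when the sequence is set, length() returning the stored value, and subseq() using a plain substr() call with 1-based inclusive coordinates.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# TinySeq is a hypothetical stand-in for a sequence class; it is not
# Bio::PrimarySeq, it only mimics the ideas described above.
package TinySeq;

sub new {
    my ($class, %args) = @_;
    my $self = bless { seq => '', length => 0 }, $class;
    $self->seq($args{seq}) if defined $args{seq};
    return $self;
}

# Setter/getter: the length is computed once, when the sequence is set.
sub seq {
    my ($self, $new) = @_;
    if (defined $new) {
        $self->{seq}    = $new;
        $self->{length} = CORE::length($new);
    }
    return $self->{seq};
}

# Cheap: returns the stored value, never touches the sequence string.
sub length { return $_[0]{length} }

# 1-based, inclusive coordinates, extracted with a plain substr() call.
sub subseq {
    my ($self, $start, $end) = @_;
    die "Bad coordinates $start..$end\n"
        if $start < 1 || $end > $self->{length} || $start > $end;
    return substr $self->{seq}, $start - 1, $end - $start + 1;
}

package main;

my $s = TinySeq->new(seq => 'ATGCATGCAT');
print $s->length, "\n";          # prints 10
print $s->subseq(3, 6), "\n";    # prints GCAT
```

The getter still returns a copy of the string, as before. The gain in this sketch comes from length() and the coordinate checks no longer needing that copy.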

A real-world test I performed with Grinder took 3min28s before the changes 
(of which ~1 min is spent doing something unrelated). After the changes, the 
same test took only 2min28s. So, it's quite a significant improvement, 
and on more specific test cases, performance gains can obviously be much 
bigger. Also, I anticipate that the gains would be bigger for even 
longer sequences.

All the changes I made are meant to be backward compatible and all the 
tests in the Bioperl test suite passed. So, there _should_ not be any 
issues. However, I know that Bio::PrimarySeq is a central module of 
Bioperl, so please, have a look at it and let me know if there are any 
glaring errors.

Thanks,

Florent


From shalabh.sharma7 at gmail.com  Thu Nov  1 15:36:35 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Thu, 1 Nov 2012 15:36:35 -0400
Subject: [Bioperl-l] blast question
Message-ID: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>

Hi All,
          First of all i am really very sorry for posting blast question in
this forum, I am not sure if this is the right place.
I will really appreciate if anyone can guide me to the right direction.

I am using blastall to get a top hit from a database so i am using -v 1 -b
1 (i hope this is right).
But the strange part is that i am getting wrong results.

for example: if i use -v 1 -b 1 then for one of the hit i am getting this:


Sequences producing significant alignments:                      (bits) Value

fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   4e-04


If i use -v 3 -b 3 then i am getting this for the same query:

Sequences producing significant alignments:                      (bits) Value

fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...   570   e-167
fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   9e-07
fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...    18   1.0

As you can see the top hit in the first case is totally wrong.

I would really appreciate if someone can help me out, or direct to in the
right direction.

Thanks
Shalabh


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From cjfields at illinois.edu  Thu Nov  1 17:41:43 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Nov 2012 21:41:43 +0000
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>

That's a scary error, but the best place to submit this would be the BLAST help list at NCBI (cc'd)

chris


From shalabh.sharma7 at gmail.com  Fri Nov  2 10:50:17 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Fri, 2 Nov 2012 10:50:17 -0400
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
Message-ID: <CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>

I know; I am really worried about my past analyses now.
Thanks a lot for cc'ing this mail, Chris.

-Shalabh


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From Scott.Markel at accelrys.com  Fri Nov  2 20:13:59 2012
From: Scott.Markel at accelrys.com (Scott Markel)
Date: Fri, 2 Nov 2012 17:13:59 -0700
Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB
 file format specification change
Message-ID: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>

In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html).  PDB now writes out to column 79, while pdb.pm is still using the old line length of 71.  Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated.

Some of the Perl lines are really simple, e.g.,

	$keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);

with others being just a little more detailed, e.g.,

	my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;
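
As a minimal sketch of the kind of change involved (not an actual pdb.pm patch): the unpack template above consumes 19 columns of fixed fields and then reads 51 characters of free text, so it stops at column 70. Assuming the text field now runs to column 79, as described above, the last width would become 60. The helper names parse_old and parse_new and the sample JRNL line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only; not part of pdb.pm.

# Old layout: record name (cols 1-6), 6 skipped, sub-record (13-16),
# continuation (17-18), 1 skipped, free text in cols 20-70 (51 chars).
sub parse_old { return unpack "A6 x6 A4 A2 x1 A51", $_[0] }

# Assumed new layout: same fields, free text in cols 20-79 (60 chars).
sub parse_new { return unpack "A6 x6 A4 A2 x1 A60", $_[0] }

# Made-up 79-column record: fixed-width prefix plus 60 characters of text.
my $text = join '', map { chr(ord('A') + $_ % 26) } 0 .. 59;
my $line = sprintf "%-6s%6s%-4s%2s %s", 'JRNL', '', 'TITL', '', $text;

my ($rec, $subr, $cont, $old_text) = parse_old($line);
my (undef, undef, undef, $new_text) = parse_new($line);

printf "line length: %d\n", length $line;
printf "old template keeps %d chars, new template keeps %d chars\n",
    length $old_text, length $new_text;
```

With the old template, the last nine characters of the made-up title are silently dropped, which matches the truncation symptom described for journal titles.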

It doesn't look like pdb.pm has changed in about 1.5 years.  Is there a current module owner?  Or someone else working on this?

If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file.  Please let us know which is preferred.

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications and Communications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics


From cjfields at illinois.edu  Fri Nov  2 22:08:52 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 3 Nov 2012 02:08:52 +0000
Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB
 file format specification change
In-Reply-To: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>
References: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC817F8@CHIMBX5.ad.uillinois.edu>

On Nov 2, 2012, at 7:13 PM, Scott Markel <Scott.Markel at accelrys.com> wrote:

> In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html).  PDB now writes out to column 79, while pdb.pm is still using the old line length of 71.  Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated.
> 
> Some of the Perl lines are really simple, e.g.,
> 
> 	$keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);
> 
> with others being just a little more detailed, e.g.,
> 
> 	my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;
> 
> It doesn't look like pdb.pm has changed in about 1.5 years.  Is there a current module owner?  Or someone else working on this?

No one has really taken ownership, so as far as I'm concerned it's open.  Any objections?

> If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file.  Please let us know which is preferred.

A new version of the file is fine if you have someone who can work on it.  We would also like to change relevant tests and documentation if there is time.

> Scott
> 
> Scott Markel, Ph.D.
> Principal Bioinformatics Architect  email:  smarkel at accelrys.com
> Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
> 10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
> San Diego, CA 92121                 fax:    +1 858 799 5222
> USA                                 web:    http://www.accelrys.com
> 
> http://www.linkedin.com/in/smarkel
> Secretary, Board of Directors:
>     International Society for Computational Biology
> Chair: ISCB Publications and Communications Committee
> Associate Editor: PLoS Computational Biology
> Editorial Board: Briefings in Bioinformatics

Thanks Scott!

chris


From Russell.Smithies at agresearch.co.nz  Sun Nov  4 16:00:37 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 5 Nov 2012 10:00:37 +1300
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>

What version of BLAST are you using?
There have been quite a few bug fixes, and I suspect any response from NCBI will suggest upgrading to the current version of BLAST+.


--Russell

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================


From cjfields at illinois.edu  Sun Nov  4 17:13:37 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 4 Nov 2012 22:13:37 +0000
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>

That in fact is the recommendation (migrate to BLAST+).

chris


From florent.angly at gmail.com  Sun Nov  4 19:46:44 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 05 Nov 2012 10:46:44 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50920D59.4010307@gmail.com>
References: <50920D59.4010307@gmail.com>
Message-ID: <50970C74.7070605@gmail.com>

I am planning on merging the branch with master this week.
Best,
Florent


On 01/11/12 15:49, Florent Angly wrote:
> Hi all,
>
> I was working with Ben Woodcroft on identifying ways to speed up
> Grinder, which relies heavily on Bioperl. Ben did some profiling with
> NYTProf and we realized that a lot of computation time was spent in
> Bio::PrimarySeq, doing calls to subseq() and length(). The sequences
> we used for the profiling were microbial genomes, i.e. sequences
> several Mbp long. A lot of the performance cost was associated with
> passing full genomes between functions. For example, a call to
> length() requests the full sequence from seq(), which returns it to
> length() (it makes a copy!). So, every call to length() is very
> expensive for long sequences. And there is a lot of code that calls
> length(), for error checking.
>
> I know that there are a few Bioperl modules that are more adapted to 
> handling very long sequences, e.g. Bio::DB::Fasta or 
> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at 
> Bio::PrimarySeq with Ben and I released this commit: 
> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
> But in fact, there were more things that I wanted to try to improve, 
> which led me to start this new branch: 
> https://github.com/bioperl/bioperl-live/tree/seqlength
>
> I wrote quite a few tests for functionalities that were not previously 
> covered by tests, and tried to improve the documentation. In addition, 
> to address the speed issue, I did some changes to Bio::PrimarySeq and 
> Bio::PrimarySeqI :
> * The length of a sequence is now computed as soon as the sequence is
> set, not afterwards. This way, there is no extra call to seq() (which
> would incur the cost of copying the entire sequence between functions).
> * The length is saved as an object attribute. So, calling length() is
> very cheap since it only needs to retrieve the stored value for the
> length.
> * There is a constructor argument called -direct, which skips sequence
> validation. However, it was only active in conjunction with the
> -ref_to_seq argument. To make -direct conform better to its
> documented purpose, I made -direct work when a sequence is set
> through -seq as well.
> * This brings us to trunc(), revcom() and other methods of
> Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq
> object from an existing (already validated!) Bio::PrimarySeq object,
> the new object can be constructed with the -direct argument, to
> save some time.
> * Finally, I noticed that subseq() used calls to eval() to do its
> work. eval() is notoriously slow and these calls were easily replaced
> by simple calls to substr() to save some time.
>
> A real-world test I performed with Grinder took 3 min 28 s before the
> changes (of which ~1 min is spent doing something unrelated). After the
> changes, the same test took only 2 min 28 s. So, it's quite a significant
> improvement, and on more specific test cases performance gains can
> obviously be much bigger. Also, I anticipate that the gains would be
> bigger for even longer sequences.
>
> All the changes I made are meant to be backward compatible and all the 
> tests in the Bioperl test suite passed. So, there _should_ not be any 
> issues. However, I know that Bio::PrimarySeq is a central module of 
> Bioperl, so please, have a look at it and let me know if there are any 
> glaring errors.
>
> Thanks,
>
> Florent
>
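
Three of the points in the quoted list (compute the length when the sequence is set, store it as an attribute, and use substr() in subseq()) can be sketched in plain Perl. This is only a toy illustration under stated assumptions: the package My::TinySeq and its internals are hypothetical and are not the Bio::PrimarySeq code from the branch.

```perl
use strict;
use warnings;

# Hypothetical toy class for illustration; this is not Bio::PrimarySeq.
package My::TinySeq;

sub new {
    my ($class, %args) = @_;
    my $self = bless { seq => '', length => 0 }, $class;
    $self->seq($args{-seq}) if defined $args{-seq};
    return $self;
}

# Setter/getter: the length is computed once, when the sequence is set.
sub seq {
    my ($self, $value) = @_;
    if (defined $value) {
        $self->{seq}    = $value;
        $self->{length} = CORE::length($value);
    }
    return $self->{seq};
}

# Cheap: returns the stored value instead of fetching a copy through seq().
sub length {
    my ($self) = @_;
    return $self->{length};
}

# 1-based, inclusive coordinates, extracted with a plain substr() call
# on the stored string (no eval, no copy of the whole sequence).
sub subseq {
    my ($self, $start, $end) = @_;
    die "Bad coordinates: $start..$end\n"
        if $start < 1 || $end > $self->{length} || $start > $end;
    return substr $self->{seq}, $start - 1, $end - $start + 1;
}

package main;

my $seq = My::TinySeq->new(-seq => 'ACGTACGTAC');
printf "length: %d\n", $seq->length;
printf "subseq(3,6): %s\n", $seq->subseq(3, 6);
```

The trade-off in this sketch is that every code path that changes the sequence has to go through the setter, otherwise the stored length goes stale.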


From cjfields at illinois.edu  Sun Nov  4 21:43:28 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 5 Nov 2012 02:43:28 +0000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50970C74.7070605@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>

Florent,

Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):

[cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t 
t/Seq/PrimarySeq.t .. 1/167 
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
---------------------------------------------------
t/Seq/PrimarySeq.t .. ok       
All tests successful.
Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
Result: PASS
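
Test::Warn (for example its warning_like function) is the tool suggested above. As a rough sketch of the same idea using only core modules, the snippet below swaps in a local $SIG{__WARN__} handler with Test::More instead. It assumes the warnings are emitted through Perl's warn; the validate routine and its message are hypothetical stand-ins, not BioPerl code.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical stand-in for a validation routine that warns on bad input;
# the real message comes from BioPerl and may differ.
sub validate {
    my ($seq) = @_;
    if ($seq =~ /([^A-Za-z*\-.]+)/) {
        warn "sequence doesn't validate, mismatch is $1\n";
        return 0;
    }
    return 1;
}

# Run a code block and return the warnings it emitted instead of printing them.
sub capture_warnings {
    my ($code) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    $code->();
    return @warnings;
}

my $ok;
my @caught = capture_warnings(sub { $ok = validate('ACGT!') });

is   $ok, 0, 'invalid sequence is rejected';
is   scalar(@caught), 1, 'exactly one warning was emitted';
like $caught[0], qr/mismatch is !/, 'warning names the offending character';

done_testing();
```

Captured this way, expected warnings become assertions and no longer clutter the prove output.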

chris


From shalabh.sharma7 at gmail.com  Mon Nov  5 12:03:38 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 12:03:38 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
Message-ID: <CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>

Hi All,
Thanks for all your responses.

Currently I am using an old version of blastall, 2.2.22.

@Peter: I will update my BLAST installation and see if the problem still
exists. But I can't restrict my BLAST searches by E-value because I work on
environmental samples. I have to reduce the size of my BLAST output files,
as I am only interested in the top hit and my data sets are really huge.
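
One possible way to keep the output small without relying on -v 1 -b 1 is to request a few hits per query and keep only the best-scoring one afterwards. The sketch below assumes tabular output (blastall -m 8) with 12 tab-separated columns, where column 1 is the query ID, column 2 the subject ID, column 11 the E-value and column 12 the bit score. The function name best_hits and the sample rows are made up for illustration, and this does not address the underlying blastall behaviour reported in this thread.

```perl
use strict;
use warnings;

# Keep only the highest-scoring hit for each query from tabular BLAST lines.
# Assumes 12 tab-separated columns: query ID in column 1, subject ID in
# column 2, E-value in column 11 and bit score in column 12.
sub best_hits {
    my (@lines) = @_;
    my %best;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*(#|$)/;    # skip comments and blank lines
        my @f = split /\t/, $line;
        next unless @f >= 12;
        my ($query, $subject, $evalue, $bits) = @f[0, 1, 10, 11];
        if (!exists $best{$query} || $bits > $best{$query}{bits}) {
            $best{$query} = { subject => $subject, evalue => $evalue, bits => $bits };
        }
    }
    return \%best;
}

# Made-up example rows (only IDs, E-value and bit score are meaningful here).
my @rows = (
    join("\t", 'q1', 'peg.487',  30.1, 100, 60, 3, 1, 100, 5, 104, '9e-07',  38),
    join("\t", 'q1', 'peg.1134', 98.5, 280, 4,  0, 1, 280, 1, 280, '1e-167', 570),
    join("\t", 'q1', 'peg.1133', 25.0, 40,  28, 2, 9, 48,  3, 42,  '1.0',    18),
);

my $best = best_hits(@rows);
for my $query (sort keys %$best) {
    printf "%s\t%s\t%s\t%s\n", $query, @{ $best->{$query} }{qw(subject evalue bits)};
}
```

Because the filter compares bit scores itself, it does not depend on the order in which hits are reported.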

Thanks
Shalabh

On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J <cjfields at illinois.edu
> wrote:

> That in fact is the recommendation (migrate to BLAST+).
>
> chris
>
> On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" <
> Russell.Smithies at agresearch.co.nz> wrote:
>
> > What version of blast are you using?
> > There have been quite a few bug fixes and I suspect any responses from
> NCBI will suggest upgrading to the current version of blast+
> >
> >
> > --Russell
> >
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:
> bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma
> > Sent: Saturday, 3 November 2012 3:50 a.m.
> > To: Fields, Christopher J
> > Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov
> > Subject: Re: [Bioperl-l] blast question
> >
> > I know, i am really worried about my past analysis now.
> > Thanks a lot for cc'ing this mail Chris.
> >
> > -Shalabh
> >
> > On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J <
> cjfields at illinois.edu
> >> wrote:
> >
> >> That's a scary error, but the best place to submit this would be the
> >> BLAST help list at NCBI (cc'd)
> >>
> >> chris
> >>
> >> On Nov 1, 2012, at 2:36 PM, shalabh sharma <shalabh.sharma7 at gmail.com>
> >> wrote:
> >>
> >>> Hi All,
> >>>         First of all i am really very sorry for posting blast
> >>> question
> >> in
> >>> this forum, I am not sure if this is the right place.
> >>> I will really appreciate if anyone can guide me to the right direction.
> >>>
> >>> I am using blastall to get a top hit from a database so i am using
> >>> -v 1
> >> -b
> >>> 1 (i hope this is right).
> >>> But the strange part is that i am getting wrong results.
> >>>
> >>> for example: if i use -v 1 -b 1 then for one of the hit i am getting
> >> this:
> >>>
> >>>
> >>> Sequences producing significant alignments:                      (bits)
> >>> Value
> >>>
> >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA
>  38
> >>> 4e-04
> >>>
> >>>
> >>> If i use -v 3 -b 3 then i am getting this for the same query:
> >>>
> >>> Sequences producing significant alignments:                      (bits)
> >>> Value
> >>>
> >>> fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...
> 570
> >>> e-167
> >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA
>  38
> >>> 9e-07
> >>> fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...
>  18
> >>> 1.0
> >>>
> >>> As you can see the top hit in the first case is totally wrong.
> >>>
> >>> I would really appreciate if someone can help me out, or direct to
> >>> in the right direction.
> >>>
> >>> Thanks
> >>> Shalabh
> >>>
> >>>
> >>>
> >>> --
> >>> Shalabh Sharma
> >>> Scientific Computing Professional Associate (Bioinformatics
> >>> Specialist) Department of Marine Sciences University of Georgia
> >>> Athens, GA 30602-3636
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >>
> >
> >
> > --
> > Shalabh Sharma
> > Scientific Computing Professional Associate (Bioinformatics Specialist)
> Department of Marine Sciences University of Georgia Athens, GA 30602-3636
> _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From Russell.Smithies at agresearch.co.nz  Mon Nov  5 16:04:07 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Tue, 6 Nov 2012 10:04:07 +1300
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>

If you're using an older version of blast there was a bug where not all results were returned - I think the limit was 10,000 hits?
Not usually a problem running basic queries but a big problem for environmental or metagenomic samples, or when aligning short reads.

--Russell

From: shalabh sharma [mailto:shalabh.sharma7 at gmail.com]
Sent: Tuesday, 6 November 2012 6:04 a.m.
To: Fields, Christopher J
Cc: Smithies, Russell; bioperl-l
Subject: Re: [Bioperl-l] blast question

Hi All,
         thanks for all your responses.

Currently I am using an old version of blastall, 2.2.22.

@Peter: I will update my BLAST and see if the problem still exists. But I can't restrict my BLAST with an E-value cutoff because I work on environmental samples. I have to reduce the size of my BLAST files, as I am only interested in the top hit and my data sets are really huge.

Thanks
Shalabh
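
For readers who cannot upgrade right away, one possible workaround is to ask BLAST for a few hits per query and keep only the best one afterwards. The sketch below is not from the thread: it assumes tab-separated output with the standard 12 columns (for example `-m 8` with legacy blastall or `-outfmt 6` with BLAST+), and the helper name `top_hits` and the file name in the usage line are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Keep the best-scoring hit per query from tabular BLAST output.
# Assumed columns: query, subject, %identity, length, mismatches,
# gap opens, q.start, q.end, s.start, s.end, e-value, bit score
sub top_hits {
    my ($fh) = @_;
    my %best;    # query id => [ subject id, e-value, bit score ]
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^#/ or $line !~ /\S/;    # skip comments and blank lines
        my @f = split /\t/, $line;
        next if @f < 12;                           # skip malformed rows
        my ( $query, $subject, $evalue, $bits ) = @f[ 0, 1, 10, 11 ];
        if ( !exists $best{$query} or $bits > $best{$query}[2] ) {
            $best{$query} = [ $subject, $evalue, $bits ];
        }
    }
    return \%best;
}

# Usage: perl top_hits.pl hits.tsv
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $best = top_hits($fh);
    close $fh;
    for my $query ( sort keys %$best ) {
        print join( "\t", $query, @{ $best->{$query} } ), "\n";
    }
}
```

Ranking by bit score rather than by position in the file means the result does not depend on the order in which BLAST happens to report the hits.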

On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J <cjfields at illinois.edu<mailto:cjfields at illinois.edu>> wrote:
That in fact is the recommendation (migrate to BLAST+).

chris

On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" <Russell.Smithies at agresearch.co.nz<mailto:Russell.Smithies at agresearch.co.nz>> wrote:

> What version of blast are you using?
> There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+
>
>
> --Russell
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org<mailto:bioperl-l-bounces at lists.open-bio.org> [mailto:bioperl-l-bounces at lists.open-bio.org<mailto:bioperl-l-bounces at lists.open-bio.org>] On Behalf Of shalabh sharma
> Sent: Saturday, 3 November 2012 3:50 a.m.
> To: Fields, Christopher J
> Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov<mailto:blast-help at ncbi.nlm.nih.gov>
> Subject: Re: [Bioperl-l] blast question
>
> I know, i am really worried about my past analysis now.
> Thanks a lot for cc'ing this mail Chris.
>
> -Shalabh
>
> On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J <cjfields at illinois.edu<mailto:cjfields at illinois.edu>
>> wrote:
>
>> That's a scary error, but the best place to submit this would be the
>> BLAST help list at NCBI (cc'd)
>>
>> chris
>>
>> On Nov 1, 2012, at 2:36 PM, shalabh sharma <shalabh.sharma7 at gmail.com<mailto:shalabh.sharma7 at gmail.com>>
>> wrote:
>>
>>> Hi All,
>>>         First of all i am really very sorry for posting blast
>>> question
>> in
>>> this forum, I am not sure if this is the right place.
>>> I will really appreciate if anyone can guide me to the right direction.
>>>
>>> I am using blastall to get a top hit from a database so i am using
>>> -v 1
>> -b
>>> 1 (i hope this is right).
>>> But the strange part is that i am getting wrong results.
>>>
>>> for example: if i use -v 1 -b 1 then for one of the hit i am getting
>> this:
>>>
>>>
>>> Sequences producing significant alignments:                      (bits) Value
>>>
>>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   4e-04
>>>
>>>
>>> If i use -v 3 -b 3 then i am getting this for the same query:
>>>
>>> Sequences producing significant alignments:                      (bits) Value
>>>
>>> fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...   570   e-167
>>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   9e-07
>>> fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...    18   1.0
>>>
>>> As you can see the top hit in the first case is totally wrong.
>>>
>>> I would really appreciate if someone can help me out, or direct to
>>> in the right direction.
>>>
>>> Thanks
>>> Shalabh
>>>
>>>
>>>
>>> --
>>> Shalabh Sharma
>>> Scientific Computing Professional Associate (Bioinformatics
>>> Specialist) Department of Marine Sciences University of Georgia
>>> Athens, GA 30602-3636
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>
>
> --
> Shalabh Sharma
> Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org<mailto:Bioperl-l at lists.open-bio.org>
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================


From shalabh.sharma7 at gmail.com  Mon Nov  5 16:09:03 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 16:09:03 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
Message-ID: <CAA7rn9ech--TYQdezH6fLLArjsTdypkNjSkQeO1LiaLTR1zoHQ@mail.gmail.com>

Hi All,
           Thanks for all the suggestions. The problem is fixed by using
the latest BLAST+.
Thanks
Shalabh

On Mon, Nov 5, 2012 at 4:04 PM, Smithies, Russell <
Russell.Smithies at agresearch.co.nz> wrote:

> If you're using an older version of blast there was a bug where not all
> results were returned - I think the limit was 10,000 hits?
>
> Not usually a problem running basic queries but a big problem for
> environmental or metagenomic samples, or when aligning short reads.
>
> --Russell
>
> From: shalabh sharma [mailto:shalabh.sharma7 at gmail.com]
> Sent: Tuesday, 6 November 2012 6:04 a.m.
> To: Fields, Christopher J
> Cc: Smithies, Russell; bioperl-l
> Subject: Re: [Bioperl-l] blast question
>
> Hi All,
>          thanks for all your responses.
>
> Currently i am using the old version of blastall 2.2.22.
>
> @Peter: I will update my blast and will see if the problem still exist.
> But i can't restrict my blast with e value because i work on environmental
> samples , i have to reduce the size of my blast files as i am only
> interested in the top hit and my data sets are really huge.
>
> Thanks
> Shalabh
>
> On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J <
> cjfields at illinois.edu> wrote:
>
> That in fact is the recommendation (migrate to BLAST+).
>
> chris
>
>
> On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" <
> Russell.Smithies at agresearch.co.nz> wrote:
>
> > What version of blast are you using?
> > There have been quite a few bug fixes and I suspect any responses from
> NCBI will suggest upgrading to the current version of blast+
> >
> >
> > --Russell
> >
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:
> bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma
> > Sent: Saturday, 3 November 2012 3:50 a.m.
> > To: Fields, Christopher J
> > Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov
> > Subject: Re: [Bioperl-l] blast question
> >
> > I know, i am really worried about my past analysis now.
> > Thanks a lot for cc'ing this mail Chris.
> >
> > -Shalabh
> >
> > On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J <
> cjfields at illinois.edu
> >> wrote:
> >
> >> That's a scary error, but the best place to submit this would be the
> >> BLAST help list at NCBI (cc'd)
> >>
> >> chris
> >>
> >> On Nov 1, 2012, at 2:36 PM, shalabh sharma <shalabh.sharma7 at gmail.com>
> >> wrote:
> >>
> >>> Hi All,
> >>>         First of all i am really very sorry for posting blast
> >>> question
> >> in
> >>> this forum, I am not sure if this is the right place.
> >>> I will really appreciate if anyone can guide me to the right direction.
> >>>
> >>> I am using blastall to get a top hit from a database so i am using
> >>> -v 1
> >> -b
> >>> 1 (i hope this is right).
> >>> But the strange part is that i am getting wrong results.
> >>>
> >>> for example: if i use -v 1 -b 1 then for one of the hit i am getting
> >> this:
> >>>
> >>>
> >>> Sequences producing significant alignments:                      (bits) Value
> >>>
> >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   4e-04
> >>>
> >>>
> >>> If i use -v 3 -b 3 then i am getting this for the same query:
> >>>
> >>> Sequences producing significant alignments:                      (bits) Value
> >>>
> >>> fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...   570   e-167
> >>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   9e-07
> >>> fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...    18   1.0
> >>>
> >>> As you can see the top hit in the first case is totally wrong.
> >>>
> >>> I would really appreciate if someone can help me out, or direct to
> >>> in the right direction.
> >>>
> >>> Thanks
> >>> Shalabh
> >>>
> >>>
> >>>
> >>> --
> >>> Shalabh Sharma
> >>> Scientific Computing Professional Associate (Bioinformatics
> >>> Specialist) Department of Marine Sciences University of Georgia
> >>> Athens, GA 30602-3636
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >>
> >
> >
> > --
> > Shalabh Sharma
> > Scientific Computing Professional Associate (Bioinformatics Specialist)
> Department of Marine Sciences University of Georgia Athens, GA 30602-3636
> _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> --
> Shalabh Sharma
> Scientific Computing Professional Associate (Bioinformatics Specialist)
> Department of Marine Sciences
> University of Georgia
> Athens, GA 30602-3636
>


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From florent.angly at gmail.com  Tue Nov  6 06:06:56 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Tue, 06 Nov 2012 21:06:56 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
Message-ID: <5098EF50.5040208@gmail.com>

Yes, good idea, Chris.

Actually, thinking about it, most of these warnings were redundant. So, 
I changed the behaviour of Bio::PrimarySeq::validate_seq() so that it 
issues exceptions if requested.
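
As a rough illustration of that behaviour (silent by default, an exception only when the caller asks for it), here is a minimal standalone sketch. It is not the code in Bio::PrimarySeq: the function name `validate_seq_sketch`, the set of allowed symbols, and the message text are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in, not the BioPerl implementation.
# Returns 1 if $seq contains only allowed symbols, 0 otherwise.
# When $throw is true, dies instead of returning 0.
sub validate_seq_sketch {
    my ( $seq, $throw ) = @_;
    return 1 unless defined $seq and length $seq;

    # Assumed alphabet: letters plus a few gap and ambiguity symbols
    my @bad = $seq =~ /([^A-Za-z\-\.\*\?=~]+)/g;
    return 1 unless @bad;

    my $msg = 'sequence does not validate, mismatch is ' . join( ',', @bad );
    die "$msg\n" if $throw;
    return 0;
}

# Quiet check: no warning and no exception, just a boolean
my $verdict = validate_seq_sketch('ACGT&') ? 'valid' : 'invalid';
print "$verdict\n";

# Strict check: the caller opts in to an exception and traps it
my $ok = eval { validate_seq_sketch( 'AC$GT+', 1 ) };
print "caught: $@" unless $ok;
```

With this shape a test file can check the return value for invalid input without producing any warning output, and test the exception separately.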

Florent


On 05/11/12 12:43, Fields, Christopher J wrote:
> Florent,
>
> Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):
>
> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t
> t/Seq/PrimarySeq.t .. 1/167
> --------------------- WARNING ---------------------
> MSG: Got a sequence without letters. Could not guess alphabet
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
> ---------------------------------------------------
> t/Seq/PrimarySeq.t .. ok
> All tests successful.
> Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
> Result: PASS
>
> chris
>
> On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
>   wrote:
>
>> I am planning on merging the branch with master this week.
>> Best,
>> Florent
>>
>>
>> On 01/11/12 15:49, Florent Angly wrote:
>>> Hi all,
>>>
>>> I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on Bioperl. Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were microbial genomes, i.e. several Mbp long sequences, which is quite long. A lot of the performance cost was associated with passing full genomes between functions. For example, when doing a call to length(), length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So, every call to length is very expensive for long sequences. And there is a lot of code that calls length(), for error checking.
>>>
>>> I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g. Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength
>>>
>>> I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to improve the documentation. In addition, to address the speed issue, I did some changes to Bio::PrimarySeq and Bio::PrimarySeqI :
>>> * The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no extra call to seq() (which would incur the cost of copying the entire sequence between functions).
>>> * The length is saved as an object attribute. So, calling length() is very cheap since it only needs to retrieve the stored value for the length.
>>> * There is a constructor argument called -direct, which skips sequence validation. However, it was only active in conjunction with the -ref_to_seq argument. To make -direct conform better to its documented purpose, I made -direct work when a sequence is set through -seq as well.
>>> * This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq object from an existing (already validated!) Bio::PrimarySeq object, the new object can be constructed with the -direct argument, to save some time.
>>> * Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and these calls were easily replaced by simple calls to substr() to save some time.
>>>
>>> A real-world test I performed with Grinder took 3m28s before the changes (and ~1 min is spent doing something unrelated). After the changes, the same test took only 2min28s. So, it's quite a significant improvement and on more specific test cases, performance gains can obviously be much bigger. Also, I anticipate that the gains would be bigger for even longer sequences.
>>>
>>> All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of Bioperl, so please, have a look at it and let me know if there are any glaring errors.
>>>
>>> Thanks,
>>>
>>> Florent
>>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From shlomif at shlomifish.org  Tue Nov  6 07:27:00 2012
From: shlomif at shlomifish.org (Shlomi Fish)
Date: Tue, 6 Nov 2012 14:27:00 +0200
Subject: [Bioperl-l] [Request] Please Help Add Some Information about
 Perl for Bio-Informatics to http://perl-begin.org/uses/bio-info/
In-Reply-To: <20121026192203.6d1e59c0@lap.shlomifish.org>
References: <20121026192203.6d1e59c0@lap.shlomifish.org>
Message-ID: <20121106142700.192f456e@lap.shlomifish.org>

Hi,

Can anyone help with that?

Regards,

	Shlomi Fish

On Fri, 26 Oct 2012 19:22:03 +0200
Shlomi Fish <shlomif at shlomifish.org> wrote:

> Hi all,
> 
> I am the maintainer of http://perl-begin.org/ , the Perl Beginners' Site. I
> had this page there for a long time, but it's empty:
> 
> http://perl-begin.org/uses/bio-info/
> 
> Can someone help me add some information there? A short XHTML page will be OK.
> For reference, see the other pages in the section
> ( http://perl-begin.org/uses/ ) such as:
> 
> * http://perl-begin.org/uses/web/
> 
> * http://perl-begin.org/uses/sys-admin/
> 
> * http://perl-begin.org/uses/qa/
> 
> Note that you agree that the content will be licensed under the Creative
> Commons Attribution 3.0 Unported License (or higher versions) and so you
> should make sure it is original.
> 
> I shall be obliged for any help.
> 
> Regards,
> 
> 	Shlomi Fish
> 


-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
Perl Humour - http://perl-begin.org/humour/

A wiseman can learn from a fool much more than a fool can ever learn from a
wiseman.               -- http://en.wikiquote.org/wiki/Cato_the_Elder

Please reply to list if it's a mailing list post - http://shlom.in/reply .


From florent.angly at gmail.com  Thu Nov 15 11:29:30 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Fri, 16 Nov 2012 02:29:30 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <5098EF50.5040208@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
	<5098EF50.5040208@gmail.com>
Message-ID: <50A5186A.4060304@gmail.com>

I have now merged the branch into master.
Best,
Florent
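
For readers following the thread, two of the ideas in the merged branch (storing the length when the sequence is set, and using substr() for subsequences) can be sketched roughly as below. This is an illustration only, not the Bio::PrimarySeq code: the class `Sketch::Seq` and its internals are hypothetical.

```perl
use strict;
use warnings;

package Sketch::Seq;    # hypothetical class, for illustration only

sub new {
    my ( $class, %args ) = @_;
    my $self = bless {}, $class;
    $self->seq( $args{seq} ) if defined $args{seq};
    return $self;
}

# Setting the sequence also stores its length, once
sub seq {
    my ( $self, $value ) = @_;
    if ( defined $value ) {
        $self->{seq}    = $value;
        $self->{length} = CORE::length($value);
    }
    return $self->{seq};    # returning the string still copies it
}

# Cheap: reads the stored number and never touches the sequence string
sub length {
    my ($self) = @_;
    return $self->{length} || 0;
}

# 1-based, inclusive coordinates, served by substr() on the stored string
sub subseq {
    my ( $self, $start, $end ) = @_;
    die "Bad coordinates $start..$end\n"
        if $start < 1 or $end < $start or $end > $self->length;
    return substr( $self->{seq}, $start - 1, $end - $start + 1 );
}

package main;

my $seq = Sketch::Seq->new( seq => 'ACGTACGTAC' );
print $seq->length, "\n";            # length comes from the cached value
print $seq->subseq( 3, 6 ), "\n";    # bases 3 to 6 of the stored string
```

The point of the design is that error checks which only need the length no longer pay for a copy of a multi-Mbp string on every call.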

On 06/11/12 21:06, Florent Angly wrote:
> Yes, good idea, Chris.
>
> Actually, thinking about it, most of these warnings were redundant. 
> So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that 
> it issues exceptions if requested.
>
> Florent
>
>
> On 05/11/12 12:43, Fields, Christopher J wrote:
>> Florent,
>>
>> Ran tests on it, they pass but I am seeing this (if these are 
>> expected, you can catch the warnings using Test::Warn):
>>
>> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr 
>> t/Seq/PrimarySeq.t
>> t/Seq/PrimarySeq.t .. 1/167
>> --------------------- WARNING ---------------------
>> MSG: Got a sequence without letters. Could not guess alphabet
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is 
>> \,$,+
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
>> ---------------------------------------------------
>> t/Seq/PrimarySeq.t .. ok
>> All tests successful.
>> Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys + 0.18 
>> cusr  0.01 csys =  0.23 CPU)
>> Result: PASS
>>
>> chris
>>
>> On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
>>   wrote:
>>
>>> I am planning on merging the branch with master this week.
>>> Best,
>>> Florent
>>>
>>>
>>> On 01/11/12 15:49, Florent Angly wrote:
>>>> Hi all,
>>>>
>>>> I was working with Ben Woodcroft on identifying ways to speed up 
>>>> Grinder, which relies heavily on Bioperl. Ben did some profiling 
>>>> with NYTProf and we realized that a lot of computation time was 
>>>> spent in Bio::PrimarySeq, doing calls to subseq() and length(). The 
>>>> sequences we used for the profiling were microbial genomes, i.e. 
>>>> several Mbp long sequences, which is quite long. A lot of the 
>>>> performance cost was associated with passing full genomes between 
>>>> functions. For example, when doing a call to length(), length() 
>>>> requests the full sequence from seq(), which returns it back to 
>>>> length() (it makes a copy!). So, every call to length is very 
>>>> expensive for long sequences. And there is a lot of code that calls 
>>>> length(), for error checking.
>>>>
>>>> I know that there are a few Bioperl modules that are more adapted 
>>>> to handling very long sequences, e.g. Bio::DB::Fasta or 
>>>> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look 
>>>> at Bio::PrimarySeq with Ben and I released this commit: 
>>>> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
>>>> But in fact, there were more things that I wanted to try to 
>>>> improve, which led me to start this new branch: 
>>>> https://github.com/bioperl/bioperl-live/tree/seqlength
>>>>
>>>> I wrote quite a few tests for functionalities that were not 
>>>> previously covered by tests, and tried to improve the 
>>>> documentation. In addition, to address the speed issue, I did some 
>>>> changes to Bio::PrimarySeq and Bio::PrimarySeqI :
>>>> * The length of a sequence is now computed as soon as the sequence
>>>> is set, not after. This way, there is no extra call to seq() (which
>>>> would incur the cost of copying the entire sequence between
>>>> functions).
>>>> * The length is saved as an object attribute. So, calling length()
>>>> is very cheap since it only needs to retrieve the stored value for
>>>> the length.
>>>> * There is a constructor argument called -direct, which skips
>>>> sequence validation. However, it was only active in conjunction
>>>> with the -ref_to_seq argument. To make -direct conform better to
>>>> its documented purpose, I made -direct work when a sequence is set
>>>> through -seq as well.
>>>> * This brings us to trunc(), revcom() and other methods of
>>>> Bio::PrimarySeqI. Since all these methods create a new
>>>> Bio::PrimarySeq object from an existing (already validated!)
>>>> Bio::PrimarySeq object, the new object can be constructed with the
>>>> -direct argument, to save some time.
>>>> * Finally, I noticed that subseq() used calls to eval() to do its
>>>> work. eval() is notoriously slow and these calls were easily
>>>> replaced by simple calls to substr() to save some time.
>>>>
>>>> A real-world test I performed with Grinder took 3m28s before the 
>>>> changes (and ~1 min is spent doing something unrelated). After the 
>>>> changes, the same test took only 2min28s. So, it's quite a 
>>>> significant improvement and on more specific test cases, 
>>>> performance gains can obviously be much bigger. Also, I anticipate 
>>>> that the gains would be bigger for even longer sequences.
>>>>
>>>> All the changes I made are meant to be backward compatible and all 
>>>> the tests in the Bioperl test suite passed. So, there _should_ not 
>>>> be any issues. However, I know that Bio::PrimarySeq is a central 
>>>> module of Bioperl, so please, have a look at it and let me know if 
>>>> there are any glaring errors.
>>>>
>>>> Thanks,
>>>>
>>>> Florent
>>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>


From mahakadry at aucegypt.edu  Tue Nov 20 13:44:53 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Tue, 20 Nov 2012 20:44:53 +0200
Subject: [Bioperl-l] Parsing a blast report with multiple queries into
 separate one query files that only contain the fasta sequences
Message-ID: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>

Dear BioPerl list,
I ran BLAST on a file containing several FASTA queries against nr. However,
I need to align each query with its hits for further computational analysis,
so I need to split the resulting BLAST report into several files, each
containing only one FASTA query sequence and its hits in FASTA format.
I found this script online:

use Bio::Search::Result::BlastResult;
use Bio::SearchIO;

my $report = Bio::SearchIO->new( -file => 'full-report.bls', -format => 'blast' );
my $result = $report->next_result;
my %hits_by_query;
while ( my $hit = $result->next_hit ) {
    push @{ $hits_by_query{ $hit->name } }, $hit;
}
foreach my $qid ( keys %hits_by_query ) {
    my $result = Bio::Search::Result::BlastResult->new();
    $result->add_hit($_) for ( @{ $hits_by_query{$qid} } );
    my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format => 'blast' );
    $blio->write_result($result);
}


however on using it this produced the following error message


BlastResult::new(): Not adding iterations.

------------- EXCEPTION: Bio::Root::NoSuchThing -------------
MSG: No such iteration number: 0. Valid range=1-0
VALUE: The number zero (0)
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
STACK: Bio::Search::Result::BlastResult::iteration
/usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
STACK: Bio::Search::Result::BlastResult::add_hit
/usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
STACK: ./parsing.blast.results.into.per.query.files.pl:15

I tried to search for other scripts but I couldn't find any.
I would really appreciate your comments on this.
Thank you


From cjfields at illinois.edu  Tue Nov 20 14:21:25 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Nov 2012 19:21:25 +0000
Subject: [Bioperl-l] Parsing a blast report with multiple queries into
 separate one query files that only contain the fasta sequences
In-Reply-To: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>
References: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF22E15@CITESMBX5.ad.uillinois.edu>

Maha,

Do you need only the sequence reported in the report (e.g. the HSP alignments) or the original FASTA sequences?  

The former can be recovered from the Bio::Search::HSP::GenericHSP objects as an alignment, and this can be redirected to a FASTA file.  The latter is a little trickier, as you will have to retrieve the sequences from their original source files.  
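
For the first case, a minimal sketch might look like the following. It is untested against your report and assumes plain-text BLAST output that Bio::SearchIO can read. The helper names (safe_filename, fasta_record, split_report) and the ".hits.fasta" suffix are made up for illustration. Note that it writes the aligned hit segments from each HSP with gaps removed, not the full-length database sequences.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a query name into a safe file name.
sub safe_filename {
    my ($name) = @_;
    $name =~ s/[^A-Za-z0-9._-]+/_/g;
    return $name;
}

# Hypothetical helper: format one FASTA record, removing alignment gaps
# and wrapping the sequence at 60 columns.
sub fasta_record {
    my ($id, $desc, $aligned) = @_;
    (my $seq = $aligned) =~ s/-//g;
    $seq =~ s/(.{1,60})/$1\n/g;
    my $header = (defined $desc && length $desc) ? ">$id $desc" : ">$id";
    return "$header\n$seq";
}

# Write one FASTA file per query, holding the hit segment of every HSP.
sub split_report {
    my ($report_file) = @_;
    require Bio::SearchIO;    # loaded lazily; needs BioPerl installed
    my $in = Bio::SearchIO->new(-file => $report_file, -format => 'blast');
    while (my $result = $in->next_result) {
        my $file = safe_filename($result->query_name) . '.hits.fasta';
        open my $out, '>', $file or die "Cannot write $file: $!";
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                # Add the hit coordinates so that IDs stay unique per HSP.
                my $id = sprintf '%s/%d-%d', $hit->name,
                    $hsp->start('hit'), $hsp->end('hit');
                print {$out} fasta_record($id, $hit->description,
                                          $hsp->hit_string);
            }
        }
        close $out or die "Cannot close $file: $!";
    }
    return;
}

# Run only when a report file is given on the command line.
split_report($ARGV[0]) if @ARGV;
```

The loop over next_result is the main difference from the script you found, which only reads the first result and then tries to build new BlastResult objects by hand.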

chris

On Nov 20, 2012, at 12:44 PM, maha ahmed <mahakadry at aucegypt.edu> wrote:

> Dear BioPerl list,
> I blasted a file that has several fasta queries against nr, however I need
> to align each query with its hits for further computational analysis so I
> need to parse the produced blast report into several files that each has
> only the fasta query sequence and its hits in fasta format.
> I found this script online,
> 
> use Bio::Search::Result::BlastResult;
> use Bio::SearchIO;
>
> my $report = Bio::SearchIO->new( -file => 'full-report.bls', -format => blast );
> my $result = $report->next_result;
> my %hits_by_query;
> while (my $hit = $result->next_hit) {
>   push @{$hits_by_query{$hit->name}}, $hit;
> }
> foreach my $qid ( keys %hits_by_query ) {
>   my $result = Bio::Search::Result::BlastResult->new();
>   $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
>   my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format => 'blast' );
>   $blio->write_result($result);
> }
> 
> 
> 
> however on using it this produced the following error message
> 
> 
> 
> BlastResult::new(): Not adding iterations.
> 
> ------------- EXCEPTION: Bio::Root::NoSuchThing -------------
> MSG: No such iteration number: 0. Valid range=1-0
> VALUE: The number zero (0)
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
> STACK: Bio::Search::Result::BlastResult::iteration
> /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
> STACK: Bio::Search::Result::BlastResult::add_hit
> /usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
> STACK: ./parsing.blast.results.into.per.query.files.pl:15
> 
> I tried to search for other scripts but I couldn't find any
> I would really appreciate your comments to this
> Thank you
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From rfhorns at gmail.com  Thu Nov  1 20:01:34 2012
From: rfhorns at gmail.com (Felix Horns)
Date: Fri, 02 Nov 2012 00:01:34 -0000
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
Message-ID: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>

Hello everyone.

I am having trouble using the get_Stream_by_query() function
in Bio::DB::GenBank.  It seems to return an empty stream, such that
$stream->next_seq never returns anything.

However, $query->count is returning the expected value (139).  Also,
get_Stream_by_query() seems to be querying the database, as when I pass it
an array of GeneIDs that have not been properly formatted, i.e.
GeneID:7816864, instead of simply 7816864, it returns warnings and errors:
"MSG: Warning(s) from GenBank: <PhraseNotFound>GeneID 7817709...; MSG:
Error from Genbank: No items found.".

I have included my full code below. I have also included the output from
the code below that.  The code is intended to find genes located within a
genomic region. I will later find the protein domains and pathways that
those genes are involved in.

Any help would be greatly appreciated.  I realize that this is probably a
very simple question, but I am relatively new to BioPerl and I've spent the
better part of the day trying to figure out such issues, so I would be very
thankful for help.

Felix


#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::DB::EntrezGene;
use Bio::DB::GenBank;

# Load reference sequence
# Load from local .gb file
# Note that .gb file does not include sequences
# my $gbfile = "NC_012660.1.gb";
# my $seqio = Bio::SeqIO->new(-file => $gbfile);
# my $ref_seq = $seqio->next_seq;

# To access reference sequence programatically, uncomment this code
my $gb = new Bio::DB::GenBank;
my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");

# Specify coordinates of gap
my $gap_start = 2050506;
my $gap_end = 2190530;

my $gene_count = 0;
my @features;
my @starts;
my @ends;
my @db_xrefs;

my @products;
my @protein_ids;

# Get gene features in gap
for my $feat ($ref_seq->get_SeqFeatures) {
  my $start=$feat->location->start;
  my $end=$feat->location->end;

  if (($feat->primary_tag eq 'gene') &
      ($gap_start < $start) & ($start < $gap_end) &
      ($gap_start < $end) & ($end < $gap_end)) {

    $gene_count += 1;

    # Get GeneID reference
    my $db_xref = ($feat->get_tag_values('db_xref'))[0];
    $db_xref =~ s/GeneID://;    # Trim "GeneID:" from start of $db_xref

    push @features, $feat;
    push @starts, $start;
    push @ends, $end;
    push @db_xrefs, $db_xref;
  }
}

# Get data about gene features from GeneID reference
my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
 -ids => [@db_xrefs]);
my $stream = $gb->get_Stream_by_query($query);

while (my $seq = $stream->next_seq) {
  for my $feat ($seq->all_SeqFeatures) {
    print "primary tag: ", $feat->primary_tag, "\n";
    for my $tag ($feat->get_all_tags) {
      print "  tag: ", $tag, "\n";
      for my $value ($feat->get_tag_values($tag)) {
        print "    value: ", $value, "\n";
      }
    }
  }
}

print $query->count,"\n";
print $gene_count, "\n";


OUTPUT
> perl analyze_gap.pl
139
139

Note that no "primary tag; tag; value" items are printed.  Furthermore,
when I put a print line immediately after the (while (my $seq =
$stream->next_seq)) statement, it was never called, seemingly indicating
that the stream is empty.


From mooldhu at gmail.com  Tue Nov  6 02:38:57 2012
From: mooldhu at gmail.com (胡江)
Date: Tue, 6 Nov 2012 15:38:57 +0800
Subject: [Bioperl-l] Ask for help about Bioperl
Message-ID: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>

Hi,
When I use BioPerl, it reports errors like this:

--------------------- WARNING ---------------------
MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
---------------------------------------------------
Error providing evidence type: GeneModel
The error was:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Attempting to set the sequence '1' to
[)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383


But I am sure that the input file only contains [ATGCN]. I also tried
other sequences, but the errors are the same. My BioPerl is bioperl-live
1.006902.

-- 
????


From assayagy at gmail.com  Sat Nov 10 13:27:03 2012
From: assayagy at gmail.com (eyla4ever)
Date: Sat, 10 Nov 2012 10:27:03 -0800 (PST)
Subject: [Bioperl-l] Extracting sequences from Genbank files
In-Reply-To: <CAJLmuDKPBA_DUtnfQjcAXr+JU=nL+orBP4cCgwBi=4BWQiYmpw@mail.gmail.com>
References: <CAJLmuDKPBA_DUtnfQjcAXr+JU=nL+orBP4cCgwBi=4BWQiYmpw@mail.gmail.com>
Message-ID: <34664632.post@talk.nabble.com>


hello Brian

I would like you to send me your script. I think it can help me solve a
big problem and finish my final project.
I hope that will be possible.

regards Eyla


BForde wrote:
> 
> Hello,
> 
> I have been modifying a script which extracts all the protein sequences
> from a genbank file and saves them in a multi-fasta file.
> 
> I wish the fasta header to have both the locus_tag of the protein and the
> product. However I cannot get the  product tag to write to the fasta
> header
> 
> this is the relevant section of the script
> 
>  $s->display_id($f->has_tag('locus_tag') ? join(',',sort
> $f->each_tag_value('locus_tag')) :
>                            $f->has_tag('product') ?
> join(',',$f->each_tag_value('product')):
>                            $s->display_id);
> 
> is "product" not an actual tag
> 
> regards
> 
> Brian
> 
> 
> 
> -- 
> Brian Forde
> Microbiology Dept.
> Bioscience Institute. Room 4.11
> University College Cork
> Cork
> Ireland
> tel:+353 21 4901306
> email: b.m.forde at umail.ucc.ie
> 
> 

-- 
View this message in context: http://old.nabble.com/Extracting-sequences-from-Genbank-files-tp33901023p34664632.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From bosborne11 at verizon.net  Tue Nov 20 18:50:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:50:00 -0500
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
In-Reply-To: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>
References: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>
Message-ID: <5F077DEA-DEBD-42BC-87E7-327697764CFE@verizon.net>

Felix,

I took a look at the Bio::DB::Query::GenBank documentation, it says this:

If you provide an array reference of IDs in -ids, the query will be ignored and the list of IDs will be used when the query is passed to a Bio::DB::GenBank object's get_Stream_by_query() method. 

Bio::DB::GenBank queries "nucleotide" by default. You have GeneIDs. I see that you're setting "-db" to "gene", but note that you're passing that query to a plain Bio::DB::GenBank object. Not sure what the expected behavior is here.

I would try using the NCBI Eutilities for that second query, rather than Bio::DB::Query::GenBank (http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook).
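
Something along these lines might work as a starting point. It is a sketch only: it assumes Bio::DB::EUtilities is installed (in newer BioPerl releases it may be a separate distribution) and that esummary on the "gene" database gives you enough information. The helper names gene_ids_from_xrefs and print_gene_summaries and the example e-mail address are placeholders, not part of BioPerl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only the GeneID cross-references of a feature
# and strip the "GeneID:" prefix (other db_xref values are dropped).
sub gene_ids_from_xrefs {
    my @xrefs = @_;
    return map { /^GeneID:(\d+)$/ ? $1 : () } @xrefs;
}

# Hypothetical helper: print the esummary fields for a list of Gene IDs.
sub print_gene_summaries {
    my ($ids, $email) = @_;
    require Bio::DB::EUtilities;    # loaded lazily; needs network access
    my $factory = Bio::DB::EUtilities->new(
        -eutil => 'esummary',
        -db    => 'gene',
        -id    => $ids,
        -email => $email,
    );
    while (my $docsum = $factory->next_DocSum) {
        print "GeneID: ", $docsum->get_id, "\n";
        while (my $item = $docsum->next_Item('flattened')) {
            my $content = $item->get_content;
            next unless defined $content && length $content;
            printf "  %-20s %s\n", $item->get_name, $content;
        }
    }
    return;
}

my @ids = gene_ids_from_xrefs('GeneID:7816864', 'GI:12345', 'GeneID:7817709');
print join(',', @ids), "\n";
# print_gene_summaries(\@ids, 'you@example.org');   # uncomment to query NCBI
```

Filtering on the "GeneID:" prefix also guards against features whose first db_xref is not a GeneID, which taking element [0] does not.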

Brian O.

On Nov 1, 2012, at 8:01 PM, Felix Horns <rfhorns at gmail.com> wrote:

> Hello everyone.
> 
> I am having trouble using the get_Stream_by_query() function
> in Bio::DB::GenBank.  It seems to return an empty stream, such that
> $stream->next_seq never returns anything.
> 
> However, $query->count is returning the expected value (139).  Also,
> get_Stream_by_query() seems to be querying the database, as when I pass it
> an array of GeneIDs that have not been properly formatted, i.e.
> GeneID:7816864, instead of simply 7816864, it returns warnings and errors:
> "MSG: Warning(s) from GenBank: <PhraseNotFound>GeneID 7817709...; MSG:
> Error from Genbank: No items found.".
> 
> I have included my full code below. I have also included the output from
> the code below that.  The code is intended to find genes located within a
> genomic region. I will later find the protein domains and pathways that
> those genes are involved in.
> 
> Any help would be greatly appreciated.  I realize that this is probably a
> very simple question, but I am relatively new to BioPerl and I've spent the
> better part of the day trying to figure out such issues, so I would be very
> thankful for help.
> 
> Felix
> 
> 
> #!/usr/bin/perl
> use strict;
> use Bio::SeqIO;
> use Bio::DB::EntrezGene;
> use Bio::DB::GenBank;
> 
> # Load reference sequence
> # Load from local .gb file
> # Note that .gb file does not include sequences
> # my $gbfile = "NC_012660.1.gb";
> # my $seqio = Bio::SeqIO->new(-file => $gbfile);
> # my $ref_seq = $seqio->next_seq;
> 
> # To access reference sequence programatically, uncomment this code
> my $gb = new Bio::DB::GenBank;
> my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");
> 
> # Specify coordinates of gap
> my $gap_start = 2050506;
> my $gap_end = 2190530;
> 
> my $gene_count = 0;
> my @features;
> my @starts;
> my @ends;
> my @db_xrefs;
> 
> my @products;
> my @protein_ids;
> 
> # Get gene features in gap
> for my $feat ($ref_seq->get_SeqFeatures) {
>  my $start=$feat->location->start;
>  my $end=$feat->location->end;
> 
>  if (($feat->primary_tag eq 'gene') &
>      ($gap_start < $start) & ($start < $gap_end) &
>      ($gap_start < $end) & ($end < $gap_end)) {
> 
>    $gene_count += 1;
> 
>    # Get GeneID reference
>    my $db_xref = ($feat->get_tag_values('db_xref'))[0];
>    $db_xref =~ s/GeneID://;    # Trim "GeneID:" from start of $db_xref
> 
>    push @features, $feat;
>    push @starts, $start;
>    push @ends, $end;
>    push @db_xrefs, $db_xref;
>  }
> }
> 
> # Get data about gene features from GeneID reference
> my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
> -ids => [@db_xrefs]);
> my $stream = $gb->get_Stream_by_query($query);
> 
> while (my $seq = $stream->next_seq) {
>  for my $feat ($seq->all_SeqFeatures) {
>    print "primary tag: ", $feat->primary_tag, "\n";
>    for my $tag ($feat->get_all_tags) {
>      print "  tag: ", $tag, "\n";
>      for my $value ($feat->get_tag_values($tag)) {
> print "    value: ", $value, "\n";
>      }
>    }
>  }
> }
> 
> print $query->count,"\n";
> print $gene_count, "\n";
> 
> 
> OUTPUT
>> perl analyze_gap.pl
> 139
> 139
> 
> Note that no "primary tag; tag; value" items are printed.  Furthermore,
> when I put a print line immediately after the (while (my $seq =
> $stream->next_seq)) statement, it was never called, seemingly indicating
> that the stream is empty.


From bosborne11 at verizon.net  Tue Nov 20 18:52:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:52:00 -0500
Subject: [Bioperl-l] Ask for help about Bioperl
In-Reply-To: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>
References: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>
Message-ID: <3B472C83-7F6C-41E2-A629-E6A2BDC6B075@verizon.net>

????,

You're going to have to show us your code; we can't help you just by seeing the error messages. Show us the input file as well, or the beginning of it.

Brian O.


On Nov 6, 2012, at 2:38 AM, ???? <mooldhu at gmail.com> wrote:

> hi,
> when I use bioperl ,it report errors like this :---------------------
> WARNING ---------------------
> MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
> ---------------------------------------------------
> Error providing evidence type: GeneModel
> The error was:
> 
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Attempting to set the sequence '1' to
> [)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
> STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
> STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
> STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383
> 
> 
> but,I am sure that the input file only cotain [ATGCN],I also try to use
> another sequences ,but the errors are the same.my bioperl is Bioperl-live
> 1.006902;
> 
> -- 
> ????
> 


From hlapp at drycafe.net  Tue Nov 20 21:24:50 2012
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Tue, 20 Nov 2012 21:24:50 -0500
Subject: [Bioperl-l] handle with file in perl
In-Reply-To: <34626730.post@talk.nabble.com>
References: <34626730.post@talk.nabble.com>
Message-ID: <1DE09B34-5124-478C-8925-0045EC119CFC@drycafe.net>

This sounds like a homework assignment. We're not here to do your homework or assignments for you. You can post if you run into a specific problem when solving your assignment with Bioperl, and we'll help with that. 

-hilmar

Sent with a tap.

On Oct 31, 2012, at 7:45 PM, eyla4ever <assayagy at gmail.com> wrote:

> 
> hi 
> 
> i want to write a function that get as parameters : file_name, hsp , hit.
> and i want her to print all the blast Field that i need to this file.
> 
> i do it because i have 2 files with the same Fields.
>        
> 
> 10X
> -- 
> View this message in context: http://old.nabble.com/handle-with-file-in-perl-tp34626730p34626730.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 


From mahakadry at aucegypt.edu  Fri Nov 23 20:33:59 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 03:33:59 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <CAE=MQgzV7Z-bQxiQFDuHHaEr=XL5=cTc6Mhq1Su7VPByZNEP6Q@mail.gmail.com>

Dear Bioperl list,
I have a folder that has 60,000 files (one file for each phylogenetic tree).
However, I only need to work with a subset of 1,000 files from that folder
(the files are not numbered in order, so I can't use an i++ loop in my
BioPerl script).
Is there a way to write a script that only moves the files whose names are
given in a list in a text file?
That is, I have a file that has the names of the files I want to copy from
the folder, and I want to write a script that does this.
Thank you so much


From kellert at ohsu.edu  Sat Nov 24 13:08:11 2012
From: kellert at ohsu.edu (Tom Keller)
Date: Sat, 24 Nov 2012 10:08:11 -0800
Subject: [Bioperl-l] use cookbook to work with a directory of files
In-Reply-To: <mailman.7.1353776405.32614.bioperl-l@lists.open-bio.org>
References: <mailman.7.1353776405.32614.bioperl-l@lists.open-bio.org>
Message-ID: <C969FE0E-18FE-4771-B031-22EEA42AEA77@ohsu.edu>

A search with the phrase "perl cookbook filenames from directory" should help you find what you need.

On Nov 24, 2012, at 9:00 AM, bioperl-l-request at lists.open-bio.org wrote:

> Date: Sat, 24 Nov 2012 03:33:59 +0200
> From: maha ahmed <mahakadry at aucegypt.edu>
> Subject: [Bioperl-l] retrieving a subset of files from a folder
> To: Bioperl-l at lists.open-bio.org
> Message-ID:
> 	<CAE=MQgzV7Z-bQxiQFDuHHaEr=XL5=cTc6Mhq1Su7VPByZNEP6Q at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Dear Bioperl list,
> I have a folder that has 60,000 files (one file for each phylogenetic tree)
> However I only need to work with a subset of 1,000 files from that folder
> (the files are not numbered in order so I cant use the i++ loop in my
> bioperl script)
> Is there a way to write a script that only moves files with the names given
> in a list in a text file
> i.e. I have a file that has the names of the files I want to copy fro m the
> folder and I want to write script that does this
> Thank you so much
> 
> 


From minou.nowrousian at rub.de  Sat Nov 24 13:24:02 2012
From: minou.nowrousian at rub.de (Minou Nowrousian)
Date: 24 Nov 2012 19:24:02 +0100
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <000001cdca70$e1a97720$a4fc6560$@rub.de>


> Dear Bioperl list,
> I have a folder that has 60,000 files (one file for each phylogenetic tree)
> However I only need to work with a subset of 1,000 files from that folder
> (the files are not numbered in order so I cant use the i++ loop in my
> bioperl script)
> Is there a way to write a script that only moves files with the names given
> in a list in a text file
> i.e. I have a file that has the names of the files I want to copy from the
> folder and I want to write script that does this
> Thank you so much

I don't know if there is a BioPerl solution, but you could use the
File::Copy module (available from CPAN):

use File::Copy;
copy("path_to_file_you_want_to_copy", "path_to_target_file")
    or die "Copy failed: $!";
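
To apply this to a whole list of names, a minimal sketch could look like the following. It assumes the list file has one file name per line; the function name copy_listed_files and the three command-line arguments are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Spec;

# Copy every file named in $list_file (one name per line) from $src_dir
# to $dest_dir. Returns the number of files copied. Names that do not
# exist in $src_dir are reported with a warning and skipped.
sub copy_listed_files {
    my ($list_file, $src_dir, $dest_dir) = @_;
    open my $list, '<', $list_file or die "Cannot read $list_file: $!";
    my $copied = 0;
    while (my $name = <$list>) {
        $name =~ s/^\s+|\s+$//g;    # trim whitespace and the newline
        next if $name eq '';        # ignore blank lines
        my $from = File::Spec->catfile($src_dir,  $name);
        my $to   = File::Spec->catfile($dest_dir, $name);
        if (!-f $from) {
            warn "Skipping $name: not found in $src_dir\n";
            next;
        }
        copy($from, $to) or die "Copy of $from failed: $!";
        $copied++;
    }
    close $list;
    return $copied;
}

# Usage: perl copy_listed.pl file_list.txt /path/to/src /path/to/dest
if (@ARGV == 3) {
    my $n = copy_listed_files(@ARGV);
    print "Copied $n file(s)\n";
}
```

Swapping copy for move (also exported by File::Copy) would move the files instead of copying them.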

Regards,
Minou


From mahakadry at aucegypt.edu  Sat Nov 24 14:04:09 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 21:04:09 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
In-Reply-To: <000001cdca70$e1a97720$a4fc6560$@rub.de>
References: <000001cdca70$e1a97720$a4fc6560$@rub.de>
Message-ID: <CAE=MQgztf_isVyt=WPF9LMXCtX4Q2U9vHL1AV+TwpueUjKayuw@mail.gmail.com>

Thanks everyone, I actually found a one-line command that I am going to
try:
xargs -a file_list.txt mv -t /path/to/des
Thanks for your help. I will have a look at the readings you suggested.
Thank you

On Sat, Nov 24, 2012 at 8:24 PM, Minou Nowrousian
<minou.nowrousian at rub.de> wrote:

>
> >Dear Bioperl list,
> >I have a folder that has 60,000 files (one file for each phylogenetic
> tree)
> However I only need to work with a subset of 1,000 files from that folder
> >(the files are not numbered in order so I cant use the i++ loop in my
> bioperl script) Is there a way to write a script that only moves files with
> the >names given in a list in a text file i.e. I have a file that has the
> names of the files I want to copy fro m the folder and I want to write
> script that does >this Thank you so much
>
> I don't know if there is a BioPerl solution, but you could use the
> File::Copy module (available from CPAN):
>
> use File::Copy;
>  copy("path_to_file_you_want_to_copy","path_to_target_file") or die "Copy
> failed: $!";
>
> Regards,
> Minou
>
>
>
>


From maj at fortinbras.us  Tue Nov 27 08:49:46 2012
From: maj at fortinbras.us (Mark A. Jensen)
Date: Tue, 27 Nov 2012 13:49:46 +0000
Subject: [Bioperl-l] Neo4j : applying user defined validation and constraints
Message-ID: <W2391426705276111354024186@webmail57>

Hi Folks,
Since there was some enthusiasm about REST::Neo4p, my interface to Neo4j, I thought I would let you know about
https://metacpan.org/module/REST::Neo4p::Constrain
This is a framework that lets you apply constraints on node and relationship properties, relationships, and relationship types. You can specify your constraints, and have REST::Neo4p throw exceptions when the constraints aren't met, or you can do validation on existing database items. The pod has a full explanation and examples aplenty.

Please have a look and send bugs my way via RT.
Cheers all,
MAJ


From francescomusacchia at gmail.com  Wed Nov 28 05:27:16 2012
From: francescomusacchia at gmail.com (Francesco Musacchia)
Date: Wed, 28 Nov 2012 02:27:16 -0800 (PST)
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
Message-ID: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>

Hi all,
I have a big problem using a GFF3 database with BioPerl. This is not a
question about how to write some BioPerl code. I find that when I have to
do a lot of accesses to a GFF database (with Bio::DB::SeqFeature::Store),
things get slower and slower, to the point that my script can run for more
than a day.

How can I solve this? Or can it not be done?

Thanks a lot!


From florent.angly at gmail.com  Thu Nov  1 05:49:13 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Thu, 01 Nov 2012 15:49:13 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
Message-ID: <50920D59.4010307@gmail.com>

Hi all,

I was working with Ben Woodcroft on identifying ways to speed up 
Grinder, which relies heavily on Bioperl. Ben did some profiling with 
NYTProf and we realized that a lot of computation time was spent in 
Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we 
used for the profiling were microbial genomes, i.e. several Mbp long 
sequences, which is quite long. A lot of the performance cost was 
associated with passing full genomes between functions. For example, 
when doing a call to length(), length() requests the full sequence from 
seq(), which returns it back to length() (it makes a copy!). So, every 
call to length is very expensive for long sequences. And there is a lot 
of code that calls length(), for error checking.

I know that there are a few Bioperl modules that are more adapted to 
handling very long sequences, e.g. Bio::DB::Fasta or 
Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at 
Bio::PrimarySeq with Ben and I released this commit: 
https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
But in fact, there were more things that I wanted to try to improve, 
which led me to start this new branch: 
https://github.com/bioperl/bioperl-live/tree/seqlength

I wrote quite a few tests for functionalities that were not previously 
covered by tests, and tried to improve the documentation. In addition, 
to address the speed issue, I did some changes to Bio::PrimarySeq and 
Bio::PrimarySeqI :
* The length of a sequence is now computed as soon as the sequence is
  set, rather than later on demand. This way, there is no extra call to
  seq() (which would incur the cost of copying the entire sequence
  between functions).
* The length is saved as an object attribute. So, calling length() is
  very cheap since it only needs to retrieve the stored value for the
  length.
* There is a constructor argument called -direct, which skips sequence
  validation. However, it was only active in conjunction with the
  -ref_to_seq argument. To make -direct conform better to its
  documented purpose, I made -direct work when a sequence is set
  through -seq as well.
* This brings us to trunc(), revcom() and other methods of
  Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq
  object from an existing (already validated!) Bio::PrimarySeq object,
  the new object can be constructed with the -direct argument, to save
  some time.
* Finally, I noticed that subseq() used calls to eval() to do its work.
  eval() is notoriously slow and these calls were easily replaced by
  simple calls to substr() to save some time.
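
As a rough illustration of the length caching and of the substr()-based subseq(), here is a toy class. It is a simplified sketch and not the actual Bio::PrimarySeq code: the package name ToySeq, its attribute names and its error message are invented for the example, and it does no sequence validation at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package ToySeq;

sub new {
    my ($class, %args) = @_;
    my $self = bless { seq => '', length => 0 }, $class;
    $self->seq($args{-seq}) if defined $args{-seq};
    return $self;
}

# Getter/setter: the length is computed once, when the sequence is set.
sub seq {
    my ($self, $value) = @_;
    if (defined $value) {
        $self->{seq}    = $value;
        $self->{length} = CORE::length($value);
    }
    return $self->{seq};
}

# Cheap: returns the stored value instead of asking seq() for a copy.
sub length { return $_[0]{length} }

# 1-based, inclusive coordinates, implemented with substr(), not eval().
sub subseq {
    my ($self, $start, $end) = @_;
    die "Bad range $start..$end\n"
        if $start < 1 || $end > $self->{length} || $start > $end;
    return substr $self->{seq}, $start - 1, $end - $start + 1;
}

package main;

my $s = ToySeq->new(-seq => 'ACGTACGTAA');
print $s->length, "\n";          # prints 10
print $s->subseq(3, 6), "\n";    # prints GTAC
```

The point is that length() never touches the sequence string, so its cost no longer depends on how long the sequence is.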

A real-world test I performed with Grinder took 3 min 28 s before the
changes (of which ~1 min is spent doing something unrelated). After the
changes, the same test took only 2 min 28 s. So, it's quite a significant
improvement, and on more specific test cases the performance gains can be
much bigger. I also anticipate that the gains would be bigger for even
longer sequences.

All the changes I made are meant to be backward compatible and all the 
tests in the Bioperl test suite passed. So, there _should_ not be any 
issues. However, I know that Bio::PrimarySeq is a central module of 
Bioperl, so please, have a look at it and let me know if there are any 
glaring errors.

Thanks,

Florent


From shalabh.sharma7 at gmail.com  Thu Nov  1 19:36:35 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Thu, 1 Nov 2012 15:36:35 -0400
Subject: [Bioperl-l] blast question
Message-ID: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>

Hi All,
First of all, I am very sorry for posting a BLAST question in this forum;
I am not sure if this is the right place.
I would really appreciate it if anyone could point me in the right direction.

I am using blastall to get the top hit from a database, so I am using
-v 1 -b 1 (I hope this is right).
The strange part is that I am getting wrong results.

For example, with -v 1 -b 1 I get this for one of the queries:


Sequences producing significant alignments:                      (bits) Value

fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   4e-04


If I use -v 3 -b 3, I get this for the same query:

Sequences producing significant alignments:                      (bits) Value

fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...   570   e-167
fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38   9e-07
fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...    18   1.0

As you can see, the top hit in the first case is totally wrong.

I would really appreciate it if someone could help me out or point me in
the right direction.

Thanks
Shalabh


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From cjfields at illinois.edu  Thu Nov  1 21:41:43 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Nov 2012 21:41:43 +0000
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>

That's a scary error, but the best place to submit this would be the BLAST help list at NCBI (cc'd)

chris

On Nov 1, 2012, at 2:36 PM, shalabh sharma <shalabh.sharma7 at gmail.com> wrote:

> Hi All,
>          First of all i am really very sorry for posting blast question in
> this forum, I am not sure if this is the right place.
> I will really appreciate if anyone can guide me to the right direction.
> 
> I am using blastall to get a top hit from a database so i am using -v 1 -b
> 1 (i hope this is right).
> But the strange part is that i am getting wrong results.
> 
> for example: if i use -v 1 -b 1 then for one of the hit i am getting this:
> 
> 
> Sequences producing significant alignments:                      (bits)
> Value
> 
> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38
> 4e-04
> 
> 
> If i use -v 3 -b 3 then i am getting this for the same query:
> 
> Sequences producing significant alignments:                      (bits)
> Value
> 
> fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...   570
> e-167
> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA          38
> 9e-07
> fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...    18
> 1.0
> 
> As you can see the top hit in the first case is totally wrong.
> 
> I would really appreciate if someone can help me out, or direct to in the
> right direction.
> 
> Thanks
> Shalabh
> 
> 
> 
> -- 
> Shalabh Sharma
> Scientific Computing Professional Associate (Bioinformatics Specialist)
> Department of Marine Sciences
> University of Georgia
> Athens, GA 30602-3636


From shalabh.sharma7 at gmail.com  Fri Nov  2 14:50:17 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Fri, 2 Nov 2012 10:50:17 -0400
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
Message-ID: <CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>

I know; I am really worried about my past analysis now.
Thanks a lot for cc'ing this mail, Chris.

-Shalabh

On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> That's a scary error, but the best place to submit this would be the BLAST
> help list at NCBI (cc'd)
>
> chris


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From Scott.Markel at accelrys.com  Sat Nov  3 00:13:59 2012
From: Scott.Markel at accelrys.com (Scott Markel)
Date: Fri, 2 Nov 2012 17:13:59 -0700
Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB
 file format specification change
Message-ID: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>

In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html).  PDB now writes out to column 79, while pdb.pm is still using the old line length of 71.  Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated.

Some of the Perl lines are really simple, e.g.,

	$keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);

with others being just a little more detailed, e.g.,

	my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;

It doesn't look like pdb.pm has changed in about 1.5 years.  Is there a current module owner?  Or someone else working on this?

If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file.  Please let us know which is preferred.
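
A minimal sketch of the truncation described above, using a synthetic 79-column line rather than a real PDB record. The first template is the one quoted earlier in this message; its fields add up to 70 columns. The second template, with A60, is only an assumed widening to column 79 and would need to be checked against the specification for each record type.

```perl
use strict;
use warnings;

# Synthetic JRNL-style line: record name, sub-record, then 60 characters
# of text occupying columns 20-79. Not taken from a real PDB entry.
my $text = 'T' x 60;
my $line = sprintf '%-6s%6s%-4s%2s %s', 'JRNL', '', 'TITL', '', $text;

# Template quoted in the message: 6 + 6 + 4 + 2 + 1 + 51 = 70 columns.
my @old = unpack 'A6 x6 A4 A2 x1 A51', $line;

# Hypothetical wider template (an assumption, not the committed fix):
# 6 + 6 + 4 + 2 + 1 + 60 = 79 columns.
my @new = unpack 'A6 x6 A4 A2 x1 A60', $line;

printf "line length:                %d\n", length $line;
printf "old template keeps %d characters of text\n", length $old[3];
printf "new template keeps %d characters of text\n", length $new[3];
```

With the old template the last nine characters of the text field are silently dropped, which would match the truncated journal titles we saw.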

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications and Communications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics


From cjfields at illinois.edu  Sat Nov  3 02:08:52 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 3 Nov 2012 02:08:52 +0000
Subject: [Bioperl-l] Structure::IO::pdb update needed to comply with PDB
 file format specification change
In-Reply-To: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>
References: <5ACBA19439E77B43A06F4CAB897EC97706C357BBC5@EXCH1-COLO.accelrys.net>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC817F8@CHIMBX5.ad.uillinois.edu>

On Nov 2, 2012, at 7:13 PM, Scott Markel <Scott.Markel at accelrys.com> wrote:

> In tracking down a regression test failure we discovered that the Structure::IO::pdb module is out of date relative to the PDB file format specification (http://www.wwpdb.org/docs.html).  PDB now writes out to column 79, while pdb.pm is still using the old line length of 71.  Our regression failure was caused by a reformatting of 1CRN; journal titles started getting truncated.
> 
> Some of the Perl lines are really simple, e.g.,
> 
> 	$keywds = $self->_read_PDB_singlecontline("KEYWDS","11-70",\$buffer);
> 
> with others being just a little more detailed, e.g.,
> 
> 	my ($rec, $subr, $cont, $rol) = unpack "A6 x6 A4 A2 x1 A51", $_;
> 
> It doesn't look like pdb.pm has changed in about 1.5 years.  Is there a current module owner?  Or someone else working on this?

No one has really taken ownership, so as far as I'm concerned it's open.  Any objections?

> If not, we're willing to either compile a list of needed changes (walking through the PDB file format specification and comparing the corresponding column indices in pdb.pm) or provide a new version of the entire file.  Please let us know which is preferred.

A new version of the file is fine if you have someone who can work on it.  We would also like to change relevant tests and documentation if there is time.

> Scott
> 
> Scott Markel, Ph.D.
> Principal Bioinformatics Architect  email:  smarkel at accelrys.com
> Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
> 10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
> San Diego, CA 92121                 fax:    +1 858 799 5222
> USA                                 web:    http://www.accelrys.com
> 
> http://www.linkedin.com/in/smarkel
> Secretary, Board of Directors:
>     International Society for Computational Biology
> Chair: ISCB Publications and Communications Committee
> Associate Editor: PLoS Computational Biology
> Editorial Board: Briefings in Bioinformatics

Thanks Scott!

chris


From Russell.Smithies at agresearch.co.nz  Sun Nov  4 21:00:37 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 5 Nov 2012 10:00:37 +1300
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>

What version of BLAST are you using?
There have been quite a few bug fixes, and I suspect any responses from NCBI will suggest upgrading to the current version of BLAST+.


--Russell

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma
Sent: Saturday, 3 November 2012 3:50 a.m.
To: Fields, Christopher J
Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov
Subject: Re: [Bioperl-l] blast question

I know, i am really worried about my past analysis now.
Thanks a lot for cc'ing this mail Chris.

-Shalabh



From cjfields at illinois.edu  Sun Nov  4 22:13:37 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 4 Nov 2012 22:13:37 +0000
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>

That in fact is the recommendation (migrate to BLAST+).

chris

On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" <Russell.Smithies at agresearch.co.nz> wrote:

> What version of blast are you using?
> There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+
> 
> 
> --Russell


From florent.angly at gmail.com  Mon Nov  5 00:46:44 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 05 Nov 2012 10:46:44 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50920D59.4010307@gmail.com>
References: <50920D59.4010307@gmail.com>
Message-ID: <50970C74.7070605@gmail.com>

I am planning on merging the branch with master this week.
Best,
Florent


On 01/11/12 15:49, Florent Angly wrote:
> Hi all,
>
> I was working with Ben Woodcroft on identifying ways to speed up 
> Grinder, which relies heavily on Bioperl. Ben did some profiling with 
> NYTProf and we realized that a lot of computation time was spent in 
> Bio::PrimarySeq, doing calls to subseq() and length(). The sequences 
> we used for the profiling were microbial genomes, i.e. several Mbp 
> long sequences, which is quite long. A lot of the performance cost was 
> associated with passing full genomes between functions. For example, 
> when doing a call to length(), length() requests the full sequence 
> from seq(), which returns it back to length() (it makes a copy!). So, 
> every call to length is very expensive for long sequences. And there 
> is a lot of code that calls length(), for error checking.
>
> I know that there are a few Bioperl modules that are more adapted to 
> handling very long sequences, e.g. Bio::DB::Fasta or 
> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at 
> Bio::PrimarySeq with Ben and I released this commit: 
> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
> But in fact, there were more things that I wanted to try to improve, 
> which led me to start this new branch: 
> https://github.com/bioperl/bioperl-live/tree/seqlength
>
> I wrote quite a few tests for functionalities that were not previously 
> covered by tests, and tried to improve the documentation. In addition, 
> to address the speed issue, I did some changes to Bio::PrimarySeq and 
> Bio::PrimarySeqI :
> - The length of a sequence is now computed as soon as the sequence is
> set, not after. This way, there is no extra call to seq() (which would
> incur the cost of copying the entire sequence between functions).
> - The length is saved as an object attribute. So, calling length() is
> very cheap since it only needs to retrieve the stored value for the
> length.
> - There is a constructor argument called -direct, which skips sequence
> validation. However, it was only active in conjunction with the
> -ref_to_seq constructor argument. To make -direct conform better to its
> documented purpose, I made -direct work when a sequence is set
> through -seq as well.
> - This brings us to trunc(), revcom() and other methods of
> Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq
> object from an existing (already validated!) Bio::PrimarySeq object,
> the new object can be constructed with the -direct argument, to
> save some time.
> - Finally, I noticed that subseq() used calls to eval() to do its
> work. eval() is notoriously slow and these calls were easily replaced
> by simple calls to substr() to save some time.
>
> A real-world test I performed with Grinder took 3m28s before the 
> changes (and ~1 min is spent doing something unrelated). After the 
> changes, the same test took only 2min28s. So, it's quite a significant 
> improvement and on more specific test cases, performance gains can 
> obviously be much bigger. Also, I anticipate that the gains would be 
> bigger for even longer sequences.
>
> All the changes I made are meant to be backward compatible and all the 
> tests in the Bioperl test suite passed. So, there _should_ not be any 
> issues. However, I know that Bio::PrimarySeq is a central module of 
> Bioperl, so please, have a look at it and let me know if there are any 
> glaring errors.
>
> Thanks,
>
> Florent
>
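
The length-caching idea in the quoted message can be sketched roughly as follows. This is a minimal illustration, not the code in the seqlength branch: CachedLengthSeq is a hypothetical class, and the real Bio::PrimarySeq does considerably more (validation, alphabet guessing, and so on).

```perl
use strict;
use warnings;

package CachedLengthSeq;    # hypothetical class, not Bio::PrimarySeq

sub new {
    my ($class, %args) = @_;
    my $self = bless { seq => '', length => 0 }, $class;
    $self->seq($args{seq}) if defined $args{seq};
    return $self;
}

# Getter/setter: the length is recomputed only when the sequence is set.
sub seq {
    my ($self, $new_seq) = @_;
    if (defined $new_seq) {
        $self->{seq}    = $new_seq;
        $self->{length} = CORE::length($new_seq);
    }
    return $self->{seq};
}

# Returns the stored value, so the sequence string is never fetched
# through seq() just to measure it.
sub length {
    my ($self) = @_;
    return $self->{length};
}

package main;

my $obj = CachedLengthSeq->new(seq => 'ACGT' x 1000);
print $obj->length, "\n";
```

Calling length() here costs one hash lookup regardless of sequence size, which is the point of storing it as an attribute.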


From cjfields at illinois.edu  Mon Nov  5 02:43:28 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 5 Nov 2012 02:43:28 +0000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <50970C74.7070605@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>

Florent,

Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):

[cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t 
t/Seq/PrimarySeq.t .. 1/167 
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
---------------------------------------------------
t/Seq/PrimarySeq.t .. ok       
All tests successful.
Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
Result: PASS
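
A minimal sketch of catching an expected warning inside a test, using only core modules. Test::Warn's warning_like packages the same idea more conveniently. Here check_seq is a hypothetical stand-in for the validation code, and the sketch assumes the warning is emitted through Perl's warn.

```perl
use strict;
use warnings;
use Test::More tests => 2;

# Hypothetical validator: warns and returns 0 on non-letter characters.
sub check_seq {
    my ($seq) = @_;
    if ($seq =~ /([^A-Za-z]+)/) {
        warn "sequence doesn't validate, mismatch is $1\n";
        return 0;
    }
    return 1;
}

my @warnings;
my $ok;
{
    # Collect warnings instead of letting them reach STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    $ok = check_seq('ACGT!');
}

is($ok, 0, 'invalid sequence is rejected');
like($warnings[0], qr/mismatch is !/, 'expected warning was caught');
```

The warning then becomes something the test asserts on rather than noise in the prove output.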

chris

On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com> wrote:

> I am planning on merging the branch with master this week.
> Best,
> Florent


From shalabh.sharma7 at gmail.com  Mon Nov  5 17:03:38 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 12:03:38 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
Message-ID: <CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>

Hi All,
         thanks for all your responses.

Currently I am using an old version of blastall, 2.2.22.

@Peter: I will update my BLAST and see if the problem still exists. But
I can't restrict my BLAST search by E-value because I work on environmental
samples. I have to reduce the size of my BLAST output files, as I am only
interested in the top hit and my data sets are really huge.
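
One way to keep only the top hit without relying on -v/-b would be to post-filter tabular output (-m 8 in blastall, -outfmt 6 in BLAST+) by bit score. A minimal sketch, assuming the standard 12-column tabular layout: the query ID is hypothetical, the subject IDs, E-values and bit scores are the ones from this thread, and the unused columns are filled with '.' placeholders rather than real values.

```perl
use strict;
use warnings;

# Build one tabular-style row; the eight middle columns (identity,
# alignment length, coordinates, ...) are placeholders, not real data.
sub row {
    my ($query, $subject, $evalue, $bits) = @_;
    return join "\t", $query, $subject, ('.') x 8, $evalue, $bits;
}

my @rows = (
    row('query1', 'fig|6666666.11092.peg.487',  '9e-07', 38),
    row('query1', 'fig|6666666.11092.peg.1134', 'e-167', 570),
    row('query1', 'fig|6666666.11092.peg.1133', '1.0',   18),
);

my %best;    # query ID => [subject ID, bit score]
for my $row (@rows) {
    my @f = split /\t/, $row;
    my ($query, $subject, $bits) = @f[0, 1, 11];
    if (!exists $best{$query} || $bits > $best{$query}[1]) {
        $best{$query} = [$subject, $bits];
    }
}

for my $query (sort keys %best) {
    print join("\t", $query, @{ $best{$query} }), "\n";
}
```

The same loop could read rows from a pipe instead of an array, so the full output never has to be written to disk.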

Thanks
Shalabh

On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> That in fact is the recommendation (migrate to BLAST+).
>
> chris


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From Russell.Smithies at agresearch.co.nz  Mon Nov  5 21:04:07 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Tue, 6 Nov 2012 10:04:07 +1300
Subject: [Bioperl-l] blast question
In-Reply-To: <CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>

If you're using an older version of blast there was a bug where not all results were returned - I think the limit was 10,000 hits?
Not usually a problem running basic queries but a big problem for environmental or metagenomic samples, or when aligning short reads.

--Russell

From: shalabh sharma [mailto:shalabh.sharma7 at gmail.com]
Sent: Tuesday, 6 November 2012 6:04 a.m.
To: Fields, Christopher J
Cc: Smithies, Russell; bioperl-l
Subject: Re: [Bioperl-l] blast question

Hi All,
         thanks for all your responses.

Currently i am using the old version of blastall 2.2.22.

@Peter: I will update my BLAST and see if the problem still exists. But I can't restrict my BLAST search by E-value because I work on environmental samples. I have to reduce the size of my BLAST output files, as I am only interested in the top hit and my data sets are really huge.

Thanks
Shalabh

On Sun, Nov 4, 2012 at 5:13 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:
That in fact is the recommendation (migrate to BLAST+).

chris

On Nov 4, 2012, at 3:00 PM, "Smithies, Russell" <Russell.Smithies at agresearch.co.nz> wrote:

> What version of blast are you using?
> There have been quite a few bug fixes and I suspect any responses from NCBI will suggest upgrading to the current version of blast+
>
>
> --Russell
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma
> Sent: Saturday, 3 November 2012 3:50 a.m.
> To: Fields, Christopher J
> Cc: bioperl-l; blast-help at ncbi.nlm.nih.gov
> Subject: Re: [Bioperl-l] blast question
>
> I know, i am really worried about my past analysis now.
> Thanks a lot for cc'ing this mail Chris.
>
> -Shalabh
>
> On Thu, Nov 1, 2012 at 5:41 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:
>
>> That's a scary error, but the best place to submit this would be the
>> BLAST help list at NCBI (cc'd)
>>
>> chris
>>
>> On Nov 1, 2012, at 2:36 PM, shalabh sharma <shalabh.sharma7 at gmail.com> wrote:
>>
>>> Hi All,
>>>         First of all i am really very sorry for posting blast question in
>>> this forum, I am not sure if this is the right place.
>>> I will really appreciate if anyone can guide me to the right direction.
>>>
>>> I am using blastall to get a top hit from a database so i am using
>>> -v 1 -b 1 (i hope this is right).
>>> But the strange part is that i am getting wrong results.
>>>
>>> for example: if i use -v 1 -b 1 then for one of the hit i am getting this:
>>>
>>> Sequences producing significant alignments:                          (bits)  Value
>>>
>>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA            38  4e-04
>>>
>>>
>>> If i use -v 3 -b 3 then i am getting this for the same query:
>>>
>>> Sequences producing significant alignments:                          (bits)  Value
>>>
>>> fig|6666666.11092.peg.1134 COG3118: Thioredoxin domain-containin...     570  e-167
>>> fig|6666666.11092.peg.487 Thiol:disulfide oxidoreductase TlpA            38  9e-07
>>> fig|6666666.11092.peg.1133 Exodeoxyribonuclease III (EC 3.1.11.2...      18  1.0
>>>
>>> As you can see the top hit in the first case is totally wrong.
>>>
>>> I would really appreciate if someone can help me out, or direct to
>>> in the right direction.
>>>
>>> Thanks
>>> Shalabh


--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================


From shalabh.sharma7 at gmail.com  Mon Nov  5 21:09:03 2012
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Mon, 5 Nov 2012 16:09:03 -0500
Subject: [Bioperl-l] blast question
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
References: <CAA7rn9d5hY5thb04ChyBQXwfbNasROiGtU4XSaDuMKtea7HeOA@mail.gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC801CB@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9csi-1gBd-CSdEv2XBKN+uF5uF5JzhXXSHtinv-TVLF-g@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2A6005A@exchsth.agresearch.co.nz>
	<118F034CF4C3EF48A96F86CE585B94BF3FC82221@CHIMBX5.ad.uillinois.edu>
	<CAA7rn9f1GOsib-PhNAsQo327dQznMn-15UZz7w49iYKjg-VLHg@mail.gmail.com>
	<18DF7D20DFEC044098A1062202F5FFF34CD2AD5269@exchsth.agresearch.co.nz>
Message-ID: <CAA7rn9ech--TYQdezH6fLLArjsTdypkNjSkQeO1LiaLTR1zoHQ@mail.gmail.com>

Hi All,
           Thanks for all the suggestions. The problem was fixed by using
the latest blast+.
Thanks
Shalabh

On Mon, Nov 5, 2012 at 4:04 PM, Smithies, Russell <
Russell.Smithies at agresearch.co.nz> wrote:

> If you're using an older version of blast there was a bug where not all
> results were returned - I think the limit was 10,000 hits?
>
> Not usually a problem running basic queries but a big problem for
> environmental or metagenomic samples, or when aligning short reads.
>
> --Russell
>
> From: shalabh sharma [mailto:shalabh.sharma7 at gmail.com]
> Sent: Tuesday, 6 November 2012 6:04 a.m.
> To: Fields, Christopher J
> Cc: Smithies, Russell; bioperl-l
> Subject: Re: [Bioperl-l] blast question
>
> Hi All,
>          thanks for all your responses.
>
> Currently I am using the old version of blastall, 2.2.22.
>
> @Peter: I will update my blast and see if the problem still exists.
> But I can't restrict my blast by E-value because I work on environmental
> samples; I have to reduce the size of my blast files, as I am only
> interested in the top hit and my data sets are really huge.
>
> Thanks
> Shalabh


-- 
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636


From florent.angly at gmail.com  Tue Nov  6 11:06:56 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Tue, 06 Nov 2012 21:06:56 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
Message-ID: <5098EF50.5040208@gmail.com>

Yes, good idea, Chris.

Actually, thinking about it, most of these warnings were redundant. So, 
I changed the behaviour of Bio::PrimarySeq::validate_seq() so that it 
issues exceptions if requested.

Florent
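
As a rough illustration of what "issues exceptions if requested" can look
like, here is a minimal, self-contained sketch. validate_string() and its
arguments are made up for this example and are not the code in the seqlength
branch; the only point is that validation quietly returns false by default
and dies when the caller passes a true third argument.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (NOT the BioPerl method): returns 1 if $seq contains
# only characters listed in $allowed, 0 otherwise. When $throw is true, it
# dies with the offending characters instead of returning 0.
# The comparison is case-sensitive.
sub validate_string {
    my ($seq, $allowed, $throw) = @_;

    # Collect every run of characters that is not in the allowed set.
    my @mismatches = $seq =~ /([^\Q$allowed\E]+)/g;
    return 1 unless @mismatches;

    die "sequence doesn't validate, mismatch is "
        . join(',', @mismatches) . "\n" if $throw;
    return 0;
}

print validate_string('ACGT',  'ACGTN'), "\n";   # valid string
print validate_string('AC!GT', 'ACGTN'), "\n";   # invalid, no exception

# Same invalid string, but this time the caller asks for an exception.
eval { validate_string('AC!GT', 'ACGTN', 1) };
print "caught: $@" if $@;
```

With this pattern a test script can check the return value silently, while
code that must not continue with a bad sequence can ask for the exception.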


On 05/11/12 12:43, Fields, Christopher J wrote:
> Florent,
>
> Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):
>
> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t
> t/Seq/PrimarySeq.t .. 1/167
> --------------------- WARNING ---------------------
> MSG: Got a sequence without letters. Could not guess alphabet
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
> ---------------------------------------------------
> t/Seq/PrimarySeq.t .. ok
> All tests successful.
> Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
> Result: PASS
>
> chris
>
> On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
>   wrote:
>
>> I am planning on merging the branch with master this week.
>> Best,
>> Florent
>>
>>
>> On 01/11/12 15:49, Florent Angly wrote:
>>> Hi all,
>>>
>>> I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on Bioperl. Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were microbial genomes, i.e. several Mbp long sequences, which is quite long. A lot of the performance cost was associated with passing full genomes between functions. For example, when doing a call to length(), length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So, every call to length is very expensive for long sequences. And there is a lot of code that calls length(), for error checking.
>>>
>>> I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g. Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength
>>>
>>> I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to improve the documentation. In addition, to address the speed issue, I did some changes to Bio::PrimarySeq and Bio::PrimarySeqI :
>>> * The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no extra call to seq() (which would incur the cost of copying the entire sequence between functions).
>>> * The length is saved as an object attribute. So, calling length() is very cheap since it only needs to retrieve the stored value for the length.
>>> * There is a constructor called -direct, which skips sequence validation. However, it was only active in conjunction with the -ref_to_seq constructor. To make -direct conform better to its documented purpose, I made -direct work when a sequence is set through -seq as well.
>>> * This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq object from an existing (already validated!) Bio::PrimarySeq object, the new object can be constructed with the -direct constructor, to save some time.
>>> * Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and these calls were easily replaced by simple calls to substr() to save some time.
>>>
>>> A real-world test I performed with Grinder took 3m28s before the changes (and ~1 min is spent doing something unrelated). After the changes, the same test took only 2min28s. So, it's quite a significant improvement and on more specific test cases, performance gains can obviously be much bigger. Also, I anticipate that the gains would be bigger for even longer sequences.
>>>
>>> All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of Bioperl, so please, have a look at it and let me know if there are any glaring errors.
>>>
>>> Thanks,
>>>
>>> Florent
>>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
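
The length caching and substr() changes described in the quoted message above
can be sketched roughly as follows. ToySeq is a hypothetical toy class written
for this illustration, not Bio::PrimarySeq; it only shows the length being
stored at the moment the sequence is set, and a 1-based, inclusive subseq()
built on substr() rather than eval().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical toy class, not Bio::PrimarySeq.
package ToySeq;

sub new {
    my ($class, %args) = @_;
    my $self = bless { seq => '', length => 0 }, $class;
    $self->seq($args{seq}) if defined $args{seq};
    return $self;
}

# Getter/setter: the length is computed once, when the sequence is set.
sub seq {
    my ($self, $value) = @_;
    if (defined $value) {
        $self->{seq}    = $value;
        $self->{length} = CORE::length($value);  # builtin, not the method below
    }
    return $self->{seq};
}

# Returns the stored length; the sequence string is not touched here.
sub length {
    my ($self) = @_;
    return $self->{length};
}

# 1-based, inclusive coordinates, extracted with a plain substr() call.
sub subseq {
    my ($self, $start, $end) = @_;
    die "Bad coordinates: $start..$end\n"
        if $start < 1 || $end < $start || $end > $self->{length};
    return substr($self->{seq}, $start - 1, $end - $start + 1);
}

package main;

my $s = ToySeq->new(seq => 'ACGTACGTAC');
print $s->length, "\n";          # 10
print $s->subseq(3, 6), "\n";    # GTAC
```

The trade-off is that every code path that modifies the sequence has to go
through the setter, otherwise the stored length goes stale.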


From shlomif at shlomifish.org  Tue Nov  6 12:27:00 2012
From: shlomif at shlomifish.org (Shlomi Fish)
Date: Tue, 6 Nov 2012 14:27:00 +0200
Subject: [Bioperl-l] [Request] Please Help Add Some Information about
 Perl for Bio-Informatics to http://perl-begin.org/uses/bio-info/
In-Reply-To: <20121026192203.6d1e59c0@lap.shlomifish.org>
References: <20121026192203.6d1e59c0@lap.shlomifish.org>
Message-ID: <20121106142700.192f456e@lap.shlomifish.org>

Hi,

Can anyone help with that?

Regards,

	Shlomi Fish

On Fri, 26 Oct 2012 19:22:03 +0200
Shlomi Fish <shlomif at shlomifish.org> wrote:

> Hi all,
> 
> I am the maintainer of http://perl-begin.org/ , the Perl Beginners' Site. I
> had this page there for a long time, but it's empty:
> 
> http://perl-begin.org/uses/bio-info/
> 
> Can someone help me add some information there? A short XHTML page will be OK.
> For reference, see the other pages in the section
> ( http://perl-begin.org/uses/ ) such as:
> 
> * http://perl-begin.org/uses/web/
> 
> * http://perl-begin.org/uses/sys-admin/
> 
> * http://perl-begin.org/uses/qa/
> 
> Note that you agree that the content will be licensed under the Creative
> Commons Attribution 3.0 Unported License (or higher versions) and so you
> should make sure it is original.
> 
> I shall be obliged for any help.
> 
> Regards,
> 
> 	Shlomi Fish
> 


-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
Perl Humour - http://perl-begin.org/humour/

A wiseman can learn from a fool much more than a fool can ever learn from a
wiseman.               -- http://en.wikiquote.org/wiki/Cato_the_Elder

Please reply to list if it's a mailing list post - http://shlom.in/reply .


From florent.angly at gmail.com  Thu Nov 15 16:29:30 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Fri, 16 Nov 2012 02:29:30 +1000
Subject: [Bioperl-l] Bio::PrimarySeq speedup
In-Reply-To: <5098EF50.5040208@gmail.com>
References: <50920D59.4010307@gmail.com> <50970C74.7070605@gmail.com>
	<118F034CF4C3EF48A96F86CE585B94BF3FC823F5@CHIMBX5.ad.uillinois.edu>
	<5098EF50.5040208@gmail.com>
Message-ID: <50A5186A.4060304@gmail.com>

I now merged the branch with master.
Best,
Florent



From mahakadry at aucegypt.edu  Tue Nov 20 18:44:53 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Tue, 20 Nov 2012 20:44:53 +0200
Subject: [Bioperl-l] Parsing a blast report with multiple queries into
 separate one query files that only contain the fasta sequences
Message-ID: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>

Dear BioPerl list,
I blasted a file containing several FASTA queries against nr. For further
computational analysis I need to align each query with its hits, so I need
to split the resulting BLAST report into several files, each containing only
one query sequence and its hits in FASTA format.
I found this script online:

use Bio::Search::Result::BlastResult;
use Bio::SearchIO;

my $report = Bio::SearchIO->new( -file => 'full-report.bls', -format => 'blast' );
my $result = $report->next_result;
my %hits_by_query;
while (my $hit = $result->next_hit) {
  push @{$hits_by_query{$hit->name}}, $hit;
}

foreach my $qid ( keys %hits_by_query ) {
  my $result = Bio::Search::Result::BlastResult->new();
  $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
  my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format => 'blast' );
  $blio->write_result($result);
}


However, running it produced the following error message:


BlastResult::new(): Not adding iterations.

------------- EXCEPTION: Bio::Root::NoSuchThing -------------
MSG: No such iteration number: 0. Valid range=1-0
VALUE: The number zero (0)
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.0/Bio/Root/Root.pm:472
STACK: Bio::Search::Result::BlastResult::iteration
/usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:327
STACK: Bio::Search::Result::BlastResult::add_hit
/usr/local/share/perl/5.10.0/Bio/Search/Result/BlastResult.pm:257
STACK: ./parsing.blast.results.into.per.query.files.pl:15

I searched for other scripts but couldn't find any.
I would really appreciate your comments on this.
Thank you


From cjfields at illinois.edu  Tue Nov 20 19:21:25 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Nov 2012 19:21:25 +0000
Subject: [Bioperl-l] Parsing a blast report with multiple queries into
 separate one query files that only contain the fasta sequences
In-Reply-To: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>
References: <CAE=MQgz9BY7-nUOePoqVZS6FD0ME76yXyTKmODWFhc+TjBGdrg@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF22E15@CITESMBX5.ad.uillinois.edu>

Maha,

Do you need only the sequence reported in the report (e.g. the HSP alignments) or the original FASTA sequences?  

The former can be recovered from the Bio::Search::HSP::GenericHSP objects as an alignment, and this can be redirected to a FASTA file.  The latter is a little trickier, as you will have to retrieve the sequences from their original source files.  

chris
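
For the first option, here is a minimal sketch, assuming a text BLAST report
that Bio::SearchIO can read. The helper names safe_filename() and
write_hsp_fasta() and the '.hsp.fasta' suffix are invented for this example.
It loops over every result in the report (one per query) and writes the
aligned query and hit strings of each HSP to a per-query FASTA file. Note
that these are the aligned regions, gaps included, not the full-length
database sequences.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a query name into something safe to use as a file name
# (hypothetical helper for this example).
sub safe_filename {
    my ($name) = @_;
    (my $safe = $name) =~ s/[^A-Za-z0-9._-]/_/g;
    return $safe;
}

# Write one FASTA file per query, holding the aligned query/hit pair of
# every HSP reported for that query. A query without hits yields an empty file.
sub write_hsp_fasta {
    my ($report_file) = @_;

    # Loaded here so that safe_filename() can be used without BioPerl.
    require Bio::SearchIO;
    require Bio::AlignIO;

    my $in = Bio::SearchIO->new(-file => $report_file, -format => 'blast');

    # next_result() returns one result object per query in the report.
    while (my $result = $in->next_result) {
        my $out_file = safe_filename($result->query_name) . '.hsp.fasta';
        my $out = Bio::AlignIO->new(-file => ">$out_file", -format => 'fasta');

        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                # get_aln() returns the HSP as a Bio::SimpleAlign object.
                $out->write_aln($hsp->get_aln);
            }
        }
    }
    return;
}

# Usage: perl split_hsps.pl full-report.bls
write_hsp_fasta($ARGV[0]) if @ARGV;
```

Getting the original, full-length FASTA sequences instead would still require
fetching them from the source database, as noted above.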



From assayagy at gmail.com  Thu Nov  1 00:02:41 2012
From: assayagy at gmail.com (eyla4ever)
Date: Thu, 01 Nov 2012 00:02:41 -0000
Subject: [Bioperl-l]  handle with file in perl
Message-ID: <34626730.post@talk.nabble.com>


Hi,

I want to write a function that takes file_name, hsp and hit as parameters
and prints all the BLAST fields that I need to that file.

I am doing this because I have 2 files with the same fields.

Thanks
-- 
View this message in context: http://old.nabble.com/handle-with-file-in-perl-tp34626730p34626730.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From rfhorns at gmail.com  Fri Nov  2 00:01:34 2012
From: rfhorns at gmail.com (Felix Horns)
Date: Fri, 02 Nov 2012 00:01:34 -0000
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
Message-ID: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>

Hello everyone.

I am having trouble using the get_Stream_by_query() function
in Bio::DB::GenBank.  It seems to return an empty stream, such that
$stream->next_seq never returns anything.

However, $query->count is returning the expected value (139).  Also,
get_Stream_by_query() seems to be querying the database, as when I pass it
an array of GeneIDs that have not been properly formatted, i.e.
GeneID:7816864, instead of simply 7816864, it returns warnings and errors:
"MSG: Warning(s) from GenBank: <PhraseNotFound>GeneID 7817709...; MSG:
Error from Genbank: No items found.".

I have included my full code below. I have also included the output from
the code below that.  The code is intended to find genes located within a
genomic region. I will later find the protein domains and pathways that
those genes are involved in.

Any help would be greatly appreciated.  I realize that this is probably a
very simple question, but I am relatively new to BioPerl and I've spent the
better part of the day trying to figure out such issues, so I would be very
thankful for help.

Felix


#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::DB::EntrezGene;
use Bio::DB::GenBank;

# Load reference sequence
# Load from local .gb file
# Note that .gb file does not include sequences
# my $gbfile = "NC_012660.1.gb";
# my $seqio = Bio::SeqIO->new(-file => $gbfile);
# my $ref_seq = $seqio->next_seq;

# To access reference sequence programmatically, uncomment this code
my $gb = Bio::DB::GenBank->new;
my $ref_seq = $gb->get_Seq_by_acc("NC_012660.1");

# Specify coordinates of gap
my $gap_start = 2050506;
my $gap_end = 2190530;

my $gene_count = 0;
my @features;
my @starts;
my @ends;
my @db_xrefs;

my @products;
my @protein_ids;

# Get gene features in gap
for my $feat ($ref_seq->get_SeqFeatures) {
  my $start=$feat->location->start;
  my $end=$feat->location->end;

  # Logical && (short-circuiting) rather than bitwise &
  if (($feat->primary_tag eq 'gene') &&
      ($gap_start < $start) && ($start < $gap_end) &&
      ($gap_start < $end) && ($end < $gap_end)) {

    $gene_count += 1;

    # Get GeneID reference
    my $db_xref = ($feat->get_tag_values('db_xref'))[0];
    $db_xref =~ s/GeneID://;    # Trim "GeneID:" from start of $db_xref

    push @features, $feat;
    push @starts, $start;
    push @ends, $end;
    push @db_xrefs, $db_xref;
  }
}

# Get data about gene features from GeneID reference
my $query = Bio::DB::Query::GenBank->new(-db => 'gene',
 -ids => [@db_xrefs]);
my $stream = $gb->get_Stream_by_query($query);

while (my $seq = $stream->next_seq) {
  for my $feat ($seq->all_SeqFeatures) {
    print "primary tag: ", $feat->primary_tag, "\n";
    for my $tag ($feat->get_all_tags) {
      print "  tag: ", $tag, "\n";
      for my $value ($feat->get_tag_values($tag)) {
        print "    value: ", $value, "\n";
      }
    }
  }
}

print $query->count,"\n";
print $gene_count, "\n";


OUTPUT
> perl analyze_gap.pl
139
139

Note that no "primary tag; tag; value" items are printed.  Furthermore,
when I put a print line immediately after the (while (my $seq =
$stream->next_seq)) statement, it was never called, seemingly indicating
that the stream is empty.


From mooldhu at gmail.com  Tue Nov  6 07:38:57 2012
From: mooldhu at gmail.com (胡江)
Date: Tue, 6 Nov 2012 15:38:57 +0800
Subject: [Bioperl-l] Ask for help about Bioperl
Message-ID: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>

Hi,
when I use BioPerl, it reports errors like this:

--------------------- WARNING ---------------------
MSG: sequence '1' doesn't validate, mismatch is )81,95,0(,::
---------------------------------------------------
Error providing evidence type: GeneModel
The error was:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Attempting to set the sequence '1' to
[)81hvf95x0(DSTD=qeSrytkiyP::oiV] which does not look healthy
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::seq /usr/share/perl5/Bio/PrimarySeq.pm:285
STACK: Bio::PrimarySeq::new /usr/share/perl5/Bio/PrimarySeq.pm:239
STACK: Bio::PrimarySeqI::revcom /usr/share/perl5/Bio/PrimarySeqI.pm:383


But I am sure that the input file only contains [ATGCN]. I also tried
other sequences, but the errors are the same. My BioPerl is bioperl-live
1.006902.

-- 
胡江


From assayagy at gmail.com  Sat Nov 10 18:27:03 2012
From: assayagy at gmail.com (eyla4ever)
Date: Sat, 10 Nov 2012 10:27:03 -0800 (PST)
Subject: [Bioperl-l] Extracting sequences from Genbank files
In-Reply-To: <CAJLmuDKPBA_DUtnfQjcAXr+JU=nL+orBP4cCgwBi=4BWQiYmpw@mail.gmail.com>
References: <CAJLmuDKPBA_DUtnfQjcAXr+JU=nL+orBP4cCgwBi=4BWQiYmpw@mail.gmail.com>
Message-ID: <34664632.post@talk.nabble.com>


Hello Brian,

I would like you to send me your script. I think it can help me solve a
big problem and finish my final project.
I hope that will be possible.

Regards, Eyla


BForde wrote:
> 
> Hello,
> 
> I have been modifying a script which extracts all the protein sequences
> from a genbank file and saves them in a multi-fasta file.
> 
> I wish the fasta header to have both the locus_tag of the protein and the
> product. However I cannot get the  product tag to write to the fasta
> header
> 
> this is the relevant section of the script
> 
>  $s->display_id($f->has_tag('locus_tag') ? join(',', sort $f->each_tag_value('locus_tag')) :
>                 $f->has_tag('product')   ? join(',', $f->each_tag_value('product')) :
>                 $s->display_id);
> 
> Is "product" not an actual tag?
> 
> regards
> 
> Brian
> 
> 
> 
> -- 
> Brian Forde
> Microbiology Dept.
> Bioscience Institute. Room 4.11
> University College Cork
> Cork
> Ireland
> tel:+353 21 4901306
> email: b.m.forde at umail.ucc.ie
> 
> 
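
A note on the quoted snippet: the nested ?: can only ever pick one of the
two tags, so whenever locus_tag is present the product is never looked at.
Below is a minimal sketch of one way to get both into the header. It
assumes $f is a Bio::SeqFeatureI-style object (anything that provides
has_tag() and get_tag_values()) and $s is the sequence object about to be
written. The helper name header_from_feature is made up for illustration,
and putting the product into desc() rather than display_id() is a choice
of this sketch, not something taken from the original script.

```perl
use strict;
use warnings;

# Build (display_id, description) for one feature.
# $feat is assumed to offer has_tag() and get_tag_values(), as
# Bio::SeqFeatureI objects do; $fallback_id is used when the feature
# has no locus_tag.
sub header_from_feature {
    my ($feat, $fallback_id) = @_;

    my $id = $fallback_id;
    if ($feat->has_tag('locus_tag')) {
        my @tags = $feat->get_tag_values('locus_tag');
        $id = join(',', sort @tags);
    }

    my $desc = '';
    if ($feat->has_tag('product')) {
        my @products = $feat->get_tag_values('product');
        $desc = join(',', @products);
    }

    return ($id, $desc);
}

# Hypothetical usage inside the feature loop:
#   my ($id, $desc) = header_from_feature($f, $s->display_id);
#   $s->display_id($id);
#   $s->desc($desc);    # the FASTA writer prints ">id desc"
```

Keeping the two tags in separate fields means the ID stays free of spaces,
which tends to matter for downstream tools that split the header on the
first blank.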

-- 
View this message in context: http://old.nabble.com/Extracting-sequences-from-Genbank-files-tp33901023p34664632.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From bosborne11 at verizon.net  Tue Nov 20 23:50:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:50:00 -0500
Subject: [Bioperl-l] get_Stream_by_query() appears to return empty stream
In-Reply-To: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>
References: <CANpYnv_dSVMM85iNVn8gnktNyOuKmevsWifyMFm9BY_QboJT8w@mail.gmail.com>
Message-ID: <5F077DEA-DEBD-42BC-87E7-327697764CFE@verizon.net>

Felix,

I took a look at the Bio::DB::Query::GenBank documentation, which says this:

If you provide an array reference of IDs in -ids, the query will be ignored and the list of IDs will be used when the query is passed to a Bio::DB::GenBank object's get_Stream_by_query() method. 

Bio::DB::GenBank queries "nucleotide" by default. You have GeneIDs. I see that you're setting "-db" to "gene", but note that you're passing that query to a plain Bio::DB::GenBank object. Not sure what the expected behavior is here.

I would try using the NCBI Eutilities for that second query, rather than Bio::DB::Query::GenBank (http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook).
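
A rough sketch of what that could look like, following the esummary
pattern from the cookbook. Treat it as a starting point rather than a
tested recipe: the sub names (gene_id_from_xrefs, print_gene_summaries)
and the email address are placeholders invented here, and whether
esummary (as opposed to efetch) returns the fields you need is an
assumption you would have to check. The first helper also picks the
GeneID out of all db_xref values instead of assuming it is the first one.

```perl
use strict;
use warnings;

# Return the numeric GeneID from a list of db_xref values
# (e.g. "GI:123", "GeneID:7816864"), or undef if none is a GeneID.
sub gene_id_from_xrefs {
    my @xrefs = @_;
    for my $xref (@xrefs) {
        return $1 if $xref =~ /^GeneID:(\d+)$/;
    }
    return undef;
}

# Print Entrez Gene summaries for a list of GeneIDs.
# Needs network access and Bio::DB::EUtilities, which is loaded only
# when this sub is actually called.
sub print_gene_summaries {
    my ($email, @gene_ids) = @_;
    require Bio::DB::EUtilities;

    my $factory = Bio::DB::EUtilities->new(
        -eutil => 'esummary',
        -db    => 'gene',
        -email => $email,
        -id    => \@gene_ids,
    );

    while (my $docsum = $factory->next_DocSum) {
        print "GeneID: ", $docsum->get_id, "\n";
        while (my $item = $docsum->next_Item('flattened')) {
            my $content = $item->get_content;
            next unless defined $content && length $content;  # not all items have content
            printf "  %-20s %s\n", $item->get_name, $content;
        }
    }
    return;
}

# Hypothetical usage with the IDs collected in the feature loop:
#   my $id = gene_id_from_xrefs($feat->get_tag_values('db_xref'));
#   push @db_xrefs, $id if defined $id;
#   ...
#   print_gene_summaries('you@example.org', @db_xrefs);
```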

Brian O.



From bosborne11 at verizon.net  Tue Nov 20 23:52:00 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 20 Nov 2012 18:52:00 -0500
Subject: [Bioperl-l] Ask for help about Bioperl
In-Reply-To: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>
References: <CACdwZFTvgwGzaa2wTfinu9=JKYAXL_O0kQ-Osn7FGcsDx33+Ng@mail.gmail.com>
Message-ID: <3B472C83-7F6C-41E2-A629-E6A2BDC6B075@verizon.net>

胡江,

You're going to have to show us your code; we can't help you just by seeing the error messages. Show us the input file as well, or the beginning of it.
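
One thing that may be worth checking in the meantime: the rejected string
in the exception, )81hvf95x0(DSTD=qeSrytkiyP::oiV, turns out to be the
reverse complement of the text "Bio::PrimarySeq=HASH(0x59fbd18)". That is
what a Perl object looks like when it is used as a string, so it seems
likely that somewhere a sequence object, rather than its sequence string
(e.g. $obj->seq), ended up being stored as the sequence before revcom()
was called. The plain-Perl sketch below reproduces this. The tr/// table
is an IUPAC complement mapping written out for this sketch and is only
assumed to mirror what revcom() does for DNA; the sub names are made up
for illustration.

```perl
use strict;
use warnings;

# Reverse-complement a string with an IUPAC DNA complement table.
# Characters outside the table (digits, punctuation, ...) are left as is.
sub revcom_string {
    my ($s) = @_;
    my $rc = scalar reverse $s;
    $rc =~ tr/acgtrymkswhbvdnxACGTRYMKSWHBVDNX/tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX/;
    return $rc;
}

# True if a string looks like a stringified reference or object,
# e.g. "HASH(0x1f3a)" or "Some::Class=HASH(0x1f3a)".
sub looks_like_stringified_ref {
    my ($s) = @_;
    return $s =~ /^(?:[A-Za-z_][\w:]*=)?(?:HASH|ARRAY|SCALAR|CODE|GLOB)\(0x[0-9a-f]+\)$/ ? 1 : 0;
}

my $rejected = ')81hvf95x0(DSTD=qeSrytkiyP::oiV';
print revcom_string($rejected), "\n";   # prints Bio::PrimarySeq=HASH(0x59fbd18)
```

A check like looks_like_stringified_ref() on whatever is passed as -seq
could be a quick way to find the call that hands over the object.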

Brian O.




From hlapp at drycafe.net  Wed Nov 21 02:24:50 2012
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Tue, 20 Nov 2012 21:24:50 -0500
Subject: [Bioperl-l] handle with file in perl
In-Reply-To: <34626730.post@talk.nabble.com>
References: <34626730.post@talk.nabble.com>
Message-ID: <1DE09B34-5124-478C-8925-0045EC119CFC@drycafe.net>

This sounds like a homework assignment. We're not here to do your homework or assignments for you. You can post if you run into a specific problem when solving your assignment with Bioperl, and we'll help with that. 

-hilmar



From mahakadry at aucegypt.edu  Sat Nov 24 01:33:59 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 03:33:59 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <CAE=MQgzV7Z-bQxiQFDuHHaEr=XL5=cTc6Mhq1Su7VPByZNEP6Q@mail.gmail.com>

Dear Bioperl list,
I have a folder that has 60,000 files (one file for each phylogenetic tree).
However, I only need to work with a subset of 1,000 files from that folder
(the files are not numbered in order, so I can't use an i++ loop in my
BioPerl script).
Is there a way to write a script that only moves the files whose names are
given in a list in a text file?
I.e., I have a file that has the names of the files I want to copy from the
folder, and I want to write a script that does this.
Thank you so much


From kellert at ohsu.edu  Sat Nov 24 18:08:11 2012
From: kellert at ohsu.edu (Tom Keller)
Date: Sat, 24 Nov 2012 10:08:11 -0800
Subject: [Bioperl-l] use cookbook to work with a directory of files
In-Reply-To: <mailman.7.1353776405.32614.bioperl-l@lists.open-bio.org>
References: <mailman.7.1353776405.32614.bioperl-l@lists.open-bio.org>
Message-ID: <C969FE0E-18FE-4771-B031-22EEA42AEA77@ohsu.edu>

A search with the phrase "perl cookbook filenames from directory" should help you find what you need.



From minou.nowrousian at rub.de  Sat Nov 24 18:24:02 2012
From: minou.nowrousian at rub.de (Minou Nowrousian)
Date: 24 Nov 2012 19:24:02 +0100
Subject: [Bioperl-l] retrieving a subset of files from a folder
Message-ID: <000001cdca70$e1a97720$a4fc6560$@rub.de>


> Dear Bioperl list,
> I have a folder that has 60,000 files (one file for each phylogenetic tree).
> However, I only need to work with a subset of 1,000 files from that folder
> (the files are not numbered in order, so I can't use an i++ loop in my
> BioPerl script). Is there a way to write a script that only moves the files
> whose names are given in a list in a text file? I.e., I have a file that has
> the names of the files I want to copy from the folder, and I want to write a
> script that does this. Thank you so much

I don't know if there is a BioPerl solution, but you could use the
File::Copy module (a core module that ships with Perl):

use File::Copy;
copy("path_to_file_you_want_to_copy", "path_to_target_file") or die "Copy failed: $!";
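
To apply that to a whole list of names, something along these lines might
do. It is only a sketch: it assumes the text file holds one file name per
line, and the sub name copy_listed_files as well as the paths in the usage
comment are made up for illustration.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Spec;

# Copy every file named in $list_file (one name per line) from $src_dir
# to $dest_dir. Returns the number of files copied and a reference to
# the list of names that were not found in $src_dir.
sub copy_listed_files {
    my ($list_file, $src_dir, $dest_dir) = @_;

    open my $in, '<', $list_file or die "Cannot open $list_file: $!";
    my $copied = 0;
    my @missing;
    while (my $name = <$in>) {
        $name =~ s/^\s+|\s+$//g;     # trim whitespace and line endings
        next unless length $name;    # skip blank lines

        my $src = File::Spec->catfile($src_dir, $name);
        if (!-f $src) {
            push @missing, $name;
            next;
        }
        copy($src, File::Spec->catfile($dest_dir, $name))
            or die "Copy failed for $name: $!";
        $copied++;
    }
    close $in;
    return ($copied, \@missing);
}

# Hypothetical usage:
#   my ($n, $missing) = copy_listed_files('wanted.txt', 'all_trees', 'subset');
#   print "Copied $n files\n";
#   print "Not found: $_\n" for @$missing;
```

Swapping copy for File::Copy's move would move the files instead of
copying them.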

Regards,
Minou


From mahakadry at aucegypt.edu  Sat Nov 24 19:04:09 2012
From: mahakadry at aucegypt.edu (maha ahmed)
Date: Sat, 24 Nov 2012 21:04:09 +0200
Subject: [Bioperl-l] retrieving a subset of files from a folder
In-Reply-To: <000001cdca70$e1a97720$a4fc6560$@rub.de>
References: <000001cdca70$e1a97720$a4fc6560$@rub.de>
Message-ID: <CAE=MQgztf_isVyt=WPF9LMXCtX4Q2U9vHL1AV+TwpueUjKayuw@mail.gmail.com>

Thanks everyone, I actually found a one-line command that I am going to
try:
xargs -a file_list.txt mv -t /path/to/des
Thanks for your help. I will have a look at the readings you suggested.
Thank you



From maj at fortinbras.us  Tue Nov 27 13:49:46 2012
From: maj at fortinbras.us (Mark A. Jensen)
Date: Tue, 27 Nov 2012 13:49:46 +0000
Subject: [Bioperl-l] Neo4j : applying user defined validation and constraints
Message-ID: <W2391426705276111354024186@webmail57>

Hi Folks,
Since there was some enthusiasm about REST::Neo4p, my interface to Neo4j, I thought I would let you know about
https://metacpan.org/module/REST::Neo4p::Constrain
This is a framework that lets you apply constraints on node and relationship properties, relationships, and relationship types. You can specify your constraints, and have REST::Neo4p throw exceptions when the constraints aren't met, or you can do validation on existing database items. The pod has a full explanation and examples aplenty.

Please have a look and send bugs my way via RT.
Cheers all,
MAJ


From francescomusacchia at gmail.com  Wed Nov 28 10:27:16 2012
From: francescomusacchia at gmail.com (Francesco Musacchia)
Date: Wed, 28 Nov 2012 02:27:16 -0800 (PST)
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
Message-ID: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>

Hi all,
I have a big problem using a GFF3 database with BioPerl. This is not a
question about how to write some BioPerl code. I find that when I have to
do a lot of accesses on a GFF database (with Bio::DB::SeqFeature::Store),
things get slower and slower, to the point that my script can run for
more than a day.

How can I solve this? Or can it not be done?

Thanks a lot!