From florent.angly at gmail.com  Sun Dec  2 21:36:28 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 03 Dec 2012 12:36:28 +1000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
Message-ID: <50BC102C.7080902@gmail.com>

Hi all,

This is in response to Carson Holt's report that Bio::DB::Fasta does not 
play well with threads: https://redmine.open-bio.org/issues/3397

The first issue is the serialization of Bio::DB::IndexedBase-inheriting 
(e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for 
threading (for example when using Thread::Queue::Any). I implemented 
hooks that make it transparent to serialize using Storable freeze() and 
thaw().
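
A minimal sketch of what such Storable hooks can look like, assuming a hypothetical class My::IndexedDB that stands in for a Bio::DB::IndexedBase subclass. The class name, its fields, and the _connect helper are illustrative assumptions, not the actual BioPerl code; only STORABLE_freeze and STORABLE_thaw are real Storable hook names.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

package My::IndexedDB;   # hypothetical stand-in, not a BioPerl class

sub new {
    my ($class, %args) = @_;
    my $self = bless { path => $args{path} }, $class;
    $self->_connect;
    return $self;
}

# Pretend to open the index; a real class would tie a DBM file here.
sub _connect {
    my ($self) = @_;
    $self->{handle} = "connected:$self->{path}";   # placeholder for a live handle
    return $self;
}

# Called by Storable when freezing: keep only what is needed to reconnect.
sub STORABLE_freeze {
    my ($self, $cloning) = @_;
    return $self->{path};
}

# Called by Storable when thawing: restore the path, then reconnect.
sub STORABLE_thaw {
    my ($self, $cloning, $serialized) = @_;
    $self->{path} = $serialized;
    $self->_connect;
    return;
}

package main;

my $db   = My::IndexedDB->new(path => 'example.fasta');
my $copy = thaw(freeze({ nested => $db }))->{nested};   # hooks also fire when nested
print ref($copy), " reconnected to $copy->{path}\n";
```

The live handle is never serialized; only the path travels, and the thawed copy opens its own connection.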

Another issue was the lack of communication between different 
Bio::DB::IndexedBase instances, which means that an instance could 
easily be writing or deleting the database that another instance is 
working on. To fix this, I needed some form of locking.

Some Bio::DB::IndexedBase database backends (DB_File) have some support 
for locking but Bio::DB::IndexedBase also supports other database 
backends for which there is no native locking mechanism. So, I had to 
come up with a more general solution: a lock file. I noticed that 
Bio::DB::SeqFeature::Store::berkeleydb has a locking mechanism, based on 
flock(), which means that it does not work with NFS-mounted filesystems. 
All the Bioperl-based scripts I (and most likely many others) write run 
on servers that use NFS, so this support is important. I have found only 
one way to do the NFS locking safely, using File::SharedNFSLock. It has 
a few downsides though:
     1/ it is an external dependency,
     2/ it does not work on FAT filesystems (which should be mostly 
restricted to USB sticks nowadays), where the lock is never acquired, and
     3/ at the moment, it requires a patch to work in threaded context 
(https://rt.cpan.org/Public/Bug/Display.html?id=81597)
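
For readers unfamiliar with lock files, here is a minimal sketch of one general technique that avoids flock(): taking the lock by hard-linking a unique file to the lock name. It is an illustration only; the function names and the lock file name are hypothetical, and nothing here is taken from File::SharedNFSLock or from the BioPerl branch. Note that link() fails on filesystems without hard links (FAT, for example), so a lock of this kind would never be acquired there.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Sys::Hostname qw(hostname);

# Try to take the lock: create a unique file, then hard-link it to the lock name.
# link() either fully succeeds or fails, so only one contender can win.
sub acquire_lock {
    my ($lock_file) = @_;
    my $unique = join '.', $lock_file, hostname(), $$;
    open my $fh, '>', $unique or die "Cannot create $unique: $!";
    close $fh;
    my $got_it = link($unique, $lock_file);
    unlink($unique);   # if we won, the lock name still refers to the file
    return $got_it ? 1 : 0;
}

# Give the lock back by removing the lock name.
sub release_lock {
    my ($lock_file) = @_;
    return unlink($lock_file);
}

my $dir  = tempdir(CLEANUP => 1);
my $lock = "$dir/example.index.lock";   # hypothetical lock file name

my $first  = acquire_lock($lock);   # lock is free, so this succeeds
my $second = acquire_lock($lock);   # lock is held, so this fails
release_lock($lock);
my $third  = acquire_lock($lock);   # free again after the release
release_lock($lock);

print "first=$first second=$second third=$third\n";
```

A real implementation would also need retries, timeouts and stale-lock detection, which this sketch leaves out.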

Note that while I have now added basic support for threads in 
Bio::DB::IndexedBase, I still get segfaults in specific cases, 
for example when returning a database or sequence object. This might be 
related to this issue: 
https://rt.perl.org/rt3/Ticket/Display.html?id=115972. Beyond this, the 
new code seems to work nicely. See the branch 
https://github.com/bioperl/bioperl-live/tree/storable_db if you want to 
test yourself. For example, one can now run multiple threads, each of 
them creating a Bio::DB::Fasta database from the same FASTA file: the 
first thread performs the indexing while the others wait nicely for the 
indexing to be finished to query the database.

Comments welcome. Regards,

Florent

From l.m.timmermans at students.uu.nl  Mon Dec  3 19:29:59 2012
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 4 Dec 2012 01:29:59 +0100
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <50BC102C.7080902@gmail.com>
References: <50BC102C.7080902@gmail.com>
Message-ID: <CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>

On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com> wrote:
> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
> threading (for example when using Thread::Queue::Any). I implemented hooks
> that make it transparent to serialize using Storable freeze() and thaw().

I don't think serializing a magical thingie makes much sense. Storable
is commonly used for a lot more things than interthread communication
(e.g. network communication), this would often not work under such
circumstances.

Leon

From cjfields at illinois.edu  Mon Dec  3 22:23:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 03:23:50 +0000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>
References: <50BC102C.7080902@gmail.com>
	<CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>

On Dec 3, 2012, at 6:29 PM, Leon Timmermans <l.m.timmermans at students.uu.nl> wrote:

> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com> wrote:
>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>> threading (for example when using Thread::Queue::Any). I implemented hooks
>> that make it transparent to serialize using Storable freeze() and thaw().
> 
> I don't think serializing a magical thingie makes much sense. Storable
> is commonly used for a lot more things than interthread communication
> (e.g. network communication), this would often not work under such
> circumstances.
> 
> Leon

Leon, any suggestions on alternatives?  I know this particular bit is a sore spot with MAKER at the moment, so any help would be greatly appreciated.

chris


From yongli at yeslab.com  Sat Dec  1 01:10:15 2012
From: yongli at yeslab.com (yongli at yeslab.com)
Date: Sat, 1 Dec 2012 14:10:15 +0800 (CST)
Subject: [Bioperl-l] question about bioperl program
Message-ID: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>

Dear Sir or Madam,

 
 I have just begun learning BioPerl and am writing a BioPerl program to extract the gene sequences of a bacterium from its genome GenBank file. I downloaded the NC_003450.gbk, ppt, and ffn files; NC_003450 is a bacterial genome GenBank file. My BioPerl code is as follows:

 
 use Bio::Seq;

  use Bio::SeqIO;

  
  $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank');

  # $seq_obj=$seqio_obj->next_seq;

  
  while($seq_obj=$seqio_obj->next_seq)

  {

    $display_name=$seq_obj->display_name;

    $desc=$seq_obj->desc;

    $seq=$seq_obj->seq;

  $acc = $seq_obj->accession_number;

  $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );

  $seqio_obj->write_seq($seq_obj);

  }

  
 After the program ran, I only got the complete genome sequence, not a file containing each gene sequence one by one. I want to get the gene sequences one by one in a FASTA file, so I am writing to you for help.

 
 Yong Li


From carsonhh at gmail.com  Mon Dec  3 22:35:50 2012
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 03 Dec 2012 22:35:50 -0500
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>
Message-ID: <CCE2D73A.16852%carsonhh@gmail.com>

Bio::DB::Fasta is working for maker now.  The previous issues have been
fixed, but being as Florent has gone out of his way to build a number of
improvements into Bio::DB::Fasta over the past few weeks, this seemed like
a useful one as well, so I suggested it.  One of the big uses of
Bio::DB::Fasta is the Bio::PrimarySeq::Fasta features it creates.  They
are great for manipulating the sequence without actually having to ever
keep it in memory.  It's nice because the sequence is made available on
demand, but when you try and pass them between threads, your program falls
apart. There are creative workarounds, but simply adding a serialization
hook to Bio::DB::Fasta to disconnect the database on freezing and then
reconnect on thaw also fixes it, and it makes them extremely useful for
multi-threaded applications without having to go through other kinds of
workarounds (it just makes them work as expected with serialization).
Previously I had created my own module and inherited from Bio::DB::Fasta
so I could implement the Storable hooks.  Because Storable looks for the
hooks in anything it serializes, the Bio::DB::Fasta object can even be
well down inside of a complex object and you don't have to worry about it.
Previously I've used Storable hooks to pass the Bio::PrimarySeq::Fasta
features across the network using MPI, as long as the database is on an
NFS mount it just reconnects on the other node with no issue.  If the
indexed file isn't available after deserialization over a network, you
could just throw an error when the thaw hook is called.  I'll give
Florent's changes a look over soon to give any suggestions.

Thanks,
Carson


On 12-12-03 10:23 PM, "Fields, Christopher J" <cjfields at illinois.edu>
wrote:

>On Dec 3, 2012, at 6:29 PM, Leon Timmermans
><l.m.timmermans at students.uu.nl> wrote:
>
>> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com>
>>wrote:
>>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>>> threading (for example when using Thread::Queue::Any). I implemented
>>>hooks
>>> that make it transparent to serialize using Storable freeze() and
>>>thaw().
>> 
>> I don't think serializing a magical thingie makes much sense. Storable
>> is commonly used for a lot more things than interthread communication
>> (e.g. network communication), this would often not work under such
>> circumstances.
>> 
>> Leon
>
>Leon, any suggestions on alternatives?  I know this particular bit is a
>sore spot with MAKER at the moment, so any help would be greatly
>appreciated.
>
>chris
>


From jason.r.gallant at gmail.com  Tue Dec  4 15:23:02 2012
From: jason.r.gallant at gmail.com (Jason Gallant)
Date: Tue, 4 Dec 2012 12:23:02 -0800 (PST)
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header
Message-ID: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>

Hello,

I'm trying to retrieve fasta sequences that contain a colon in their 
header.  However, I cannot get my BioPerl script to do this!!

It works as expected when the header does not contain the colon, however 
doesn't return anything when it does.  Weirdly, when I ask it to return the 
parsed IDs (see below), it returns the appropriate IDs, which include the 
colon!  Very confusing, would appreciate any help!!

Many Thanks,
Jason Gallant


use strict;
use Bio::SearchIO; 
use Bio::DB::Fasta;


my ($file,$id,$start,$end) = 
("secondround_merged_expanded.fasta","C7047455:0-100",1,10);


my $db = Bio::DB::Fasta->new($file, -reindex=>1);
my $seq = $db->seq($id,$start,$end);
 
print $db->ids;

print $seq,"\n";


From asjo at koldfront.dk  Tue Dec  4 15:53:08 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 21:53:08 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	(Francesco Musacchia's message of "Wed, 28 Nov 2012 02:27:16 -0800
	(PST)")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
Message-ID: <87y5hdletn.fsf@topper.koldfront.dk>

On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:

> I'm experiencing that when I have to do a lot of accesses on a GFF
> database (with Bio::DB::SeqFeature::Store) the slowness increases until
> my script can stay running for more than a day.

First you'll need to find out what/where exactly it is slow. One way to
do so is using a profiler; this is a good one for Perl:

 * https://metacpan.org/module/Devel::NYTProf

If you want more specific suggestions, you'll probably have to provide
more information.


  Good luck!

    Adam

-- 
 "As Knuth pointed out long ago, speed only matters           Adam Sjøgren
  in certain critical bottlenecks. And as many           asjo at koldfront.dk
  programmers have observed since, one is very often
  mistaken about where these bottlenecks are."


From cjfields at illinois.edu  Tue Dec  4 16:10:00 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 21:10:00 +0000
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <87y5hdletn.fsf@topper.koldfront.dk>
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	<87y5hdletn.fsf@topper.koldfront.dk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>


On Dec 4, 2012, at 2:53 PM, Adam Sjøgren <asjo at koldfront.dk>
 wrote:

> On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:
> 
>> I'm experiencing that when I have to do a lot of accesses on a GFF
>> database (with Bio::DB::SeqFeature::Store) the slowness increases until
>> my script can stay running for more than a day.
> 
> First you'll need to find out what/where exactly it is slow. One way to
> do so is using a profiler; this is a good one for Perl:
> 
> * https://metacpan.org/module/Devel::NYTProf
> 
> If you want more specific suggestions, you'll probably have to provide
> more information.
> 
> 
>  Good luck!
> 
>    Adam

If anything, we need more profiling of Bioperl code.  Ah, if we only had infinite time... :)

chris

From asjo at koldfront.dk  Tue Dec  4 16:33:55 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 22:33:55 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
	(Christopher J. Fields's message of "Tue, 4 Dec 2012 21:10:00 +0000")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	<87y5hdletn.fsf@topper.koldfront.dk>
	<118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
Message-ID: <87txs1jyd8.fsf@topper.koldfront.dk>

On Tue, 4 Dec 2012 21:10:00 +0000, Fields, wrote:

> If anything, we need more profiling of Bioperl code. Ah, if we only
> had infinite time... :)

If we had that, we wouldn't need profiling!


  ;-),

   Adam

-- 
 "On the quiet side. Somewhat peculiar. A good                Adam Sjøgren
  companion, in a weird sort of way."                    asjo at koldfront.dk


From florent.angly at gmail.com  Tue Dec  4 16:52:41 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Wed, 05 Dec 2012 07:52:41 +1000
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta
	Header
In-Reply-To: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>
References: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>
Message-ID: <50BE70A9.4060404@gmail.com>

Hi Jason,

See the documentation for seq() at 
http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/DB/Fasta.pm#OBJECT_METHODS.

When you call seq() with a single argument, e.g. 
$db->seq('C7047455:0-100'), Bio::DB::Fasta interprets it as a compound 
ID and looks for position 0 to 100 of a sequence called C7047455. This 
is a feature that has been in Bio::DB::Fasta since the dawn of time. In 
this form, seq() expects a colon as part of the compound ID, which is 
problematic because your sequence ID actually contains a colon.

I think that when you call $db->seq($id,$start,$end), Bio::DB::Fasta 
does not attempt to parse your ID. This is why your code works with this 
form. Note that if you want to get the entirety of a sequence called 
'C7047455:0-100', the easiest way when your sequence names contain a 
colon is to use $db->get_Seq_by_id('C7047455:0-100'), since 
get_Seq_by_id() only takes a regular ID (not a compound one).
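
To make the ambiguity concrete, here is a small illustration of how a compound ID of the form name:start-end can be split. The helper name and the regular expression are assumptions made for this example; they are not copied from Bio::DB::Fasta, whose actual parsing rules may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: split "name:start-end" into its three parts.
# The greedy (.+) means the LAST colon is treated as the separator.
sub parse_compound_id {
    my ($id) = @_;
    if ($id =~ /^(.+):(\d+)-(\d+)$/) {
        return ($1, $2, $3);
    }
    return ($id, undef, undef);   # no range suffix: treat as a plain ID
}

my @parts = parse_compound_id('C7047455:0-100');
print "name=$parts[0] start=$parts[1] end=$parts[2]\n";

my @plain = parse_compound_id('C7047455');
print "name=$plain[0] (no range)\n";
```

A sequence that is really named 'C7047455:0-100' is indistinguishable, under a rule like this, from a request for positions 0 to 100 of 'C7047455'.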

Florent


On 05/12/12 06:23, Jason Gallant wrote:
> Hello,
>
> I'm trying to retrieve fasta sequences that contain a colon in their
> header.  However, I cannot get my BioPerl script to do this!!
>
> It works as expected when the header does not contain the colon, however
> doesn't return anything when it does.  Weirdly, when I ask it to return the
> parsed IDs (see below), it returns the appropriate IDs, which include the
> colon!  Very confusing, would appreciate any help!!
>
> Many Thanks,
> Jason Gallant
>
>
> use strict;
> use Bio::SearchIO;
> use Bio::DB::Fasta;
>
>
> my ($file,$id,$start,$end) =
> ("secondround_merged_expanded.fasta","C7047455:0-100",1,10);
>
>
> my $db = Bio::DB::Fasta->new($file, -reindex=>1);
> my $seq = $db->seq($id,$start,$end);
>   
> print $db->ids;
>
> print $seq,"\n";
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From bosborne11 at verizon.net  Tue Dec  4 17:12:59 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 04 Dec 2012 17:12:59 -0500
Subject: [Bioperl-l] question about bioperl program
In-Reply-To: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>
References: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>
Message-ID: <16BBC477-9935-4C79-A70D-6B18716089FB@verizon.net>

Yong Li,

You want to take a look at this HOWTO:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Those genes you see in the file are features in the genome sequence.
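
A rough sketch of that idea, assuming the genes of interest are annotated as 'CDS' features and that records should be named after the 'locus_tag' qualifier when one is present (both are assumptions about the file, and extract_cds and genes.fasta are names made up for this example). Unlike the posted script, the output stream is opened once, outside the loop, under its own variable, so it does not overwrite the input stream.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Write every CDS feature of a GenBank file as its own FASTA record.
# Returns the number of records written.
sub extract_cds {
    my ($genbank_file, $fasta_file) = @_;
    my $in  = Bio::SeqIO->new(-file => $genbank_file,  -format => 'genbank');
    my $out = Bio::SeqIO->new(-file => ">$fasta_file", -format => 'fasta');
    my $count = 0;
    while (my $genome = $in->next_seq) {
        for my $feat ($genome->get_SeqFeatures) {
            next unless $feat->primary_tag eq 'CDS';
            # spliced_seq() returns the feature's own sequence as an object
            my $gene_seq = $feat->spliced_seq;
            # Name the record after the locus_tag when the feature has one.
            my ($name) = $feat->has_tag('locus_tag')
                       ? $feat->get_tag_values('locus_tag')
                       : ('cds_' . ($count + 1));
            $gene_seq->display_id($name);
            $out->write_seq($gene_seq);
            $count++;
        }
    }
    return $count;
}

# Only run when the input file from the question is actually present.
my $genbank = 'NC_003450.gbk';
if (-e $genbank) {
    my $written = extract_cds($genbank, 'genes.fasta');
    print "Wrote $written CDS sequences\n";
}
```

Changing the primary_tag test to 'gene' would write gene features instead, if that is what the file annotates.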

Brian O.


On Dec 1, 2012, at 1:10 AM, yongli at yeslab.com wrote:

> Dear Sir or Madam,
> 
> 
> 
> I have just begun learning BioPerl and am writing a BioPerl program to extract the gene sequences of a bacterium from its genome GenBank file. I downloaded the NC_003450.gbk, ppt, and ffn files; NC_003450 is a bacterial genome GenBank file. My BioPerl code is as follows:
> 
> 
> 
> use Bio::Seq;
> 
>  use Bio::SeqIO;
> 
> 
> 
>  $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank');
> 
>  # $seq_obj=$seqio_obj->next_seq;
> 
> 
> 
>  while($seq_obj=$seqio_obj->next_seq)
> 
>  {
> 
>    $display_name=$seq_obj->display_name;
> 
>    $desc=$seq_obj->desc;
> 
>    $seq=$seq_obj->seq;
> 
>  $acc = $seq_obj->accession_number;
> 
>  $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );
> 
>  $seqio_obj->write_seq($seq_obj);
> 
>  }
> 
> 
> 
> After the program ran, I only got the complete genome sequence, not a file containing each gene sequence one by one. I want to get the gene sequences one by one in a FASTA file, so I am writing to you for help.
> 
> 
> 
> Yong Li
> 
> 


From ankh.egypt.public at googlemail.com  Fri Dec  7 15:24:20 2012
From: ankh.egypt.public at googlemail.com (Adrian Helmchen)
Date: Fri, 07 Dec 2012 21:24:20 +0100
Subject: [Bioperl-l] proteins from an organism
Message-ID: <50C25074.8050703@googlemail.com>

Hello,

I would like to get all proteins from an organism, but not proteins from
chloroplasts, those with crystal structures, and so on.

I tried to obtain these proteins by sending the query 'Arabidopsis 
thaliana[organism]'
with Bio::DB::GenBank and fetching the GI numbers from the CDS.
But on one PC I get 6000 proteins and on another PC I get 46000 proteins,
although Arabidopsis thaliana has 25000 genes.

Thank you for your help.

From nikkie.vanbers at gmail.com  Mon Dec 10 03:07:27 2012
From: nikkie.vanbers at gmail.com (Nikki2)
Date: Mon, 10 Dec 2012 00:07:27 -0800 (PST)
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly
 database
Message-ID: <34761946.post@talk.nabble.com>


Hi,

I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
'Tracheophyta' that are in NCBI's assembly database. However, there are no
DocSums returned for the uid's that match the query. When I try the same
thing using the genome database it works fine.

The script that I used to do the query is at the bottom of this message. The
output I get when running the script is:

Count = 84

--------------------- WARNING ---------------------
MSG: No returned docsums.
---------------------------------------------------

I checked the @ids array and it contains the 84 uids.

My questions are as follows:

1) Is it possible to get DocSums for uids from the NCBI assembly database,
and if yes, how?
2) If not, does anyone have any suggestions how to change my script to get
the species-names that match the uids that are returned?

Thanks a lot!

Nikki


##############################################

#!/bin/perl -w

use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
                                       -db     => 'genome',
				       -email => 'my_email at gmail.com',
                                       -term   => 'Tracheophyta[organism]',
                                       -retmax => 5000);

print "Count = ",$factory->get_count,"\n";
my @ids = $factory->get_ids;

my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary',
					-email=>'my_email at gmail.com',
					-db    => 'genome',
                                        -id    => \@ids,
					ret_max=>5000);
 
while (my $ds = $factory2->next_DocSum) {
    print "ID: ",$ds->get_id,"\n";
    # flattened mode, iterates through all Item objects
    while (my $item = $ds->next_Item('flattened'))  {
        # not all Items have content, so need to check...
        printf("%-20s:%s\n",$item->get_name,$item->get_content) if
$item->get_content;
   }
    print "\n";
}


-- 
View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From cjfields at illinois.edu  Mon Dec 10 10:59:03 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 10 Dec 2012 15:59:03 +0000
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI
 assembly database
In-Reply-To: <34761946.post@talk.nabble.com>
References: <34761946.post@talk.nabble.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF4C164@CHIMBX5.ad.uillinois.edu>

Nikki, 

This is because a handful of the databases have apparently switched their docsum output completely to the DB-specific DocSum schemata (v2), which have not yet been implemented in Bio::EUtilities.  Parsing these correctly requires quite a bit of revision because the schema differs per database, so I don't have a timeline for when this will be available; it would likely be implemented incrementally over time.  

See here for the announcement:

    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.Release_Notes

In the meantime, you can get the raw XML output for these by replacing the loop for $factory2 with:

    print $factory2->get_Response->content

chris


On Dec 10, 2012, at 2:07 AM, Nikki2 <nikkie.vanbers at gmail.com> wrote:

> Hi,
> 
> I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
> 'Tracheophyta' that are in NCBI's assembly database. However, there are no
> DocSums returned for the uid's that match the query. When I try the same
> thing using the genome database it works fine.
> 
> The script that I used to do the query is at the bottom of this message. The
> output I get when running the script is:
> 
> Count = 84
> 
> --------------------- WARNING ---------------------
> MSG: No returned docsums.
> ---------------------------------------------------
> 
> I checked the @ids array and it contains the 84 uids.
> 
> My questions are as follows:
> 
> 1) Is it possible to get DocSums for uids from the NCBI assembly database,
> and if yes, how?
> 2) If not, does anyone have any suggestions how to change my script to get
> the species-names that match the uids that are returned?
> 
> Thanks a lot!
> 
> Nikki
> 
> 
> 
> 
> 
> 
> 
> ##############################################
> 
> #!/bin/perl -w
> 
> use Bio::DB::EUtilities;
> 
> my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
>                                       -db     => 'genome',
> 				       -email => 'my_email at gmail.com',
>                                       -term   => 'Tracheophyta[organism]',
>                                       -retmax => 5000);
> 
> print "Count = ",$factory->get_count,"\n";
> my @ids = $factory->get_ids;
> 
> my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary',
> 					-email=>'my_email at gmail.com',
> 					-db    => 'genome',
>                                        -id    => \@ids,
> 					ret_max=>5000);
> 
> while (my $ds = $factory2->next_DocSum) {
>    print "ID: ",$ds->get_id,"\n";
>    # flattened mode, iterates through all Item objects
>    while (my $item = $ds->next_Item('flattened'))  {
>        # not all Items have content, so need to check...
>        printf("%-20s:%s\n",$item->get_name,$item->get_content) if
> $item->get_content;
>   }
>    print "\n";
> }
> 
> 


From jason.stajich at gmail.com  Wed Dec 12 23:05:29 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 12 Dec 2012 20:05:29 -0800
Subject: [Bioperl-l] Asking
In-Reply-To: <201212131130153627348@gmail.com>
References: <201212131130153627348@gmail.com>
Message-ID: <7ED416EC-622E-4023-94B7-9A11D29929DC@gmail.com>

You want the reroot function. Have you tried reading the howtos on the website already? 
Node is a node in the tree. There are several functions to find a node or iterate through all the ones in the tree
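
A minimal sketch of that approach, assuming the outgroup is a leaf whose name is known in advance. The taxon name 'Outgroup' and the in-memory Newick string are made up for this example; with real data the tree would come from a file via -file instead of -fh.

```perl
use strict;
use warnings;
use Bio::TreeIO;

# Example tree held in memory; 'Outgroup' is a hypothetical leaf name.
my $newick = '((A:1,B:1):1,(C:1,Outgroup:2):1);';
open my $fh, '<', \$newick or die "Cannot read in-memory tree: $!";

my $treeio = Bio::TreeIO->new(-format => 'newick', -fh => $fh);
my $writer = Bio::TreeIO->new(-format => 'newick', -fh => \*STDOUT);

while (my $tree = $treeio->next_tree) {
    # find_node returns the node object(s) whose id matches
    my ($node) = $tree->find_node(-id => 'Outgroup');
    die "No node named Outgroup in this tree\n" unless $node;
    $tree->reroot($node);        # reroot the tree on that node
    $writer->write_tree($tree);  # print the rerooted tree
}
```

Here $node is a node object taken from the tree itself, not a string, which is why it has to be looked up first.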

Sent from my iPhone-please excuse typos

--
Jason Stajich

On Dec 12, 2012, at 7:30 PM, "Xing-Xing Shen" <shenxingxing2010 at gmail.com> wrote:

> Dear Jason 
> I am a beginner at BioPerl and have run into a problem: how to define an outgroup for a set of Newick trees.
> 
> My code is below:
> #!/usr/bin/perl
> use Bio::TreeIO;
> use Bio::Tree::NodeI;
> use Bio::Tree::Tree;
> my @filenames = glob("*.txt");
> foreach my $filename (@filenames) {
>    my $treeio = Bio::TreeIO->new('-format' => 'newick', '-file'   => "$filename");
>    while( my $tree = $treeio->next_tree ) {
>       $tree->set_root_node("$node"); # what might $node mean?
>        ..........
>        ..........
>    }
> }
> 
> 
> With best,
> 
> Xing-Xing Shen


From j.abbott at imperial.ac.uk  Thu Dec 13 14:49:15 2012
From: j.abbott at imperial.ac.uk (James Abbott)
Date: Thu, 13 Dec 2012 19:49:15 +0000
Subject: [Bioperl-l] deobfuscator broken....
Message-ID: <50CA313B.9060904@imperial.ac.uk>

Hi All,

Don't know if any admin folk are aware, but the bioperl.org 
deobfuscator is generating internal server errors. I've also been having 
problems with broken documentation links (cpan links producing the 
wrong modules, and pdoc pages missing) but can't seem to replicate that 
problem now....

I am, for now, still obfuscated...

Cheers,
James
-- 
Dr. James Abbott
Lead Bioinformatician
Bioinformatics Support Service
Imperial College, London

From p.j.a.cock at googlemail.com  Thu Dec 13 17:52:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 13 Dec 2012 22:52:44 +0000
Subject: [Bioperl-l] deobfuscator broken....
In-Reply-To: <50CA313B.9060904@imperial.ac.uk>
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>

On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
> Hi All,
>
> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
> is generating internal server errors. I've also been having problems with
> broken documentation links (cpan links producing the wrong modules, and
> pdoc pages missing) but can't seem to replicate that problem now....
>
> I am, for now, still obfuscated...
>
> Cheers,
> James

I would guess this is a side effect from the recent server move,
CC'ing root-l in case anyone of the sys-admin team had an idea.

Peter

From cjfields at illinois.edu  Thu Dec 13 17:51:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 13 Dec 2012 22:51:50 +0000
Subject: [Bioperl-l] deobfuscator broken....
In-Reply-To: <50CA313B.9060904@imperial.ac.uk>
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF545DA@CHIMBX5.ad.uillinois.edu>

This is likely due to the back-end change in servers.  I'm not sure how this was set up but we can inquire about it.

chris

On Dec 13, 2012, at 1:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:

> Hi All,
> 
> Don't know if any admin folk are aware, but the bioperl.org deobfuscator is generating internal server errors. I've also been having problems with broken documentation links (cpan links producing the wrong modules, and pdoc pages missing) but can't seem to replicate that problem now....
> 
> I am, for now, still obfuscated...
> 
> Cheers,
> James
> -- 
> Dr. James Abbott
> Lead Bioinformatician
> Bioinformatics Support Service
> Imperial College, London


From cjfields at illinois.edu  Thu Dec 13 18:13:55 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 13 Dec 2012 23:13:55 +0000
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF546F2@CHIMBX5.ad.uillinois.edu>

On Dec 13, 2012, at 4:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>> Hi All,
>> 
>> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
>> is generating internal server errors. I've also been having problems with
>> broken documentation links (cpan links producing the wrong modules, and
>> pdoc pages missing) but can't seem to replicate that problem now....
>> 
>> I am, for now, still obfuscated...
>> 
>> Cheers,
>> James
> 
> I would guess this is a side effect from the recent server move,
> CC'ing root-l in case anyone of the sys-admin team had an idea.
> 
> Peter

Beat me by four minutes!  

The CGI code is in websites/bioperl.org/cgi/.  I'm checking on the errors now, may take me a little time to get it back up (was missing CGI, now needs to have the lib path extended).

chris


From jason.stajich at gmail.com  Thu Dec 13 18:18:26 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 13 Dec 2012 15:18:26 -0800
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
Message-ID: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>

I think it uses mysql but I don't know if that was reconstituted on the new server. 

On Dec 13, 2012, at 2:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>> Hi All,
>> 
>> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
>> is generating internal server errors. I've also been having problems with
>> broken documentation links (cpan links producing the wrong modules, and
>> pdoc pages missing) but can't seem to replicate that problem now....
>> 
>> I am, for now, still obfuscated...
>> 
>> Cheers,
>> James
> 
> I would guess this is a side effect from the recent server move,
> CC'ing root-l in case anyone of the sys-admin team had an idea.
> 
> Peter
> _______________________________________________
> Root-l mailing list
> Root-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/root-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From nikkie.vanbers at gmail.com  Wed Dec  5 09:04:09 2012
From: nikkie.vanbers at gmail.com (Nikki2)
Date: Wed, 5 Dec 2012 06:04:09 -0800 (PST)
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly
 database
Message-ID: <34761946.post@talk.nabble.com>


Hi,

I'm using 'Bio::DB::EUtilities' to retrieve the names of all 'Tracheophyta'
entries in NCBI's assembly database. However, no DocSums are returned for
the UIDs that match the query. When I try the same thing using the genome
database it works fine.

The script that I used to do the query is at the bottom of this message. The
output I get when running the script is:

Count = 84

--------------------- WARNING ---------------------
MSG: No returned docsums.
---------------------------------------------------

I checked the @ids array and it contains the 84 uids.

My questions are as follows:

1) Is it possible to get DocSums for uids from the NCBI assembly database,
and if yes, how?
2) If not, does anyone have any suggestions how to change my script to get
the species-names that match the uids that are returned?

Thanks a lot!

Nikki


##############################################

#!/bin/perl -w

use strict;
use Bio::DB::EUtilities;

# find the matching UIDs
my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
                                       -db     => 'genome',
                                       -email  => 'my_email at gmail.com',
                                       -term   => 'Tracheophyta[organism]',
                                       -retmax => 5000);

print "Count = ",$factory->get_count,"\n";
my @ids = $factory->get_ids;

# fetch the document summaries for those UIDs
my $factory2 = Bio::DB::EUtilities->new(-eutil  => 'esummary',
                                        -email  => 'my_email at gmail.com',
                                        -db     => 'genome',
                                        -id     => \@ids,
                                        -retmax => 5000);

while (my $ds = $factory2->next_DocSum) {
    print "ID: ",$ds->get_id,"\n";
    # flattened mode, iterates through all Item objects
    while (my $item = $ds->next_Item('flattened'))  {
        # not all Items have content, so need to check...
        printf("%-20s:%s\n",$item->get_name,$item->get_content)
            if $item->get_content;
    }
    print "\n";
}


-- 
View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From online at davemessina.com  Thu Dec 13 18:41:35 2012
From: online at davemessina.com (Dave Messina)
Date: Thu, 13 Dec 2012 18:41:35 -0500
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
	<0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>
Message-ID: <A6ECFFCE-8274-4B60-B681-64AA04285059@davemessina.com>

It should be just (shudder) Berkeley DB.


On Dec 13, 2012, at 18:18, Jason Stajich <jason.stajich at gmail.com> wrote:

> I think it uses mysql but I don't know if that was reconstituted on the new server. 
> 
> On Dec 13, 2012, at 2:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
>> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>>> Hi All,
>>> 
>>> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
>>> is generating internal server errors. I've also been having problems with
>>> broken documentation links (cpan links producing the wrong modules, and
>>> pdoc pages missing) but can't seem to replicate that problem now....
>>> 
>>> I am, for now, still obfuscated...
>>> 
>>> Cheers,
>>> James
>> 
>> I would guess this is a side effect from the recent server move,
>> CC'ing root-l in case anyone of the sys-admin team had an idea.
>> 
>> Peter
>> _______________________________________________
>> Root-l mailing list
>> Root-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/root-l
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From abualiga2 at gmail.com  Tue Dec 18 17:08:51 2012
From: abualiga2 at gmail.com (galeb abu-ali)
Date: Tue, 18 Dec 2012 17:08:51 -0500
Subject: [Bioperl-l] Fwd: how to parse maf file format
In-Reply-To: <CANPzzTtgJJwmN2TyBhDb0h=nEofSh44RED-1v9VEvRG+ZazUvA@mail.gmail.com>
References: <CANPzzTtgJJwmN2TyBhDb0h=nEofSh44RED-1v9VEvRG+ZazUvA@mail.gmail.com>
Message-ID: <CANPzzTueE6TN2vofSkwAQN-RBSTZzm7UuJLVT5kEf6O+7CV2CQ@mail.gmail.com>

Hi,

I am writing a script to parse a multiple genome alignment file in MAF
format, generated with a Mugsy alignment of E. coli genomes.  So far, my
script parses SNPs from synteny blocks conserved in all aligned strains,
and it excludes gaps, which is enough for a phylogenetic analysis.  I was
wondering how I can parse the remaining blocks that are not conserved in
all strains, to see what is conserved in n-1, n-2, etc. strains or unique
to each strain.  I guess this is not a BioPerl question, but it's a Perl
for biologists question so I was hoping to get some insight here.  If there
is a more appropriate forum, please let me know.

Below is my code.

many thanks!

galeb

#!/usr/local/bin/perl
use Modern::Perl '2013';
use autodie;
use List::MoreUtils qw/ each_arrayref /;

# gsa 18.12.2012
# parse mugsy multiple genome alignment for SNPs in synteny blocks
# conserved in all aligned strains
=head
##maf version=1 scoring=mugsy
a score=7891 label=40 mult=4
s O55H7_RM12579.O55H7_RM12579        1596752 7262 + 5263980 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCG
s O55H7_CB9615.O55H7_CB9615        1604426 7262 + 5386352 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_Sakai.O157H7_Sakai        1787303 7068 + 5498450 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_EDL.O157H7_EDL933        1729749 7082 + 5528445 CGGGATGCGGGAATGGGAATGCCTTGGTTGACGGGGTGGCGGAAT

a score=6756 label=41 mult=4
s O55H7_RM12579.O55H7_RM12579        1986265 6749 + 5263980 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGG
s O55H7_CB9615.O55H7_CB9615        1991733 6749 + 5386352 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_Sakai.O157H7_Sakai        3940728 6751 - 5498450 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_EDL.O157H7_EDL933        4260689 4042 - 5528445 ---------------------------------------------
=cut

my $infile = shift or die "Usage: $0 <alignment_file.maf>\n";

my %snps;
my $strains = 0;
my @alignment;
my( $score, $blkLen, $mult );
my $total_snps;
my $syn_len;
my %lengths;

open my $fh, '<', $infile;

while( <$fh> ) {
    next if /^#/;
    chomp;

    if( /^a/ ) {
        ( $score, $blkLen, $mult ) = ( split )[1,2,3];
        # length of alignment block including '-'
        $score =~ s/score\=(\d+)/$1/;
        # alignment block number; numbers ranked on alignment length
        $blkLen =~ s/label\=(\d+)/$1/;
        # number of strains aligned in block
        $mult  =~ s/mult\=(\d+)/$1/;

        # total number of strains in alignment
        $strains = $mult if $mult > $strains;
    }
    elsif( /^s/ ) { push @alignment, $_ }

    elsif( /^$/ || ! length $_ ) {
        my( @strNames, @starts, @strands, @dna_mtrx );
        # if sequence conserved in all strains
        if( $strains == @alignment ) {
            $syn_len += $score; # total aligned sequence in all strains
            for( @alignment ) {
                # name, align start, align length (w/o '-'), direction,
                # align sequence w/ '-'
                my( $name, $start, $len, $strand, $dna )
                    = ( split /\s+/ )[ 1, 2, 3, 4, 6 ];
                #$name =~ s/.*\.(.*)/$1/; # remove duplicated strain name

                # strains are always in same order when all strains in block.
                push @strNames, $name;
                push @starts, $start;
                push @strands, $strand;
                push @dna_mtrx, [ split '', $dna ];
                # total sequence in each strain w/o '-' that is conserved
                # in all strains
                $lengths{ $name } += $len;
            }

            my $ea = each_arrayref( @dna_mtrx );
            my %gaps;
            my $cnt;
            while( my( @bases ) = $ea->() ) {
                ++$cnt;
                my %temp;
                for( 0 .. $#bases ) { # store gaps if any
                    if( $bases[$_] eq '-' ) {
                        # key is number, corresponds to index of other arrays
                        $gaps{$_}++;
                    }
                }
                # skip gaps '-'; if snp then %temp will have > 1 key
                unless( '-' ~~ @bases ) { $temp{ uc $_}++ for @bases }
                # if SNP exists, get base and position for all strains
                # in alignment
                if( keys %temp > 1 ) {
                    ++$total_snps;
                    my $pos;
                    for( 0 .. $#bases ) {
                        # genome position
                        if( $strands[$_] eq '+' ) {
                            $pos = $starts[$_] + $cnt - ( $gaps{$_} // 0 );
                        }
                        elsif( $strands[$_] eq '-' ) {
                            $pos = $starts[$_] - $cnt - ( $gaps{$_} // 0 );
                        }
                        # HoAoH
                        push @{ $snps{ $strNames[$_] } },
                            { $pos => $bases[$_] };
                    }
                }
            }
        }
        @alignment = ();
    }
}
close $fh;
#print Dumper( \%snps ); use Data::Dumper;
say "Sum length of synteny blocks conserved in all strains, including gaps: $syn_len bp";
say "Length of conserved sequence for each strain, excluding gaps:";
for my $strain ( keys %lengths ) {
    say "$strain\t$lengths{ $strain } bp";
}

my $outfile = $infile;
$outfile =~ s/\.maf$/_snps.txt/;
open my $fh2, '>', $outfile;
say {$fh2} map{ $_ . "_base\t", $_ . "_pos\t" } keys %snps;
for my $snp ( 0 .. ( $total_snps - 1 ) ) {
    for my $strain ( keys %snps ){
        for my $href ( keys %{ $snps{ $strain }[ $snp ] } ) {
            print {$fh2} "$snps{ $strain }[ $snp ]->{ $href }\t$href\t";
            }
        }
    print {$fh2} "\n";
}
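
One possible way to approach the n-1, n-2, ... question above is to group
the alignment blocks by the set of strains they contain, rather than
keeping only the blocks where every strain is present. The following is a
minimal sketch under stated assumptions, not a tested pipeline:
tally_blocks_by_strains() is a hypothetical helper name, each 's' line is
assumed to carry its sequence on the same line (standard MAF), and source
names are assumed to have the "genome.sequence" form seen in the example
above. The small alignment embedded in the code is made up for
illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally MAF blocks by the set of strains they contain.
# Returns a hash ref: "strainA,strainB" => { blocks => N, columns => M }.
# tally_blocks_by_strains() is a hypothetical helper, not part of BioPerl.
sub tally_blocks_by_strains {
    my ($fh) = @_;
    my ( %tally, @names, $cols );

    # Record the block collected so far, then reset for the next one.
    my $flush = sub {
        return unless @names;
        my $key = join ',', sort @names;
        $tally{$key}{blocks}++;
        $tally{$key}{columns} += $cols;
        @names = ();
    };

    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^a\b/ ) {        # a new block starts
            $flush->();
        }
        elsif ( $line =~ /^s\s/ ) {     # one aligned sequence
            my ( $src, $text ) = ( split ' ', $line )[ 1, 6 ];
            $src =~ s/\..*//;           # assumes "genome.sequence" names
            push @names, $src;
            $cols = length( $text // '' );    # alignment columns, incl. '-'
        }
        elsif ( $line !~ /\S/ ) {       # blank line ends a block
            $flush->();
        }
    }
    $flush->();                         # last block, if no trailing blank line
    return \%tally;
}

# Made-up in-memory example: one 3-strain block and one 2-strain block.
my $maf = <<'END';
##maf version=1 scoring=mugsy
a score=10 label=1 mult=3
s A.A 0 4 + 100 ACGT
s B.B 0 4 + 100 ACGT
s C.C 0 4 + 100 ACGA

a score=5 label=2 mult=2
s A.A 10 3 + 100 AC-G
s C.C 10 4 + 100 ACTG
END

open my $maf_fh, '<', \$maf or die "Cannot open in-memory file: $!";
my $tally = tally_blocks_by_strains($maf_fh);
close $maf_fh;

for my $key ( sort keys %$tally ) {
    my $n = 1 + ( $key =~ tr/,// );     # number of strains in this set
    printf "%d strain(s) [%s]: %d block(s), %d aligned columns\n",
        $n, $key, $tally->{$key}{blocks}, $tally->{$key}{columns};
}
```

Keys with a single strain name would then correspond to blocks unique to
that strain, and keys with n-1 names to blocks missing exactly one strain.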

From sanketd at isquareit.ac.in  Mon Dec 31 01:46:41 2012
From: sanketd at isquareit.ac.in (Sanket Desai)
Date: Mon, 31 Dec 2012 12:16:41 +0530 (IST)
Subject: [Bioperl-l] Help in getting organism names of the nucleotide
	entries.
Message-ID: <26019826.10871.1356936401744.JavaMail.root@mail.isquareit.ac.in>

Hello,

With respect to the post:
http://bio.perl.org/pipermail/bioperl-l/2009-December/031831.html

When used for nucleotide database it gives the following error:

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: No linksets returned
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: NCBI esummary fatal error: Empty id list - nothing todo
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::Tools::EUtilities::parse_data /usr/share/perl5/Bio/Tools/EUtilities.pm:382
STACK: Bio::Tools::EUtilities::next_DocSum /usr/share/perl5/Bio/Tools/EUtilities.pm:964
STACK: Bio::DB::EUtilities::next_DocSum /usr/share/perl5/Bio/DB/EUtilities.pm:914
STACK: getOrgNameFrmAccession.pl:29
-----------------------------------------------------------

Please suggest the relevant changes in the above script to make it work for the nucleotide entries also.

Thanks in advance,
Regards,
Sanket

From fcyucn at gmail.com  Mon Dec 17 20:37:45 2012
From: fcyucn at gmail.com (Fengchao Yu)
Date: Tue, 18 Dec 2012 01:37:45 -0000
Subject: [Bioperl-l] Is there any module for the protein digestion?
Message-ID: <7b719317-57a3-46ef-927c-6b0508e1e62d@googlegroups.com>

I notice that Bio::Restriction::Enzyme
(http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Restriction/Enzyme.pm)
is for DNA digestion. Is there any module for protein digestion?

Thanks


From florent.angly at gmail.com  Sun Dec  2 21:36:28 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 03 Dec 2012 12:36:28 +1000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
Message-ID: <50BC102C.7080902@gmail.com>

Hi all,

This is in response to Carson Holt's report that Bio::DB::Fasta does not 
play well with threads: https://redmine.open-bio.org/issues/3397

The first issue is the serialization of Bio::DB::IndexedBase-inheriting 
(e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for 
threading (for example when using Thread::Queue::Any). I implemented 
hooks that make it transparent to serialize using Storable freeze() and 
thaw().

Another issue was the lack of communication between different 
Bio::DB::IndexedBase instances, which means that an instance could 
easily be writing or deleting the database that another instance is 
working on. To fix this, I needed some form of locking.

Some Bio::DB::IndexedBase database backends (DB_File) have some support 
for locking, but Bio::DB::IndexedBase also supports other database 
backends for which there is no native locking mechanism. So, I had to 
come up with a more general solution: a lock file. I noticed that 
Bio::DB::SeqFeature::Store::berkeleydb has a locking mechanism, based on 
flock(), which means that it does not work with NFS-mounted filesystems. 
All the Bioperl-based scripts I (and most likely many others) write run 
on servers that use NFS, so this support is important. I have found only 
one way to do the NFS locking safely, using File::SharedNFSLock. It has 
a few downsides though:
     1/ it is an external dependency,
     2/ it does not work on FAT filesystems (which should be mostly 
restricted to USB sticks nowadays), where the lock is never acquired, and
     3/ at the moment, it requires a patch to work in threaded context 
(https://rt.cpan.org/Public/Bug/Display.html?id=81597)
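
To make the lock-file idea concrete, here is a minimal sketch of the
generic hard-link technique often used for lock files on shared
filesystems. It is an illustration only: it is not the code of
File::SharedNFSLock or of the branch discussed here, and acquire_lock(),
release_lock() and the lock file name are hypothetical. It assumes a
filesystem that supports hard links, and it does not handle stale locks
or waiting.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Sys::Hostname qw(hostname);

# Try to take the lock once; returns 1 on success, 0 if someone else holds it.
sub acquire_lock {
    my ($lockfile) = @_;

    # A name unique to this host and process. A threaded program would
    # also need the thread ID here, since all threads share the same $$.
    my $unique = join '.', $lockfile, hostname(), $$;
    open my $fh, '>', $unique or die "Cannot create $unique: $!";
    close $fh;

    # link() either creates the lock name or fails if it already exists.
    # The link count is checked as well, in case the link was made but
    # the success reply was lost.
    my $linked = link( $unique, $lockfile );
    my $nlink  = ( stat $unique )[3] // 0;
    unlink $unique;

    return ( $linked || $nlink == 2 ) ? 1 : 0;
}

# Give the lock back by removing the lock file.
sub release_lock {
    my ($lockfile) = @_;
    return unlink $lockfile;
}

# Usage in a scratch directory: the second attempt fails while the
# first lock is still held, and succeeds again after release.
my $dir    = tempdir( CLEANUP => 1 );
my $lock   = "$dir/index.lock";
my $first  = acquire_lock($lock);
my $second = acquire_lock($lock);
release_lock($lock);
my $third  = acquire_lock($lock);
release_lock($lock);
print "first=$first second=$second third=$third\n";
```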

Note that while I have now added basic support for threads in 
Bio::DB::IndexedBase, I still get segfaults in specific cases, 
for example when returning a database or sequence object. This might be 
related to this issue: 
https://rt.perl.org/rt3/Ticket/Display.html?id=115972. Beyond this, the 
new code seems to work nicely. See the branch 
https://github.com/bioperl/bioperl-live/tree/storable_db if you want to 
test yourself. For example, one can now run multiple threads, each of 
them creating a Bio::DB::Fasta database from the same FASTA file: the 
first thread performs the indexing while the others wait nicely for the 
indexing to be finished to query the database.

Comments welcome. Regards,

Florent


From l.m.timmermans at students.uu.nl  Mon Dec  3 19:29:59 2012
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 4 Dec 2012 01:29:59 +0100
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <50BC102C.7080902@gmail.com>
References: <50BC102C.7080902@gmail.com>
Message-ID: <CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>

On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com> wrote:
> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
> threading (for example when using Thread::Queue::Any). I implemented hooks
> that make it transparent to serialize using Storable freeze() and thaw().

I don't think serializing a magical thingie makes much sense. Storable
is commonly used for a lot more things than interthread communication
(e.g. network communication), this would often not work under such
circumstances.

Leon


From cjfields at illinois.edu  Mon Dec  3 22:23:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 03:23:50 +0000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>
References: <50BC102C.7080902@gmail.com>
	<CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>

On Dec 3, 2012, at 6:29 PM, Leon Timmermans <l.m.timmermans at students.uu.nl> wrote:

> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com> wrote:
>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>> threading (for example when using Thread::Queue::Any). I implemented hooks
>> that make it transparent to serialize using Storable freeze() and thaw().
> 
> I don't think serializing a magical thingie makes much sense. Storable
> is commonly used for a lot more things than interthread communication
> (e.g. network communication), this would often not work under such
> circumstances.
> 
> Leon

Leon, any suggestions on alternatives?  I know this particular bit is a sore spot with MAKER at the moment, so any help would be greatly appreciated.

chris


From yongli at yeslab.com  Sat Dec  1 01:10:15 2012
From: yongli at yeslab.com (yongli at yeslab.com)
Date: Sat, 1 Dec 2012 14:10:15 +0800 (CST)
Subject: [Bioperl-l] question about bioperl program
Message-ID: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>

Dear Sir or Madam,

 
 I have just begun to learn BioPerl and am using a BioPerl program to extract bacterial gene sequences from a genome GenBank file. I downloaded the NC_003450.gbk, ppt and ffn files. NC_003450 is a bacterial genome GenBank file. My BioPerl code is as follows:

 use Bio::Seq;
 use Bio::SeqIO;

 $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank');
 # separate output stream, so the input stream is not overwritten in the loop
 $out_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );

 # $seq_obj=$seqio_obj->next_seq;

 while($seq_obj=$seqio_obj->next_seq)
 {
   $display_name=$seq_obj->display_name;
   $desc=$seq_obj->desc;
   $seq=$seq_obj->seq;
   $acc = $seq_obj->accession_number;

   $out_obj->write_seq($seq_obj);
 }

 After the program ran, I got only the complete genome sequence rather than a file containing each gene sequence one by one. I want to get the gene sequences one by one in a FASTA file. So I am writing to you for help.

 
 Yong Li


From carsonhh at gmail.com  Mon Dec  3 22:35:50 2012
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 03 Dec 2012 22:35:50 -0500
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>
Message-ID: <CCE2D73A.16852%carsonhh@gmail.com>

Bio::DB::Fasta is working for maker now.  The previous issues have been
fixed, but being as Florent has gone out of his way to build a number of
improvements into Bio::DB::Fasta over the past few weeks, this seemed like
a useful one as well, so I suggested it.  One of the big uses of
Bio::DB::Fasta is the Bio::PrimarySeq::Fasta features it creates.  They
are great for manipulating the sequence without actually having to ever
keep it in memory.  It's nice because the sequence is made available on
demand, but when you try and pass them between threads, your program falls
apart. There are creative workarounds, but simply adding a serialization
hook to Bio::DB::Fasta to disconnect the database on freezing and then
reconnect on thaw also fixes it, and it makes them extremely useful for
multi-threaded applications without having to go through other kinds of
workarounds (it just makes them work as expected with serialization).
Previously I had created my own module and inherited from Bio::DB::Fasta
so I could implement the Storable hooks.  Because Storable looks for the
hooks in anything it serializes, the Bio::DB::Fasta object can even be
well down inside of a complex object and you don't have to worry about it.
Previously I've used Storable hooks to pass the Bio::PrimarySeq::Fasta
features across the network using MPI, as long as the database is on an
NFS mount it just reconnects on the other node with no issue.  If the
indexed file isn't available after deserialization over a network, you
could just throw an error when the thaw hook is called.  I'll give
Florent's changes a look over soon to give any suggestions.
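
As a rough illustration of the disconnect-on-freeze, reconnect-on-thaw
idea, here is a minimal sketch using Storable's STORABLE_freeze and
STORABLE_thaw hooks. The IndexedFile class is a hypothetical stand-in
that holds a plain filehandle; it is not Bio::DB::Fasta and not the code
in Florent's branch.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Storable qw(freeze thaw);

package IndexedFile;    # hypothetical stand-in for a file-backed database

sub new {
    my ( $class, $path ) = @_;
    my $self = bless { path => $path }, $class;
    $self->_connect;
    return $self;
}

# (Re)open the underlying file; dies if it is not reachable.
sub _connect {
    my ($self) = @_;
    open my $fh, '<', $self->{path}
        or die "Cannot open $self->{path}: $!";
    $self->{fh} = $fh;
    return;
}

sub first_line {
    my ($self) = @_;
    seek $self->{fh}, 0, 0;
    my $line = readline $self->{fh};
    chomp $line;
    return $line;
}

# Called by Storable when freezing: keep only the path, drop the
# filehandle (which Storable could not serialize anyway).
sub STORABLE_freeze {
    my ( $self, $cloning ) = @_;
    return ( $self->{path} );
}

# Called by Storable when thawing, on an empty blessed hash:
# restore the path and reconnect.
sub STORABLE_thaw {
    my ( $self, $cloning, $serialized ) = @_;
    $self->{path} = $serialized;
    $self->_connect;
    return;
}

package main;

# Round trip through freeze/thaw with a small temporary file.
my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} ">seq1\nACGT\n";
close $tmp_fh;

my $db   = IndexedFile->new($tmp_path);
my $copy = thaw( freeze($db) );
print $copy->first_line, "\n";
```

Because _connect() dies when the file cannot be opened, a thaw on a
machine where the file is missing fails loudly instead of producing a
half-working object.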

Thanks,
Carson


On 12-12-03 10:23 PM, "Fields, Christopher J" <cjfields at illinois.edu>
wrote:

>On Dec 3, 2012, at 6:29 PM, Leon Timmermans
><l.m.timmermans at students.uu.nl> wrote:
>
>> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com>
>>wrote:
>>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>>> threading (for example when using Thread::Queue::Any). I implemented
>>>hooks
>>> that make it transparent to serialize using Storable freeze() and
>>>thaw().
>> 
>> I don't think serializing a magical thingie makes much sense. Storable
>> is commonly used for a lot more things than interthread communication
>> (e.g. network communication), this would often not work under such
>> circumstances.
>> 
>> Leon
>
>Leon, any suggestions on alternatives?  I know this particular bit is a
>sore spot with MAKER at the moment, so any help would be greatly
>appreciated.
>
>chris
>


From jason.r.gallant at gmail.com  Tue Dec  4 15:23:02 2012
From: jason.r.gallant at gmail.com (Jason Gallant)
Date: Tue, 4 Dec 2012 12:23:02 -0800 (PST)
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header
Message-ID: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>

Hello,

I'm trying to retrieve fasta sequences that contain a colon in their 
header.  However, I cannot get my BioPerl script to do this!!

It works as expected when the header does not contain the colon, however 
doesn't return anything when it does.  Weirdly, when I ask it to return the 
parsed IDs (see below), it returns the appropriate IDs, which include the 
colon!  Very confusing, would appreciate any help!!

Many Thanks,
Jason Gallant


use strict;
use Bio::SearchIO; 
use Bio::DB::Fasta;


my ($file,$id,$start,$end) = 
("secondround_merged_expanded.fasta","C7047455:0-100",1,10);


my $db = Bio::DB::Fasta->new($file, -reindex=>1);
my $seq = $db->seq($id,$start,$end);
 
print $db->ids;

print $seq,"\n";


From asjo at koldfront.dk  Tue Dec  4 15:53:08 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 21:53:08 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	(Francesco Musacchia's message of "Wed, 28 Nov 2012 02:27:16 -0800
	(PST)")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
Message-ID: <87y5hdletn.fsf@topper.koldfront.dk>

On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:

> I'm experiencing that when I have to do a lot of accesses on a GFF
> database (with Bio::DB::SeqFeature::Store) the slowness increases until
> my script can stay running for more than a day.

First you'll need to find out what/where exactly it is slow. One way to
do so is using a profiler; this is a good one for Perl:

 * https://metacpan.org/module/Devel::NYTProf

If you want more specific suggestions, you'll probably have to provide
more information.


  Good luck!

    Adam

-- 
 "As Knuth pointed out long ago, speed only matters           Adam Sjøgren
  in certain critical bottlenecks. And as many           asjo at koldfront.dk
  programmers have observed since, one is very often
  mistaken about where these bottlenecks are."


From cjfields at illinois.edu  Tue Dec  4 16:10:00 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 21:10:00 +0000
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <87y5hdletn.fsf@topper.koldfront.dk>
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	<87y5hdletn.fsf@topper.koldfront.dk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>


On Dec 4, 2012, at 2:53 PM, Adam Sjøgren <asjo at koldfront.dk>
 wrote:

> On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:
> 
>> I'm experiencing that when I have to do a lot of accesses on a GFF
>> database (with Bio::DB::SeqFeature::Store) the slowness increases until
>> my script can stay running for more than a day.
> 
> First you'll need to find out what/where exactly it is slow. One way to
> do so is using a profiler; this is a good one for Perl:
> 
> * https://metacpan.org/module/Devel::NYTProf
> 
> If you want more specific suggestions, you'll probably have to provide
> more information.
> 
> 
>  Good luck!
> 
>    Adam

If anything, we need more profiling of Bioperl code.  Ah, if we only had infinite time... :)

chris


From asjo at koldfront.dk  Tue Dec  4 16:33:55 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 22:33:55 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
	(Christopher J. Fields's message of "Tue, 4 Dec 2012 21:10:00 +0000")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	<87y5hdletn.fsf@topper.koldfront.dk>
	<118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
Message-ID: <87txs1jyd8.fsf@topper.koldfront.dk>

On Tue, 4 Dec 2012 21:10:00 +0000, Fields, wrote:

> If anything, we need more profiling of Bioperl code. Ah, if we only
> had infinite time... :)

If we had that, we wouldn't need profiling!


  ;-),

   Adam

-- 
 "On the quiet side. Somewhat peculiar. A good                Adam Sjøgren
  companion, in a weird sort of way."                    asjo at koldfront.dk


From florent.angly at gmail.com  Tue Dec  4 16:52:41 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Wed, 05 Dec 2012 07:52:41 +1000
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta
	Header
In-Reply-To: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>
References: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>
Message-ID: <50BE70A9.4060404@gmail.com>

Hi Jason,

See the documentation for seq() at 
http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/DB/Fasta.pm#OBJECT_METHODS.

When you call seq() with a single argument, e.g. 
$db->seq('C7047455:0-100'), Bio::DB::Fasta interprets it as a compound 
ID and looks for position 0 to 100 of a sequence called C7047455. This 
is a feature that has been in Bio::DB::Fasta since the dawn of time. In 
this form, seq() expects a colon as part of the compound ID, which is 
problematic because your sequence ID actually contains a colon.

I think that when you call $db->seq($id,$start,$end), Bio::DB::Fasta 
does not attempt to parse your ID. This is why your code works with this 
form. Note that if you want to get the entirety of a sequence called 
'C7047455:0-100', the easiest approach when your sequence names contain 
colons is to use $db->get_Seq_by_id('C7047455:0-100'), since 
get_Seq_by_id() takes only a regular ID (not a compound one).
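
To show why an ID such as 'C7047455:0-100' is ambiguous, here is a
minimal sketch of compound-ID parsing. It assumes a simple
"name:start-end" pattern; parse_compound_id() is a hypothetical helper
written for illustration and is not the actual Bio::DB::Fasta code, which
may accept other separators.

```perl
use strict;
use warnings;

# Split "name:start-end" into its parts; IDs that do not match the
# pattern are returned unchanged, with undefined coordinates.
sub parse_compound_id {
    my ($id) = @_;
    if ( $id =~ /^(.+):(\d+)-(\d+)$/ ) {
        return ( $1, $2, $3 );
    }
    return ( $id, undef, undef );
}

for my $id ( 'C7047455:0-100', 'C7047455' ) {
    my ( $name, $start, $end ) = parse_compound_id($id);
    printf "%-16s -> name=%s start=%s end=%s\n",
        $id, $name, $start // 'undef', $end // 'undef';
}
```

With such a pattern, a sequence whose real name is 'C7047455:0-100' is
looked up under the name 'C7047455', which does not exist in the index.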

Florent


On 05/12/12 06:23, Jason Gallant wrote:
> Hello,
>
> I'm trying to retrieve fasta sequences that contain a colon in their
> header.  However, I cannot get my BioPerl script to do this!!
>
> It works as expected when the header does not contain the colon, however
> doesn't return anything when it does.  Weirdly, when I ask it to return the
> parsed IDs (see below), it returns the appropriate IDs, which include the
> colon!  Very confusing, would appreciate any help!!
>
> Many Thanks,
> Jason Gallant
>
>
> use strict;
> use Bio::SearchIO;
> use Bio::DB::Fasta;
>
>
> my ($file,$id,$start,$end) =
> ("secondround_merged_expanded.fasta","C7047455:0-100",1,10);
>
>
> my $db = Bio::DB::Fasta->new($file, -reindex=>1);
> my $seq = $db->seq($id,$start,$end);
>   
> print $db->ids;
>
> print $seq,"\n";
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From bosborne11 at verizon.net  Tue Dec  4 17:12:59 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 04 Dec 2012 17:12:59 -0500
Subject: [Bioperl-l] question about bioperl program
In-Reply-To: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>
References: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>
Message-ID: <16BBC477-9935-4C79-A70D-6B18716089FB@verizon.net>

Yong Li,

You want to take a look at this HOWTO:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Those genes you see in the file are features in the genome sequence.
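
As a starting point, here is a minimal sketch of how the CDS features
could be written out one by one as FASTA. It assumes the genes of
interest are annotated as 'CDS' features and that most of them carry a
'locus_tag'; write_cds_fasta(), the 'cds_N' fallback names and the output
file name are made up for illustration.

```perl
use strict;
use warnings;

# Write every CDS feature of a GenBank file as a separate FASTA record.
# Returns the number of records written.
sub write_cds_fasta {
    my ( $gbk_file, $fasta_file ) = @_;
    require Bio::SeqIO;    # BioPerl, loaded when the function is called

    my $in  = Bio::SeqIO->new( -file => $gbk_file,      -format => 'genbank' );
    my $out = Bio::SeqIO->new( -file => ">$fasta_file", -format => 'fasta' );

    my $count = 0;
    while ( my $seq = $in->next_seq ) {
        for my $feat ( $seq->get_SeqFeatures ) {
            next unless $feat->primary_tag eq 'CDS';

            # spliced_seq() returns the feature's own sequence,
            # taking strand and split locations into account
            my $cds = $feat->spliced_seq;

            # use the locus_tag as ID if present, else a made-up name
            my ($id) = $feat->has_tag('locus_tag')
                ? $feat->get_tag_values('locus_tag')
                : ( 'cds_' . ( $count + 1 ) );
            $cds->display_id($id);

            $out->write_seq($cds);
            $count++;
        }
    }
    return $count;
}

# Usage: perl script.pl NC_003450.gbk genes.fasta
if ( @ARGV == 2 ) {
    printf "%d CDS written\n", write_cds_fasta(@ARGV);
}
```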

Brian O.


On Dec 1, 2012, at 1:10 AM, yongli at yeslab.com wrote:

> Dear Sir or Madam,
> 
> 
> 
> I just begin to learn bioperl and am using a bioperl program to extract a kind of bacterial genes sequences from genome genbank file. I download NC_003450.gbk, ppt, ffn files. NC_003450 is a kind of bacterial genome genbank file. My bioperl code as follows:
> 
> 
> 
> use Bio::Seq;
> 
>  use Bio::SeqIO;
> 
> 
> 
>  $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank');
> 
>  # $seq_obj=$seqio_obj->next_seq;
> 
> 
> 
>  while($seq_obj=$seqio_obj->next_seq)
> 
>  {
> 
>    $display_name=$seq_obj->display_name;
> 
>    $desc=$seq_obj->desc;
> 
>    $seq=$seq_obj->seq;
> 
>  $acc = $seq_obj->accession_number;
> 
>  $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );
> 
>  $seqio_obj->write_seq($seq_obj);
> 
>  }
> 
> 
> 
> After the program runned I just gain a genome complete sequence without a file including every gene sequence one by one. I want to get the genes sequences one by one in a fasta files.  So I write you for help.
> 
> 
> 
> Yong Li
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From ankh.egypt.public at googlemail.com  Fri Dec  7 15:24:20 2012
From: ankh.egypt.public at googlemail.com (Adrian Helmchen)
Date: Fri, 07 Dec 2012 21:24:20 +0100
Subject: [Bioperl-l] proteins from an organism
Message-ID: <50C25074.8050703@googlemail.com>

Hello,

I would like to get all proteins from an organism, but not proteins from
chloroplasts or with crystal structures or the like.

I tried to obtain these proteins by sending the query 'Arabidopsis 
thaliana[organism]'
with Bio::DB::GenBank and fetching the GI numbers from the CDS features.
But on one PC I get 6000 proteins and on another PC I get 46000 proteins,
although Arabidopsis thaliana has 25000 genes.

Thank you for your help.


From nikkie.vanbers at gmail.com  Mon Dec 10 03:07:27 2012
From: nikkie.vanbers at gmail.com (Nikki2)
Date: Mon, 10 Dec 2012 00:07:27 -0800 (PST)
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly
 database
Message-ID: <34761946.post@talk.nabble.com>


Hi,

I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
'Tracheophyta' that are NCBI's assembly database. However, there are no
DocSums returned for the uid's that match the query. When I try the same
thing using the genome database it works fine.

The script that I used to do the query is at the bottom of this message. The
output I get when running the script is:

Count = 84

--------------------- WARNING ---------------------
MSG: No returned docsums.
---------------------------------------------------

I checked the @ids array and it contains the 84 uids.

My questions are as follows:

1) Is it possible to get DocSums for uids from the NCBI assembly database,
and if yes, how?
2) If not, does anyone have any suggestions how to change my script to get
the species-names that match the uids that are returned?

Thanks a lot!

Nikki


##############################################

#!/bin/perl -w

use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
                                       -db     => 'genome',
				       -email => 'my_email at gmail.com',
                                       -term   => 'Tracheophyta[organism]',
                                       -retmax => 5000);

print "Count = ",$factory->get_count,"\n";
my @ids = $factory->get_ids;

my $factory2 = Bio::DB::EUtilities->new(-eutil  => 'esummary',
                                        -email  => 'my_email at gmail.com',
                                        -db     => 'genome',
                                        -id     => \@ids,
                                        -retmax => 5000);
 
while (my $ds = $factory2->next_DocSum) {
    print "ID: ",$ds->get_id,"\n";
    # flattened mode, iterates through all Item objects
    while (my $item = $ds->next_Item('flattened'))  {
        # not all Items have content, so need to check...
        printf("%-20s:%s\n",$item->get_name,$item->get_content)
            if $item->get_content;
   }
    print "\n";
}


-- 
View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From cjfields at illinois.edu  Mon Dec 10 10:59:03 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 10 Dec 2012 15:59:03 +0000
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI
 assembly database
In-Reply-To: <34761946.post@talk.nabble.com>
References: <34761946.post@talk.nabble.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF4C164@CHIMBX5.ad.uillinois.edu>

Nikki, 

This is b/c a handful of the databases apparently have switched docsum output completely to the DB-specific DocSum schemata (v2), which have not been implemented in Bio::EUtilities as of yet.  This requires quite a bit of revision to parse correctly as it's per database, so I don't have a timeline on when this would be available and would likely be incrementally implemented over time.  

See here for the announcement:

    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.Release_Notes

In the meantime, you can get the raw XML output for these by replacing the loop for $factory2 with:

    print $factory2->get_Response->content;
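
As a stopgap, the names can be pulled out of that raw XML directly. This is only a minimal sketch: extract_elements is a hypothetical helper (not part of BioPerl), and the element name 'SpeciesName' is an assumption that should be checked against the XML NCBI actually returns. A real XML parser would be more robust than this regular expression.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the text content of
# every <$tag>...</$tag> element found in an XML string.
sub extract_elements {
    my ($xml, $tag) = @_;
    # Allow optional attributes on the opening tag; non-greedy body match.
    my @values = $xml =~ m{<\Q$tag\E(?:\s[^>]*)?>(.*?)</\Q$tag\E>}gs;
    return @values;
}

# Usage with the esummary factory from the original script
# ('SpeciesName' is an assumed element name):
#   my $xml = $factory2->get_Response->content;
#   print "$_\n" for extract_elements($xml, 'SpeciesName');
```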

chris


On Dec 10, 2012, at 2:07 AM, Nikki2 <nikkie.vanbers at gmail.com> wrote:

> Hi,
> 
> I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
> 'Tracheophyta' that are NCBI's assembly database. However, there are no
> DocSums returned for the uid's that match the query. When I try the same
> thing using the genome database it works fine.
> 
> The script that I used to do the query is at the bottom of this message. The
> output I get when running the script is:
> 
> Count = 84
> 
> --------------------- WARNING ---------------------
> MSG: No returned docsums.
> ---------------------------------------------------
> 
> I checked the @ids array and it contains the 84 uids.
> 
> My questions are as follows:
> 
> 1) Is it possible to get DocSums for uids from the NCBI assembly database,
> and if yes, how?
> 2) If not, does anyone have any suggestions how to change my script to get
> the species-names that match the uids that are returned?
> 
> Thanks a lot!
> 
> Nikki
> 
> 
> 
> 
> 
> 
> 
> ##############################################
> 
> #!/bin/perl -w
> 
> use Bio::DB::EUtilities;
> 
> my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
>                                       -db     => 'genome',
> 				       -email => 'my_email at gmail.com',
>                                       -term   => 'Tracheophyta[organism]',
>                                       -retmax => 5000);
> 
> print "Count = ",$factory->get_count,"\n";
> my @ids = $factory->get_ids;
> 
> my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary',
> 					-email=>'my_email at gmail.com',
> 					-db    => 'genome',
>                                        -id    => \@ids,
> 					ret_max=>5000);
> 
> while (my $ds = $factory2->next_DocSum) {
>    print "ID: ",$ds->get_id,"\n";
>    # flattened mode, iterates through all Item objects
>    while (my $item = $ds->next_Item('flattened'))  {
>        # not all Items have content, so need to check...
>        printf("%-20s:%s\n",$item->get_name,$item->get_content) if
> $item->get_content;
>   }
>    print "\n";
> }
> 
> 
> -- 
> View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From jason.stajich at gmail.com  Wed Dec 12 23:05:29 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 12 Dec 2012 20:05:29 -0800
Subject: [Bioperl-l] Asking
In-Reply-To: <201212131130153627348@gmail.com>
References: <201212131130153627348@gmail.com>
Message-ID: <7ED416EC-622E-4023-94B7-9A11D29929DC@gmail.com>

You want the reroot function. Have you tried reading the howtos on the website already? 
$node is a node in the tree. There are several functions to find a node or iterate through all the ones in the tree.
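
A minimal sketch of that approach, assuming the outgroup is a leaf whose id is known. The helper name reroot_on_outgroup, the placeholder 'outgroup_taxon' and the '.rerooted' output suffix are made up for illustration; find_node and reroot are the Bio::Tree::Tree methods.

```perl
use strict;
use warnings;
use Bio::TreeIO;

# Reroot $tree on the node whose id is $outgroup_id.
# Returns true on success, false if the taxon is missing or rerooting fails.
sub reroot_on_outgroup {
    my ($tree, $outgroup_id) = @_;
    my ($node) = $tree->find_node(-id => $outgroup_id);
    return 0 unless $node;
    return $tree->reroot($node);
}

# Tree files are taken from the command line; 'outgroup_taxon' is a
# placeholder for the real outgroup name.
for my $filename (@ARGV) {
    my $in  = Bio::TreeIO->new(-format => 'newick', -file => $filename);
    my $out = Bio::TreeIO->new(-format => 'newick', -file => ">$filename.rerooted");
    while (my $tree = $in->next_tree) {
        reroot_on_outgroup($tree, 'outgroup_taxon')
            or warn "Could not reroot a tree in $filename\n";
        $out->write_tree($tree);
    }
}
```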

Sent from my iPhone-please excuse typos

--
Jason Stajich

On Dec 12, 2012, at 7:30 PM, "Xing-Xing Shen" <shenxingxing2010 at gmail.com> wrote:

> Dear Jason,
> I am a beginner in learning BioPerl. I have run into a problem: how do I define the outgroup for a set of Newick trees?
> 
> My code is below:
> #!/usr/bin/perl
> use Bio::TreeIO;
> use Bio::Tree::NodeI;
> use Bio::Tree::Tree;
> my @filenames = glob("*.txt");
> foreach my $filename (@filenames) {
>    my $treeio = Bio::TreeIO->new('-format' => 'newick', '-file'   => "$filename");
>    while( my $tree = $treeio->next_tree ) {
>       $tree->set_root_node("$node"); # what might $node mean?
>        ..........
>        ..........
>    }
> }
> 
> 
> With best,
> 
> Xing-Xing Shen


From j.abbott at imperial.ac.uk  Thu Dec 13 14:49:15 2012
From: j.abbott at imperial.ac.uk (James Abbott)
Date: Thu, 13 Dec 2012 19:49:15 +0000
Subject: [Bioperl-l] deobfuscator broken....
Message-ID: <50CA313B.9060904@imperial.ac.uk>

Hi All,

Don't know if any admin folk are aware, but the bioperl.org 
deobfuscator is generating internal server errors. I've also been having 
problems with broken documentation links (cpan links producing the 
wrong modules, and pdoc pages missing) but can't seem to replicate that 
problem now....

I am, for now, still obfuscated...

Cheers,
James
-- 
Dr. James Abbott
Lead Bioinformatician
Bioinformatics Support Service
Imperial College, London


From p.j.a.cock at googlemail.com  Thu Dec 13 17:52:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 13 Dec 2012 22:52:44 +0000
Subject: [Bioperl-l] deobfuscator broken....
In-Reply-To: <50CA313B.9060904@imperial.ac.uk>
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>

On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
> Hi All,
>
> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator
> is generating internal server errors. I've also been having problems with
> broken documentation links (cpan links producning the wrong modules, and
> pdoc pages missing) but can't seem to replicate that problem now....
>
> I am, for now, still obfuscated...
>
> Cheers,
> James

I would guess this is a side effect of the recent server move,
CC'ing root-l in case anyone on the sys-admin team has an idea.

Peter


From cjfields at illinois.edu  Thu Dec 13 17:51:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 13 Dec 2012 22:51:50 +0000
Subject: [Bioperl-l] deobfuscator broken....
In-Reply-To: <50CA313B.9060904@imperial.ac.uk>
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF545DA@CHIMBX5.ad.uillinois.edu>

This is likely due to the back-end change in servers.  I'm not sure how this was set up but we can inquire about it.

chris

On Dec 13, 2012, at 1:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:

> Hi All,
> 
> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator is generating internal server errors. I've also been having problems with broken documentation links (cpan links producning the wrong modules, and pdoc pages missing) but can't seem to replicate that problem now....
> 
> I am, for now, still obfuscated...
> 
> Cheers,
> James
> -- 
> Dr. James Abbott
> Lead Bioinformatician
> Bioinformatics Support Service
> Imperial College, London
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From cjfields at illinois.edu  Thu Dec 13 18:13:55 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 13 Dec 2012 23:13:55 +0000
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF546F2@CHIMBX5.ad.uillinois.edu>

On Dec 13, 2012, at 4:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>> Hi All,
>> 
>> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator
>> is generating internal server errors. I've also been having problems with
>> broken documentation links (cpan links producning the wrong modules, and
>> pdoc pages missing) but can't seem to replicate that problem now....
>> 
>> I am, for now, still obfuscated...
>> 
>> Cheers,
>> James
> 
> I would guess this is a side effect from the recent server move,
> CC'ing root-l in case anyone of the sys-admin team had an idea.
> 
> Peter

Beat me by four minutes!  

The CGI code is in websites/bioperl.org/cgi/.  I'm checking on the errors now, may take me a little time to get it back up (was missing CGI, now needs to have the lib path extended).

chris


From jason.stajich at gmail.com  Thu Dec 13 18:18:26 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 13 Dec 2012 15:18:26 -0800
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
Message-ID: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>

I think it uses mysql but I don't know if that was reconstituted on the new server. 

On Dec 13, 2012, at 2:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>> Hi All,
>> 
>> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator
>> is generating internal server errors. I've also been having problems with
>> broken documentation links (cpan links producning the wrong modules, and
>> pdoc pages missing) but can't seem to replicate that problem now....
>> 
>> I am, for now, still obfuscated...
>> 
>> Cheers,
>> James
> 
> I would guess this is a side effect from the recent server move,
> CC'ing root-l in case anyone of the sys-admin team had an idea.
> 
> Peter
> _______________________________________________
> Root-l mailing list
> Root-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/root-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From online at davemessina.com  Thu Dec 13 18:41:35 2012
From: online at davemessina.com (Dave Messina)
Date: Thu, 13 Dec 2012 18:41:35 -0500
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
	<0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>
Message-ID: <A6ECFFCE-8274-4B60-B681-64AA04285059@davemessina.com>

It should be just (shudder) Berkeley DB.


On Dec 13, 2012, at 18:18, Jason Stajich <jason.stajich at gmail.com> wrote:

> I think it uses mysql but I don't know if that was reconstituted on the new server. 
> 
> On Dec 13, 2012, at 2:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
>> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>>> Hi All,
>>> 
>>> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator
>>> is generating internal server errors. I've also been having problems with
>>> broken documentation links (cpan links producning the wrong modules, and
>>> pdoc pages missing) but can't seem to replicate that problem now....
>>> 
>>> I am, for now, still obfuscated...
>>> 
>>> Cheers,
>>> James
>> 
>> I would guess this is a side effect from the recent server move,
>> CC'ing root-l in case anyone of the sys-admin team had an idea.
>> 
>> Peter
>> _______________________________________________
>> Root-l mailing list
>> Root-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/root-l
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From abualiga2 at gmail.com  Tue Dec 18 17:08:51 2012
From: abualiga2 at gmail.com (galeb abu-ali)
Date: Tue, 18 Dec 2012 17:08:51 -0500
Subject: [Bioperl-l] Fwd: how to parse maf file format
In-Reply-To: <CANPzzTtgJJwmN2TyBhDb0h=nEofSh44RED-1v9VEvRG+ZazUvA@mail.gmail.com>
References: <CANPzzTtgJJwmN2TyBhDb0h=nEofSh44RED-1v9VEvRG+ZazUvA@mail.gmail.com>
Message-ID: <CANPzzTueE6TN2vofSkwAQN-RBSTZzm7UuJLVT5kEf6O+7CV2CQ@mail.gmail.com>

Hi,

I am writing a script to parse a multiple genome alignment file in maf
format, generated with mugsy alignment of e.coli genomes.  So far, my
script parses SNPs from synteny blocks conserved in all aligned strains,
and it excludes gaps, which is enough for a phylogenetic analysis.  I was
wondering how I can parse the remaining blocks that are not conserved in
all strains, to see what is conserved in n-1, n-2, etc. strains or unique
to each strain.  I guess this is not a BioPerl question, but it's a Perl
for biologists question so I was hoping to get some insight here.  If there
is a more appropriate forum, please let me know.

Below is my code.

many thanks!

galeb

#!/usr/local/bin/perl
use Modern::Perl '2013';
use autodie;
use List::MoreUtils qw/ each_arrayref /;

# gsa 18.12.2012
# parse mugsy multiple genome alignment for SNPs in synteny blocks
# conserved in all aligned strains
=head
##maf version=1 scoring=mugsy
a score=7891 label=40 mult=4
s O55H7_RM12579.O55H7_RM12579        1596752 7262 + 5263980 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCG
s O55H7_CB9615.O55H7_CB9615        1604426 7262 + 5386352 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_Sakai.O157H7_Sakai        1787303 7068 + 5498450 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_EDL.O157H7_EDL933        1729749 7082 + 5528445 CGGGATGCGGGAATGGGAATGCCTTGGTTGACGGGGTGGCGGAAT

a score=6756 label=41 mult=4
s O55H7_RM12579.O55H7_RM12579        1986265 6749 + 5263980 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGG
s O55H7_CB9615.O55H7_CB9615        1991733 6749 + 5386352 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_Sakai.O157H7_Sakai        3940728 6751 - 5498450 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_EDL.O157H7_EDL933        4260689 4042 - 5528445 ---------------------------------------------
=cut

my $infile = shift or die "Usage: $0 <alignment_file.maf>\n";

my %snps;
my $strains = 0;
my @alignment;
my( $score, $blkLen, $mult );
my $total_snps;
my $syn_len;
my %lengths;

open my $fh, '<', $infile;

while( <$fh> ) {
    next if /^#/;
    chomp;

    if( /^a/ ) {
        ( $score, $blkLen, $mult ) = ( split )[1,2,3];
        $score =~ s/score\=(\d+)/$1/; # length of alignment block including '-'
        $blkLen =~ s/label\=(\d+)/$1/; # alignment block number; numbers ranked on alignment length
        $mult  =~ s/mult\=(\d+)/$1/;# number of strains aligned in block

        $strains = $mult if $mult > $strains; # total number of strains in alignment
    }
    elsif( /^s/ ) { push @alignment, $_ }

    elsif( /^$/ || ! length $_ ) {
        my( @strNames, @starts, @strands, @dna_mtrx );
        # if sequence conserved in all strains
        if( $strains == @alignment ) {
            $syn_len += $score; # total aligned sequence in all strains
            for( @alignment ) {
                # name, align start, align length (w/o '-'), direction, align sequence w/ '-'
                my( $name, $start, $len, $strand, $dna ) = ( split /\s+/ )[ 1, 2, 3, 4, 6 ];
                #$name =~ s/.*\.(.*)/$1/; # remove duplicated strain name

                # strains are always in same order when all strains in block.
                push @strNames, $name;
                push @starts, $start;
                push @strands, $strand;
                push @dna_mtrx, [ split '', $dna ];
                # total sequence in each strain w/o '-' that is conserved in all strains
                $lengths{ $name } += $len;
            }

            my $ea = each_arrayref( @dna_mtrx );
            my %gaps;
            my $cnt;
            while( my( @bases ) = $ea->() ) {
                ++$cnt;
                my %temp;
                for( 0 .. $#bases ) { # store gaps if any
                    if( $bases[$_] eq '-' ) {
                        $gaps{$_}++; # key is number, corresponds to index of other arrays
                    }
                }
                # skip gaps '-'
                unless( '-' ~~ @bases ) { $temp{ uc $_}++ for @bases } # if snp then %temp will have > 1 key
                if( keys %temp > 1 ) { # if SNP exists, get base and position for all strains in alignment
                    ++$total_snps;
                    my $pos;
                    for( 0 .. $#bases ) {
                        if( $strands[$_] eq '+' ) { $pos = $starts[$_] + $cnt - ( $gaps{$_} // 0 ) } # genome position
                        elsif( $strands[$_] eq '-' ) { $pos = $starts[$_] - $cnt - ( $gaps{$_} // 0 ) }
                        # HoAoH
                        push @{ $snps{ $strNames[$_] } }, { $pos => $bases[$_] };
                    }
                }
            }
        }
        @alignment = ();
    }
}
close $fh;
#print Dumper( \%snps ); use Data::Dumper;
say "Sum length of synteny blocks conserved in all strains, including gaps: $syn_len bp";
say "Length of conserved sequence for each strain, excluding gaps:";
for my $strain ( keys %lengths ) {
    say "$strain\t$lengths{ $strain } bp";
}

my $outfile = $infile;
$outfile =~ s/\.maf$/_snps.txt/;
open my $fh2, '>', $outfile;
say {$fh2} map{ $_ . "_base\t", $_ . "_pos\t" } keys %snps;
for my $snp ( 0 .. ( $total_snps - 1 ) ) {
    for my $strain ( keys %snps ){
        for my $href ( keys %{ $snps{ $strain }[ $snp ] } ) {
            print {$fh2} "$snps{ $strain }[ $snp ]->{ $href }\t$href\t";
            }
        }
    print {$fh2} "\n";
}
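
For the n-1, n-2 and strain-specific blocks, one possible starting point is to bin every block by the set of strains it contains instead of keeping only the blocks with all strains. This is a rough sketch, not a tested pipeline: bin_maf_blocks is a made-up name, it assumes each 's' line is a single unwrapped line, and it takes the strain name to be the part of the source field before the first dot.

```perl
use strict;
use warnings;

# Bin MAF alignment blocks by the strains they contain.
# Returns a hash ref: { strain count => { "strainA,strainB" => aligned columns } }
sub bin_maf_blocks {
    my ($fh) = @_;
    my ( %bins, @names );
    my $cols = 0;

    # Record the block collected so far, then reset for the next block.
    my $flush = sub {
        return unless @names;
        my $key = join ',', sort @names;
        $bins{ scalar @names }{$key} += $cols;
        @names = ();
        $cols  = 0;
    };

    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^a(?:\s|$)/ ) {      # start of a new block
            $flush->();
        }
        elsif ( $line =~ /^s\s/ ) {         # one aligned sequence
            my ( undef, $src, undef, undef, undef, undef, $text ) = split ' ', $line;
            ( my $strain = $src ) =~ s/\..*//;
            push @names, $strain;
            $cols = length( $text // '' );  # block width, gaps included
        }
        elsif ( $line !~ /\S/ ) {           # blank line ends a block
            $flush->();
        }
    }
    $flush->();                             # last block may lack a trailing blank line
    return \%bins;
}

# Usage: perl bin_blocks.pl alignment.maf
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $bins = bin_maf_blocks($fh);
    for my $n ( sort { $b <=> $a } keys %$bins ) {
        print "$n strains\t$_\t$bins->{$n}{$_} columns\n" for sort keys %{ $bins->{$n} };
    }
}
```

With n strains in total, the n-1 entries show which single strain is missing from each group of blocks, and the entries under 1 are the strain-specific regions.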


From sanketd at isquareit.ac.in  Mon Dec 31 01:46:41 2012
From: sanketd at isquareit.ac.in (Sanket Desai)
Date: Mon, 31 Dec 2012 12:16:41 +0530 (IST)
Subject: [Bioperl-l] Help in getting organism names of the nucleotide
	entries.
Message-ID: <26019826.10871.1356936401744.JavaMail.root@mail.isquareit.ac.in>

Hello,

With respect to the post:
http://bio.perl.org/pipermail/bioperl-l/2009-December/031831.html

When used for nucleotide database it gives the following error:

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: No linksets returned
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: NCBI esummary fatal error: Empty id list - nothing todo
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::Tools::EUtilities::parse_data /usr/share/perl5/Bio/Tools/EUtilities.pm:382
STACK: Bio::Tools::EUtilities::next_DocSum /usr/share/perl5/Bio/Tools/EUtilities.pm:964
STACK: Bio::DB::EUtilities::next_DocSum /usr/share/perl5/Bio/DB/EUtilities.pm:914
STACK: getOrgNameFrmAccession.pl:29
-----------------------------------------------------------

Please suggest the relevant changes in the above script to make it work for the nucleotide entries also.

Thanks in advance,
Regards,
Sanket


From fcyucn at gmail.com  Mon Dec 17 20:37:45 2012
From: fcyucn at gmail.com (Fengchao Yu)
Date: Tue, 18 Dec 2012 01:37:45 -0000
Subject: [Bioperl-l] Is there any module for the protein digestion?
Message-ID: <7b719317-57a3-46ef-927c-6b0508e1e62d@googlegroups.com>

I notice that Bio::Restriction::Enzyme<http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Restriction/Enzyme.pm> is 
for DNA digestion. I wonder if there is any module for protein digestion?

Thanks


From l.m.timmermans at students.uu.nl  Mon Dec  3 19:29:59 2012
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 4 Dec 2012 01:29:59 +0100
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <50BC102C.7080902@gmail.com>
References: <50BC102C.7080902@gmail.com>
Message-ID: <CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>

On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com> wrote:
> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
> threading (for example when using Thread::Queue::Any). I implemented hooks
> that make it transparent to serialize using Storable freeze() and thaw().

I don't think serializing a magical thingie makes much sense. Storable
is commonly used for a lot more things than interthread communication
(e.g. network communication), this would often not work under such
circumstances.

Leon


From cjfields at illinois.edu  Mon Dec  3 22:23:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 03:23:50 +0000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>
References: <50BC102C.7080902@gmail.com>
	<CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>

On Dec 3, 2012, at 6:29 PM, Leon Timmermans <l.m.timmermans at students.uu.nl> wrote:

> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com> wrote:
>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>> threading (for example when using Thread::Queue::Any). I implemented hooks
>> that make it transparent to serialize using Storable freeze() and thaw().
> 
> I don't think serializing a magical thingie makes much sense. Storable
> is commonly used for a lot more things than interthread communication
> (e.g. network communication), this would often not work under such
> circumstances.
> 
> Leon

Leon, any suggestions on alternatives?  I know this particular bit is a sore spot with MAKER at the moment, so any help would be greatly appreciated.

chris


From yongli at yeslab.com  Sat Dec  1 01:10:15 2012
From: yongli at yeslab.com (=?utf-8?B?eW9uZ2xpQHllc2xhYi5jb20=?=)
Date: Sat, 1 Dec 2012 14:10:15 +0800 (CST)
Subject: [Bioperl-l] =?utf-8?q?question_about_bioperl_program?=
Message-ID: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>

Dear Sir or Madam,

 
 I have just begun to learn BioPerl and am using a BioPerl program to extract the gene sequences of a bacterium from its genome GenBank file. I downloaded the NC_003450 gbk, ppt, and ffn files; NC_003450 is a bacterial genome GenBank record. My BioPerl code is as follows:

 
 use Bio::Seq;
 use Bio::SeqIO;

 $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank');
 # $seq_obj=$seqio_obj->next_seq;

 while($seq_obj=$seqio_obj->next_seq)
 {
   $display_name=$seq_obj->display_name;
   $desc=$seq_obj->desc;
   $seq=$seq_obj->seq;
   $acc = $seq_obj->accession_number;
   $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );
   $seqio_obj->write_seq($seq_obj);
 }

  
 After the program ran, I got only the complete genome sequence, not a file containing each gene sequence one by one. I want to get the gene sequences one by one in a FASTA file, so I am writing to you for help.
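
 A possible way to get one FASTA record per gene is to loop over the features annotated on the genome record instead of writing the record itself. The following is a sketch under assumptions: gene_sequences is a made-up helper name, it assumes the genes are annotated as 'gene' features with a 'locus_tag' tag (falling back to coordinates when the tag is absent), and it uses a separate output stream so the input stream is not overwritten inside the loop.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Return one sequence object per 'gene' feature annotated on $seq_obj.
sub gene_sequences {
    my ($seq_obj) = @_;
    my @genes;
    for my $feat ( $seq_obj->get_SeqFeatures ) {
        next unless $feat->primary_tag eq 'gene';
        # Prefer the locus_tag as identifier; fall back to the coordinates.
        my ($id) = $feat->has_tag('locus_tag')
                 ? $feat->get_tag_values('locus_tag')
                 : ();
        $id = join '_', $seq_obj->display_id, $feat->start, $feat->end
            unless defined $id;
        my $gene = $feat->spliced_seq;   # respects strand and split locations
        $gene->display_id($id);
        push @genes, $gene;
    }
    return @genes;
}

# Usage: perl genes.pl NC_003450.gbk genes.fasta
my ( $gbk, $fasta ) = @ARGV;
if ( defined $gbk and defined $fasta ) {
    my $in  = Bio::SeqIO->new( -file => $gbk,      -format => 'genbank' );
    my $out = Bio::SeqIO->new( -file => ">$fasta", -format => 'fasta' );
    while ( my $seq_obj = $in->next_seq ) {
        $out->write_seq($_) for gene_sequences($seq_obj);
    }
}
```

 Changing 'gene' to 'CDS' in the primary_tag test would restrict the output to protein-coding features.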

 
 Yong Li


From carsonhh at gmail.com  Mon Dec  3 22:35:50 2012
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 03 Dec 2012 22:35:50 -0500
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>
Message-ID: <CCE2D73A.16852%carsonhh@gmail.com>

Bio::DB::Fasta is working for maker now.  The previous issues have been
fixed, but being as Florent has gone out of his way to build a number of
improvements into Bio::DB::Fasta over the past few weeks, this seemed like
a useful one as well, so I suggested it.  One of the big uses of
Bio::DB::Fasta is the Bio::PrimarySeq::Fasta features it creates.  They
are great for manipulating the sequence without actually having to ever
keep it in memory.  It's nice because the sequence is made available on
demand, but when you try and pass them between threads, your program falls
apart. There are creative workarounds, but simply adding a serialization
hook to Bio::DB::Fasta to disconnect the database on freezing and then
reconnect on thaw also fixes it, and it makes them extremely useful for
multi-threaded applications without having to go through other kinds of
workarounds (it just makes them work as expected with serialization).
Previously I had created my own module and inherited from Bio::DB::Fasta
so I could implement the Storable hooks.  Because Storable looks for the
hooks in anything it serializes, the Bio::DB::Fasta object can even be
well down inside of a complex object and you don't have to worry about it.
Previously I've used Storable hooks to pass the Bio::PrimarySeq::Fasta
features across the network using MPI, as long as the database is on an
NFS mount it just reconnects on the other node with no issue.  If the
indexed file isn't available after deserialization over a network, you
could just throw an error when the thaw hook is called.  I'll give
Florent's changes a look over soon to give any suggestions.
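
To make the idea concrete, here is a generic sketch of the disconnect-on-freeze, reconnect-on-thaw pattern. It is not the code in Florent's branch or in MAKER: My::IndexedFile is a hypothetical stand-in class that holds a plain filehandle where Bio::DB::Fasta would hold its database handles. Only the STORABLE_freeze and STORABLE_thaw hook names and their calling conventions come from Storable itself.

```perl
use strict;
use warnings;

package My::IndexedFile;    # hypothetical stand-in for a file-backed database

sub new {
    my ( $class, $path ) = @_;
    my $self = bless { path => $path }, $class;
    $self->_connect;
    return $self;
}

# (Re)open the underlying file; dies if it is not reachable.
sub _connect {
    my ($self) = @_;
    open my $fh, '<', $self->{path}
        or die "Cannot open $self->{path}: $!";
    $self->{fh} = $fh;
    return;
}

sub first_line {
    my ($self) = @_;
    seek $self->{fh}, 0, 0;
    my $line = readline( $self->{fh} );
    chomp $line if defined $line;
    return $line;
}

# Freeze hook: serialize only the path, never the open handle.
sub STORABLE_freeze {
    my ( $self, $cloning ) = @_;
    return $self->{path};
}

# Thaw hook: restore the path and reconnect. _connect dies if the file
# is not available where the object is thawed.
sub STORABLE_thaw {
    my ( $self, $cloning, $serialized ) = @_;
    $self->{path} = $serialized;
    $self->_connect;
    return;
}

package main;
use Storable qw(freeze thaw);

# Usage:
#   my $db   = My::IndexedFile->new('genome.fasta');
#   my $copy = thaw( freeze($db) );   # $copy has its own freshly opened handle
```

Without the hooks, freeze() would refuse to store the object because of the filehandle inside it.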

Thanks,
Carson


On 12-12-03 10:23 PM, "Fields, Christopher J" <cjfields at illinois.edu>
wrote:

>On Dec 3, 2012, at 6:29 PM, Leon Timmermans
><l.m.timmermans at students.uu.nl> wrote:
>
>> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com>
>>wrote:
>>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>>> threading (for example when using Thread::Queue::Any). I implemented
>>>hooks
>>> that make it transparent to serialize using Storable freeze() and
>>>thaw().
>> 
>> I don't think serializing a magical thingie makes much sense. Storable
>> is commonly used for a lot more things than interthread communication
>> (e.g. network communication), this would often not work under such
>> circumstances.
>> 
>> Leon
>
>Leon, any suggestions on alternatives?  I know this particular bit is a
>sore spot with MAKER at the moment, so any help would be greatly
>appreciated.
>
>chris
>


From jason.r.gallant at gmail.com  Tue Dec  4 15:23:02 2012
From: jason.r.gallant at gmail.com (Jason Gallant)
Date: Tue, 4 Dec 2012 12:23:02 -0800 (PST)
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header
Message-ID: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>

Hello,

I'm trying to retrieve fasta sequences that contain a colon in their 
header.  However, I cannot get my BioPerl script to do this!!

It works as expected when the header does not contain the colon, however 
doesn't return anything when it does.  Weirdly, when I ask it to return the 
parsed IDs (see below), it returns the appropriate IDs, which include the 
colon!  Very confusing, would appreciate any help!!

Many Thanks,
Jason Gallant


use strict;
use Bio::SearchIO; 
use Bio::DB::Fasta;


my ($file,$id,$start,$end) = 
("secondround_merged_expanded.fasta","C7047455:0-100",1,10);


my $db = Bio::DB::Fasta->new($file, -reindex=>1);
my $seq = $db->seq($id,$start,$end);
 
print $db->ids;

print $seq,"\n";


From asjo at koldfront.dk  Tue Dec  4 15:53:08 2012
From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=)
Date: Tue, 04 Dec 2012 21:53:08 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	(Francesco Musacchia's message of "Wed, 28 Nov 2012 02:27:16 -0800
	(PST)")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
Message-ID: <87y5hdletn.fsf@topper.koldfront.dk>

On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:

> I'm experiencing that when I have to do a lot of accesses on a GFF
> database (with Bio::DB::SeqFeature::Store) the slowness increases until
> my script can stay running for more than a day.

First you'll need to find out what/where exactly it is slow. One way to
do so is using a profiler; this is a good one for Perl:

 * https://metacpan.org/module/Devel::NYTProf

If you want more specific suggestions, you'll probably have to provide
more information.


  Good luck!

    Adam

-- 
 "As Knuth pointed out long ago, speed only matters           Adam Sjøgren
  in certain critical bottlenecks. And as many           asjo at koldfront.dk
  programmers have observed since, one is very often
  mistaken about where these bottlenecks are."


From cjfields at illinois.edu  Tue Dec  4 16:10:00 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 21:10:00 +0000
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <87y5hdletn.fsf@topper.koldfront.dk>
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	<87y5hdletn.fsf@topper.koldfront.dk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>


On Dec 4, 2012, at 2:53 PM, Adam Sjøgren <asjo at koldfront.dk>
 wrote:

> On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:
> 
>> I'm experiencing that when I have to do a lot of accessess on a GFF
>> database (with Bio:DB::SeqFeature::Store) the slowness increase until
>> my script can stay running for more than a day.
> 
> First you'll need to find out what/where exactly it is slow. One way to
> do so is using a a profiler; this is a good one for Perl:
> 
> * https://metacpan.org/module/Devel::NYTProf
> 
> If you want more specific suggestions, you'll probably have to provide
> more information.
> 
> 
>  Good luck!
> 
>    Adam

If anything, we need more profiling of Bioperl code.  Ah, if we only had infinite time... :)

chris


From asjo at koldfront.dk  Tue Dec  4 16:33:55 2012
From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=)
Date: Tue, 04 Dec 2012 22:33:55 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
	(Christopher J. Fields's message of "Tue, 4 Dec 2012 21:10:00 +0000")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	<87y5hdletn.fsf@topper.koldfront.dk>
	<118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
Message-ID: <87txs1jyd8.fsf@topper.koldfront.dk>

On Tue, 4 Dec 2012 21:10:00 +0000, Fields, wrote:

> If anything, we need more profiling of Bioperl code. Ah, if we only
> had infinite time... :)

If we had that, we wouldn't need profiling!


  ;-),

   Adam

-- 
 "On the quiet side. Somewhat peculiar. A good                Adam Sjøgren
  companion, in a weird sort of way."                    asjo at koldfront.dk


From florent.angly at gmail.com  Tue Dec  4 16:52:41 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Wed, 05 Dec 2012 07:52:41 +1000
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta
	Header
In-Reply-To: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>
References: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>
Message-ID: <50BE70A9.4060404@gmail.com>

Hi Jason,

See the documentation for seq() at
http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/DB/Fasta.pm#OBJECT_METHODS

When you call seq() with a single argument, e.g. 
$db->seq('C7047455:0-100'), Bio::DB::Fasta interprets it as a compound 
ID and looks for position 0 to 100 of a sequence called C7047455. This 
is a feature that has been in Bio::DB::Fasta since the dawn of time. In 
this form, seq() expects a colon as part of the compound ID, which is 
problematic because your sequence ID actually contains a colon.

I think that when you call $db->seq($id,$start,$end), Bio::DB::Fasta 
does not attempt to parse your ID. This is why your code works with this 
form. Note that if you want to get the entirety of a sequence called 
'C7047455:0-100', the easiest approach when your sequence names contain 
a colon is to use $db->get_Seq_by_id('C7047455:0-100'), since 
get_Seq_by_id() only takes a regular ID (not a compound one).
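
To illustrate why a colon in a sequence name is ambiguous in the
single-argument form, here is a small sketch of compound-ID parsing.
parse_compound_id() is a hypothetical helper written for this example; it
is not the Bio::DB::Fasta implementation, whose parsing rules may differ:

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the Bio::DB::Fasta code: split a compound ID of
# the form "name:start-end" into its parts. A string without a trailing
# ":start-end" is returned unchanged as a plain ID.
sub parse_compound_id {
    my ($compound) = @_;
    if ( $compound =~ /^(.+):(\d+)-(\d+)$/ ) {
        return ( $1, $2, $3 );
    }
    return ( $compound, undef, undef );
}

my ( $name, $start, $end ) = parse_compound_id('C7047455:0-100');
print "name=$name start=$start end=$end\n";
```

Under this reading the sequence looked up is 'C7047455', not
'C7047455:0-100', so a lookup keyed on the full header finds nothing.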

Florent


On 05/12/12 06:23, Jason Gallant wrote:
> Hello,
>
> I'm trying to retreive fasta sequences that contain a colon in their
> header.  However, I cannot get my BioPerl script to do this!!
>
> It works as expected when the header does not contain the colon, however
> doesn't return anything when it does.  Weirdly, when I ask it to return the
> parsed IDs (see below), it returns the appropriate IDs, which include the
> colon!  Very confusing, would appreciate any help!!
>
> Many Thanks,
> Jason Gallant
>
>
> use strict;
> use Bio::SearchIO;
> use Bio::DB::Fasta;
>
>
> my ($file,$id,$start,$end) =
> ("secondround_merged_expanded.fasta","C7047455:0-100",1,10);
>
>
> my $db = Bio::DB::Fasta->new($file, -reindex=>1);
> my $seq = $db->seq($id,$start,$end);
>   
> print $db->ids;
>
> print $seq,"\n";
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From bosborne11 at verizon.net  Tue Dec  4 17:12:59 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 04 Dec 2012 17:12:59 -0500
Subject: [Bioperl-l] question about bioperl program
In-Reply-To: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>
References: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>
Message-ID: <16BBC477-9935-4C79-A70D-6B18716089FB@verizon.net>

Yong Li,

You want to take a look at this HOWTO:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Those genes you see in the file are features in the genome sequence.
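
As a rough sketch of what that means in practice: each gene is a feature
attached to the genome sequence object, so the per-gene sequences come
from looping over the features rather than from writing the sequence
object itself. The helper name cds_records() and the fallback ID format
are assumptions made for this example, and CDS features are used here as
one possible choice of feature type:

```perl
use strict;
use warnings;

# Collect one [id, sequence] pair per CDS feature of a Bio::Seq-like object.
sub cds_records {
    my ($seq) = @_;
    my @records;
    for my $feat ( $seq->get_SeqFeatures ) {
        next unless $feat->primary_tag eq 'CDS';

        # Prefer the locus_tag as identifier when the feature has one;
        # otherwise build an ID from the coordinates (an arbitrary choice).
        my ($id) = $feat->has_tag('locus_tag')
            ? $feat->get_tag_values('locus_tag')
            : ( $seq->display_id . '_' . $feat->start . '_' . $feat->end );

        # spliced_seq() returns a sequence object for the feature's location.
        push @records, [ $id, $feat->spliced_seq->seq ];
    }
    return @records;
}

# Only runs when a GenBank file is given, e.g.:
#   perl extract_cds.pl NC_003450.gbk > genes.fasta
if (@ARGV) {
    require Bio::SeqIO;
    my $in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'genbank' );
    while ( my $seq = $in->next_seq ) {
        print ">$_->[0]\n$_->[1]\n" for cds_records($seq);
    }
}
```

The HOWTO linked above covers the feature and annotation methods in more
detail.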

Brian O.


On Dec 1, 2012, at 1:10 AM, yongli at yeslab.com wrote:

> Dear Sir or Madam,
> 
> I have just begun learning BioPerl and am using a BioPerl program to extract bacterial gene sequences from a genome GenBank file. I downloaded the NC_003450 gbk, ppt, and ffn files. NC_003450 is a bacterial genome GenBank file. My BioPerl code is as follows:
> 
> use Bio::Seq;
>  use Bio::SeqIO;
> 
>  $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank');
>  # $seq_obj=$seqio_obj->next_seq;
> 
>  while($seq_obj=$seqio_obj->next_seq)
>  {
>    $display_name=$seq_obj->display_name;
>    $desc=$seq_obj->desc;
>    $seq=$seq_obj->seq;
>  $acc = $seq_obj->accession_number;
>  $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );
>  $seqio_obj->write_seq($seq_obj);
>  }
> 
> After the program ran, I got only the complete genome sequence, not a file containing each gene sequence. I want to get the gene sequences one by one in a FASTA file, so I am writing to you for help.
> 
> 
> 
> Yong Li
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From ankh.egypt.public at googlemail.com  Fri Dec  7 15:24:20 2012
From: ankh.egypt.public at googlemail.com (Adrian Helmchen)
Date: Fri, 07 Dec 2012 21:24:20 +0100
Subject: [Bioperl-l] proteins from an organism
Message-ID: <50C25074.8050703@googlemail.com>

Hello,

I would like to get all proteins from an organism, except proteins from
chloroplasts, those with crystal structures, or the like.

I tried to obtain these proteins by sending a query 'Arabidopsis 
thaliana[organism]'
with Bio::DB::GenBank and fetching the GI numbers from the CDS.
But on one PC I get 6000 proteins and on another PC I get 46000 proteins,
although Arabidopsis thaliana has 25000 genes.

Thank you for your help.


From nikkie.vanbers at gmail.com  Mon Dec 10 03:07:27 2012
From: nikkie.vanbers at gmail.com (Nikki2)
Date: Mon, 10 Dec 2012 00:07:27 -0800 (PST)
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly
 database
Message-ID: <34761946.post@talk.nabble.com>


Hi,

I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
'Tracheophyta' that are in NCBI's assembly database. However, there are no
DocSums returned for the uids that match the query. When I try the same
thing using the genome database it works fine.

The script that I used to do the query is at the bottom of this message. The
output I get when running the script is:

Count = 84

--------------------- WARNING ---------------------
MSG: No returned docsums.
---------------------------------------------------

I checked the @ids array and it contains the 84 uids.

My questions are as follows:

1) Is it possible to get DocSums for uids from the NCBI assembly database,
and if yes, how?
2) If not, does anyone have any suggestions how to change my script to get
the species-names that match the uids that are returned?

Thanks a lot!

Nikki


##############################################

#!/bin/perl -w

use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
                                       -db     => 'genome',
				       -email => 'my_email at gmail.com',
                                       -term   => 'Tracheophyta[organism]',
                                       -retmax => 5000);

print "Count = ",$factory->get_count,"\n";
my @ids = $factory->get_ids;

my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary',
					-email=>'my_email at gmail.com',
					-db    => 'genome',
                                        -id    => \@ids,
					-retmax => 5000);
 
while (my $ds = $factory2->next_DocSum) {
    print "ID: ",$ds->get_id,"\n";
    # flattened mode, iterates through all Item objects
    while (my $item = $ds->next_Item('flattened'))  {
        # not all Items have content, so need to check...
        printf("%-20s:%s\n",$item->get_name,$item->get_content) if
$item->get_content;
   }
    print "\n";
}


-- 
View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From cjfields at illinois.edu  Mon Dec 10 10:59:03 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 10 Dec 2012 15:59:03 +0000
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI
 assembly database
In-Reply-To: <34761946.post@talk.nabble.com>
References: <34761946.post@talk.nabble.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF4C164@CHIMBX5.ad.uillinois.edu>

Nikki, 

This is because a handful of the databases have apparently switched their docsum output completely to the DB-specific DocSum schemata (v2), which have not been implemented in Bio::EUtilities yet.  Parsing these correctly requires quite a bit of revision since the schema is per database, so I don't have a timeline for when this will be available; it will likely be implemented incrementally over time.  

See here for the announcement:

    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.Release_Notes

In the meantime, you can get the raw XML output for these by replacing the loop for $factory2 with:

    print $factory2->get_Response->content

chris


On Dec 10, 2012, at 2:07 AM, Nikki2 <nikkie.vanbers at gmail.com> wrote:

> Hi,
> 
> I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
> 'Tracheophyta' that are NCBI's assembly database. However, there are no
> DocSums returned for the uid's that match the query. When I try the same
> thing using the genome database it works fine.
> 
> The script that I used to do the query is at the bottom of this message. The
> output I get when running the script is:
> 
> Count = 84
> 
> --------------------- WARNING ---------------------
> MSG: No returned docsums.
> ---------------------------------------------------
> 
> I checked the @ids array and it contains the 84 uids.
> 
> My questions are as follows:
> 
> 1) Is it possible to get DocSums for uids from the NCBI assembly database,
> and if yes, how?
> 2) If not, does anyone have any suggestions how to change my script to get
> the species-names that match the uids that are returned?
> 
> Thanks a lot!
> 
> Nikki
> 
> 
> 
> 
> 
> 
> 
> ##############################################
> 
> #!/bin/perl -w
> 
> use Bio::DB::EUtilities;
> 
> my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
>                                       -db     => 'genome',
> 				       -email => 'my_email at gmail.com',
>                                       -term   => 'Tracheophyta[organism]',
>                                       -retmax => 5000);
> 
> print "Count = ",$factory->get_count,"\n";
> my @ids = $factory->get_ids;
> 
> my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary',
> 					-email=>'my_email at gmail.com',
> 					-db    => 'genome',
>                                        -id    => \@ids,
> 					ret_max=>5000);
> 
> while (my $ds = $factory2->next_DocSum) {
>    print "ID: ",$ds->get_id,"\n";
>    # flattened mode, iterates through all Item objects
>    while (my $item = $ds->next_Item('flattened'))  {
>        # not all Items have content, so need to check...
>        printf("%-20s:%s\n",$item->get_name,$item->get_content) if
> $item->get_content;
>   }
>    print "\n";
> }
> 
> 
> -- 
> View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From jason.stajich at gmail.com  Wed Dec 12 23:05:29 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 12 Dec 2012 20:05:29 -0800
Subject: [Bioperl-l] Asking
In-Reply-To: <201212131130153627348@gmail.com>
References: <201212131130153627348@gmail.com>
Message-ID: <7ED416EC-622E-4023-94B7-9A11D29929DC@gmail.com>

You want the reroot function. Have you tried reading the HOWTOs on the website already? 
$node is a node in the tree. There are several functions to find a node or iterate through all the nodes in the tree.
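
A minimal sketch of that approach, assuming the outgroup can be looked up
by its label with find_node(): the result is a node object, which is what
reroot() expects in place of the "$node" string in the question. The
helper name reroot_on() and the command-line layout are made up for this
example. Depending on the topology you want, you may prefer to pass an
internal node rather than the outgroup leaf itself.

```perl
use strict;
use warnings;

# Reroot $tree on the node whose id is $outgroup_id.
# Returns true on success, false if no node with that id exists.
sub reroot_on {
    my ( $tree, $outgroup_id ) = @_;
    my ($node) = $tree->find_node( -id => $outgroup_id );
    return 0 unless $node;
    return $tree->reroot($node);
}

# Only runs when arguments are given, e.g.:
#   perl reroot.pl Outgroup_label *.txt > rerooted.nwk
if (@ARGV) {
    require Bio::TreeIO;
    my ( $outgroup_id, @files ) = @ARGV;
    my $out = Bio::TreeIO->new( -format => 'newick', -fh => \*STDOUT );
    for my $file (@files) {
        my $in = Bio::TreeIO->new( -format => 'newick', -file => $file );
        while ( my $tree = $in->next_tree ) {
            reroot_on( $tree, $outgroup_id )
                or warn "No node '$outgroup_id' in $file\n";
            $out->write_tree($tree);
        }
    }
}
```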

Sent from my iPhone-please excuse typos

--
Jason Stajich

On Dec 12, 2012, at 7:30 PM, "Xing-Xing Shen" <shenxingxing2010 at gmail.com> wrote:

> Dear Jason,
> I am new to BioPerl. I have run into a problem with how to define an outgroup for a set of Newick trees.
> 
> My code is below:
> #!/usr/bin/perl
> use Bio::TreeIO;
> use Bio::Tree::NodeI;
> use Bio::Tree::Tree;
> my @filenames = glob("*.txt");
> foreach my $filename (@filenames) {
>    my $treeio = Bio::TreeIO->new('-format' => 'newick', '-file'   => "$filename");
>    while( my $tree = $treeio->next_tree ) {
>       $tree->set_root_node("$node"); # what might $node mean?
>        ..........
>        ..........
>    }
> }
> 
> 
> With best,
> 
> Xing-Xing Shen


From j.abbott at imperial.ac.uk  Thu Dec 13 14:49:15 2012
From: j.abbott at imperial.ac.uk (James Abbott)
Date: Thu, 13 Dec 2012 19:49:15 +0000
Subject: [Bioperl-l] deobfuscator broken....
Message-ID: <50CA313B.9060904@imperial.ac.uk>

Hi All,

Don't know if any admin folk are aware, but the bioperl.org 
deobfuscator is generating internal server errors. I've also been having 
problems with broken documentation links (cpan links producing the 
wrong modules, and pdoc pages missing) but can't seem to replicate that 
problem now....

I am, for now, still obfuscated...

Cheers,
James
-- 
Dr. James Abbott
Lead Bioinformatician
Bioinformatics Support Service
Imperial College, London


From p.j.a.cock at googlemail.com  Thu Dec 13 17:52:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 13 Dec 2012 22:52:44 +0000
Subject: [Bioperl-l] deobfuscator broken....
In-Reply-To: <50CA313B.9060904@imperial.ac.uk>
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>

On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
> Hi All,
>
> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator
> is generating internal server errors. I've also been having problems with
> broken documentation links (cpan links producning the wrong modules, and
> pdoc pages missing) but can't seem to replicate that problem now....
>
> I am, for now, still obfuscated...
>
> Cheers,
> James

I would guess this is a side effect of the recent server move,
CC'ing root-l in case anyone on the sys-admin team has an idea.

Peter


From cjfields at illinois.edu  Thu Dec 13 17:51:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 13 Dec 2012 22:51:50 +0000
Subject: [Bioperl-l] deobfuscator broken....
In-Reply-To: <50CA313B.9060904@imperial.ac.uk>
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF545DA@CHIMBX5.ad.uillinois.edu>

This is likely due to the back-end change in servers.  I'm not sure how this was set up but we can inquire about it.

chris

On Dec 13, 2012, at 1:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:

> Hi All,
> 
> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator is generating internal server errors. I've also been having problems with broken documentation links (cpan links producning the wrong modules, and pdoc pages missing) but can't seem to replicate that problem now....
> 
> I am, for now, still obfuscated...
> 
> Cheers,
> James
> -- 
> Dr. James Abbott
> Lead Bioinformatician
> Bioinformatics Support Service
> Imperial College, London
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From cjfields at illinois.edu  Thu Dec 13 18:13:55 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 13 Dec 2012 23:13:55 +0000
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF546F2@CHIMBX5.ad.uillinois.edu>

On Dec 13, 2012, at 4:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>> Hi All,
>> 
>> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator
>> is generating internal server errors. I've also been having problems with
>> broken documentation links (cpan links producning the wrong modules, and
>> pdoc pages missing) but can't seem to replicate that problem now....
>> 
>> I am, for now, still obfuscated...
>> 
>> Cheers,
>> James
> 
> I would guess this is a side effect from the recent server move,
> CC'ing root-l in case anyone of the sys-admin team had an idea.
> 
> Peter

Beat me by four minutes!  

The CGI code is in websites/bioperl.org/cgi/.  I'm checking on the errors now, may take me a little time to get it back up (was missing CGI, now needs to have the lib path extended).

chris


From jason.stajich at gmail.com  Thu Dec 13 18:18:26 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 13 Dec 2012 15:18:26 -0800
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
Message-ID: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>

I think it uses mysql but I don't know if that was reconstituted on the new server. 

On Dec 13, 2012, at 2:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>> Hi All,
>> 
>> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator
>> is generating internal server errors. I've also been having problems with
>> broken documentation links (cpan links producning the wrong modules, and
>> pdoc pages missing) but can't seem to replicate that problem now....
>> 
>> I am, for now, still obfuscated...
>> 
>> Cheers,
>> James
> 
> I would guess this is a side effect from the recent server move,
> CC'ing root-l in case anyone of the sys-admin team had an idea.
> 
> Peter
> _______________________________________________
> Root-l mailing list
> Root-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/root-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From nikkie.vanbers at gmail.com  Wed Dec  5 09:04:09 2012
From: nikkie.vanbers at gmail.com (Nikki2)
Date: Wed, 5 Dec 2012 06:04:09 -0800 (PST)
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly
 database
Message-ID: <34761946.post@talk.nabble.com>


Hi,

I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
'Tracheophyta' that are NCBI's assembly database. However, there are no
DocSums returned for the uid's that match the query. When I try the same
thing using the genome database it works fine.

The script that I used to do the query is at the bottom of this message. The
output I get when running the script is:

Count = 84

--------------------- WARNING ---------------------
MSG: No returned docsums.
---------------------------------------------------

I checked the @ids array and it contains the 84 uids.

My questions are as follows:

1) Is it possible to get DocSums for uids from the NCBI assembly database,
and if yes, how?
2) If not, does anyone have any suggestions how to change my script to get
the species-names that match the uids that are returned?

Thanks a lot!

Nikki


##############################################

#!/bin/perl -w

use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
                                       -db     => 'genome',
				       -email => 'my_email at gmail.com',
                                       -term   => 'Tracheophyta[organism]',
                                       -retmax => 5000);

print "Count = ",$factory->get_count,"\n";
my @ids = $factory->get_ids;

my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary',
					-email=>'my_email at gmail.com',
					-db    => 'genome',
                                        -id    => \@ids,
					ret_max=>5000);
 
while (my $ds = $factory2->next_DocSum) {
    print "ID: ",$ds->get_id,"\n";
    # flattened mode, iterates through all Item objects
    while (my $item = $ds->next_Item('flattened'))  {
        # not all Items have content, so need to check...
        printf("%-20s:%s\n",$item->get_name,$item->get_content) if
$item->get_content;
   }
    print "\n";
}


-- 
View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From online at davemessina.com  Thu Dec 13 18:41:35 2012
From: online at davemessina.com (Dave Messina)
Date: Thu, 13 Dec 2012 18:41:35 -0500
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
	<0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>
Message-ID: <A6ECFFCE-8274-4B60-B681-64AA04285059@davemessina.com>

It should be just (shudder) Berkeley DB.


On Dec 13, 2012, at 18:18, Jason Stajich <jason.stajich at gmail.com> wrote:

> I think it uses mysql but I don't know if that was reconstituted on the new server. 
> 
> On Dec 13, 2012, at 2:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
>> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>>> Hi All,
>>> 
>>> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator
>>> is generating internal server errors. I've also been having problems with
>>> broken documentation links (cpan links producning the wrong modules, and
>>> pdoc pages missing) but can't seem to replicate that problem now....
>>> 
>>> I am, for now, still obfuscated...
>>> 
>>> Cheers,
>>> James
>> 
>> I would guess this is a side effect from the recent server move,
>> CC'ing root-l in case anyone of the sys-admin team had an idea.
>> 
>> Peter
>> _______________________________________________
>> Root-l mailing list
>> Root-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/root-l
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From abualiga2 at gmail.com  Tue Dec 18 17:08:51 2012
From: abualiga2 at gmail.com (galeb abu-ali)
Date: Tue, 18 Dec 2012 17:08:51 -0500
Subject: [Bioperl-l] Fwd: how to parse maf file format
In-Reply-To: <CANPzzTtgJJwmN2TyBhDb0h=nEofSh44RED-1v9VEvRG+ZazUvA@mail.gmail.com>
References: <CANPzzTtgJJwmN2TyBhDb0h=nEofSh44RED-1v9VEvRG+ZazUvA@mail.gmail.com>
Message-ID: <CANPzzTueE6TN2vofSkwAQN-RBSTZzm7UuJLVT5kEf6O+7CV2CQ@mail.gmail.com>

Hi,

I am writing a script to parse a multiple genome alignment file in maf
format, generated with a mugsy alignment of E. coli genomes.  So far, my
script parses SNPs from synteny blocks conserved in all aligned strains,
and it excludes gaps, which is enough for a phylogenetic analysis.  I was
wondering how I can parse the remaining blocks that are not conserved in
all strains, to see what is conserved in n-1, n-2, etc. strains or unique
to each strain.  I guess this is not a BioPerl question, but it's a Perl
for biologists question so I was hoping to get some insight here.  If there
is a more appropriate forum, please let me know.

Below is my code.

many thanks!

galeb

#!/usr/local/bin/perl
use Modern::Perl '2013';
use autodie;
use List::MoreUtils qw/ each_arrayref /;

# gsa 18.12.2012
# parse mugsy multiple genome alignment for SNPs in synteny blocks conserved in all aligned strains
=head
##maf version=1 scoring=mugsy
a score=7891 label=40 mult=4
s O55H7_RM12579.O55H7_RM12579        1596752 7262 + 5263980 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCG
s O55H7_CB9615.O55H7_CB9615        1604426 7262 + 5386352 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_Sakai.O157H7_Sakai        1787303 7068 + 5498450 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_EDL.O157H7_EDL933        1729749 7082 + 5528445 CGGGATGCGGGAATGGGAATGCCTTGGTTGACGGGGTGGCGGAAT

a score=6756 label=41 mult=4
s O55H7_RM12579.O55H7_RM12579        1986265 6749 + 5263980 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGG
s O55H7_CB9615.O55H7_CB9615        1991733 6749 + 5386352 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_Sakai.O157H7_Sakai        3940728 6751 - 5498450 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_EDL.O157H7_EDL933        4260689 4042 - 5528445 ---------------------------------------------
=cut

my $infile = shift or die "Usage: $0 <alignment_file.maf>\n";

my %snps;
my $strains = 0;
my @alignment;
my( $score, $blkLen, $mult );
my $total_snps;
my $syn_len;
my %lengths;

open my $fh, '<', $infile;

while( <$fh> ) {
    next if /^#/;
    chomp;

    if( /^a/ ) {
        ( $score, $blkLen, $mult ) = ( split )[1,2,3];
        $score =~ s/score\=(\d+)/$1/; # length of alignment block including '-'
        $blkLen =~ s/label\=(\d+)/$1/; # alignment block number; numbers ranked on alignment length
        $mult  =~ s/mult\=(\d+)/$1/; # number of strains aligned in block

        $strains = $mult if $mult > $strains; # total number of strains in alignment
    }
    elsif( /^s/ ) { push @alignment, $_ }

    elsif( /^$/ || ! length $_ ) {
        my( @strNames, @starts, @strands, @dna_mtrx );
        # if sequence conserved in all strains
        if( $strains == @alignment ) {
            $syn_len += $score; # total aligned sequence in all strains
            for( @alignment ) {
                # name, align start, align length (w/o '-'), direction, align sequence w/ '-'
                my( $name, $start, $len, $strand, $dna ) = ( split /\s+/ )[ 1, 2, 3, 4, 6 ];
                #$name =~ s/.*\.(.*)/$1/; # remove duplicated strain name

                # strains are always in same order when all strains in block.
                push @strNames, $name;
                push @starts, $start;
                push @strands, $strand;
                push @dna_mtrx, [ split '', $dna ];
                # total sequence in each strain w/o '-' that is conserved in all strains
                $lengths{ $name } += $len;
            }

            my $ea = each_arrayref( @dna_mtrx );
            my %gaps;
            my $cnt;
            while( my( @bases ) = $ea->() ) {
                ++$cnt;
                my %temp;
                for( 0 .. $#bases ) { # store gaps if any
                    if( $bases[$_] eq '-' ) {
                        $gaps{$_}++; # key is number, corresponds to index of other arrays
                    }
                }
                # skip gaps '-'
                unless( '-' ~~ @bases ) { $temp{ uc $_}++ for @bases } # if snp then %temp will have > 1 key
                if( keys %temp > 1 ) { # if SNP exists, get base and position for all strains in alignment
                    ++$total_snps;
                    my $pos;
                    for( 0 .. $#bases ) {
                        if( $strands[$_] eq '+' ) { $pos = $starts[$_] + $cnt - ( $gaps{$_} // 0 ) } # genome position
                        elsif( $strands[$_] eq '-' ) { $pos = $starts[$_] - $cnt - ( $gaps{$_} // 0 ) }
                        # HoAoH
                        push @{ $snps{ $strNames[$_] } }, { $pos => $bases[$_] };
                    }
                }
            }
        }
        @alignment = ();
    }
}
close $fh;
#print Dumper( \%snps ); use Data::Dumper;
say "Sum length of synteny blocks conserved in all strains, including gaps: $syn_len bp";
say "Length of conserved sequence for each strain, excluding gaps:";
for my $strain ( keys %lengths ) {
    say "$strain\t$lengths{ $strain } bp";
}

my $outfile = $infile;
$outfile =~ s/\.maf$/_snps.txt/;
open my $fh2, '>', $outfile or die "Cannot open $outfile: $!";
say {$fh2} map{ $_ . "_base\t", $_ . "_pos\t" } keys %snps;
for my $snp ( 0 .. ( $total_snps - 1 ) ) {
    for my $strain ( keys %snps ){
        for my $href ( keys %{ $snps{ $strain }[ $snp ] } ) {
            print {$fh2} "$snps{ $strain }[ $snp ]->{ $href }\t$href\t";
        }
    }
    print {$fh2} "\n";
}


From sanketd at isquareit.ac.in  Mon Dec 31 01:46:41 2012
From: sanketd at isquareit.ac.in (Sanket Desai)
Date: Mon, 31 Dec 2012 12:16:41 +0530 (IST)
Subject: [Bioperl-l] Help in getting organism names of the nucleotide
	entries.
Message-ID: <26019826.10871.1356936401744.JavaMail.root@mail.isquareit.ac.in>

Hello,

With respect to the post:
http://bio.perl.org/pipermail/bioperl-l/2009-December/031831.html

When used for the nucleotide database, it gives the following errors:

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: No linksets returned
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: NCBI esummary fatal error: Empty id list - nothing todo
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::Tools::EUtilities::parse_data /usr/share/perl5/Bio/Tools/EUtilities.pm:382
STACK: Bio::Tools::EUtilities::next_DocSum /usr/share/perl5/Bio/Tools/EUtilities.pm:964
STACK: Bio::DB::EUtilities::next_DocSum /usr/share/perl5/Bio/DB/EUtilities.pm:914
STACK: getOrgNameFrmAccession.pl:29
-----------------------------------------------------------

Please suggest the relevant changes in the above script to make it work for the nucleotide entries also.

Thanks in advance,
Regards,
Sanket


From fcyucn at gmail.com  Mon Dec 17 20:37:45 2012
From: fcyucn at gmail.com (Fengchao Yu)
Date: Tue, 18 Dec 2012 01:37:45 -0000
Subject: [Bioperl-l] Is there any module for the protein digestion?
Message-ID: <7b719317-57a3-46ef-927c-6b0508e1e62d@googlegroups.com>

I notice that Bio::Restriction::Enzyme
<http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Restriction/Enzyme.pm>
is for DNA digestion. Is there any module for protein digestion?

Thanks


From florent.angly at gmail.com  Mon Dec  3 02:36:28 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 03 Dec 2012 12:36:28 +1000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
Message-ID: <50BC102C.7080902@gmail.com>

Hi all,

This is in response to Carson Holt's report that Bio::DB::Fasta does not 
play well with threads: https://redmine.open-bio.org/issues/3397

The first issue is the serialization of Bio::DB::IndexedBase-inheriting 
(e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for 
threading (for example when using Thread::Queue::Any). I implemented 
hooks that make it transparent to serialize using Storable freeze() and 
thaw().

Another issue was the lack of communication between different 
Bio::DB::IndexedBase instances, which means that an instance could 
easily be writing or deleting the database that another instance is 
working on. To fix this, I needed some form of locking.

Some Bio::DB::IndexedBase database backends (DB_File) have some support 
for locking, but Bio::DB::IndexedBase also supports other database 
backends for which there is no native locking mechanism. So, I had to 
come up with a more general solution: a lock file. I noticed that 
Bio::DB::SeqFeature::Store::berkeleydb has a locking mechanism, based on 
flock(), which means that it does not work with NFS-mounted filesystems. 
All the Bioperl-based scripts I (and most likely many others) write run 
on servers that use NFS, so this support is important. I have found only 
one way to do the NFS locking safely, using File::SharedNFSLock. It has 
a few downsides though:
     1/ it is an external dependency,
     2/ it does not work on FAT filesystems (mostly restricted to USB 
sticks nowadays), where the lock is never acquired, and
     3/ at the moment, it requires a patch to work in threaded context 
(https://rt.cpan.org/Public/Bug/Display.html?id=81597)
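
A minimal sketch of the general lock-file idea, using only core Perl. This
is not the code in the storable_db branch and it is not how
File::SharedNFSLock works: it relies on O_EXCL, which should not be trusted
on NFS without checking your setup. The acquire_lock/release_lock helpers
and the lock file name are hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(O_CREAT O_EXCL O_WRONLY);
use File::Temp qw(tempdir);

# Try to create the lock file atomically; return 1 on success, 0 otherwise.
# (Hypothetical helper; assumes a local filesystem.)
sub acquire_lock {
    my ($lock_path) = @_;
    sysopen( my $fh, $lock_path, O_CREAT | O_EXCL | O_WRONLY ) or return 0;
    print {$fh} "$$\n";    # record the owning process id
    close $fh;
    return 1;
}

# Remove the lock file so that another instance can take the lock.
sub release_lock {
    my ($lock_path) = @_;
    return unlink $lock_path;
}

my $dir  = tempdir( CLEANUP => 1 );
my $lock = "$dir/example.index.lock";    # hypothetical lock file name

my $first  = acquire_lock($lock);    # lock is free
my $second = acquire_lock($lock);    # lock is already held
release_lock($lock);
my $third  = acquire_lock($lock);    # lock is free again
release_lock($lock);

print "first=$first second=$second third=$third\n";
# prints "first=1 second=0 third=1"
```

An instance that fails to get the lock would normally sleep and retry until
the indexing instance releases it.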

Note that while I have now added basic support for threads in 
Bio::DB::IndexedBase, I still get segfaults in specific cases, 
for example when returning a database or sequence object. This might be 
related to this issue: 
https://rt.perl.org/rt3/Ticket/Display.html?id=115972. Beyond this, the 
new code seems to work nicely. See the branch 
https://github.com/bioperl/bioperl-live/tree/storable_db if you want to 
test yourself. For example, one can now run multiple threads, each of 
them creating a Bio::DB::Fasta database from the same FASTA file: the 
first thread performs the indexing while the others wait nicely for the 
indexing to be finished to query the database.

Comments welcome. Regards,

Florent


From l.m.timmermans at students.uu.nl  Tue Dec  4 00:29:59 2012
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 4 Dec 2012 01:29:59 +0100
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <50BC102C.7080902@gmail.com>
References: <50BC102C.7080902@gmail.com>
Message-ID: <CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>

On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com> wrote:
> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
> threading (for example when using Thread::Queue::Any). I implemented hooks
> that make it transparent to serialize using Storable freeze() and thaw().

I don't think serializing a magical thingie makes much sense. Storable
is commonly used for a lot more things than interthread communication
(e.g. network communication), this would often not work under such
circumstances.

Leon


From cjfields at illinois.edu  Tue Dec  4 03:23:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 03:23:50 +0000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>
References: <50BC102C.7080902@gmail.com>
	<CAC1jpXCFHTb1DCsuKxLy1==qrKmSgRwqxuj155iyf7yyR=-WDQ@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>

On Dec 3, 2012, at 6:29 PM, Leon Timmermans <l.m.timmermans at students.uu.nl> wrote:

> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com> wrote:
>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>> threading (for example when using Thread::Queue::Any). I implemented hooks
>> that make it transparent to serialize using Storable freeze() and thaw().
> 
> I don't think serializing a magical thingie makes much sense. Storable
> is commonly used for a lot more things than interthread communication
> (e.g. network communication), this would often not work under such
> circumstances.
> 
> Leon

Leon, any suggestions on alternatives?  I know this particular bit is a sore spot with MAKER at the moment, so any help would be greatly appreciated.

chris


From yongli at yeslab.com  Sat Dec  1 06:10:15 2012
From: yongli at yeslab.com (yongli at yeslab.com)
Date: Sat, 1 Dec 2012 14:10:15 +0800 (CST)
Subject: [Bioperl-l] question about bioperl program
Message-ID: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>

Dear Sir or Madam,

 
 I have just begun learning BioPerl and am using a BioPerl program to extract bacterial gene sequences from a genome GenBank file. I downloaded the NC_003450.gbk, ppt and ffn files; NC_003450 is a bacterial genome GenBank file. My BioPerl code is as follows:

 
 use Bio::Seq;

  use Bio::SeqIO;

  
  $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank');

  # $seq_obj=$seqio_obj->next_seq;

  
  while($seq_obj=$seqio_obj->next_seq)

  {

    $display_name=$seq_obj->display_name;

    $desc=$seq_obj->desc;

    $seq=$seq_obj->seq;

  $acc = $seq_obj->accession_number;

  $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );

  $seqio_obj->write_seq($seq_obj);

  }

  
 After the program ran I got only the complete genome sequence, not a file containing each gene sequence one by one. I want to get the gene sequences one by one in a FASTA file, so I am writing to you for help.

 
 Yong Li


From carsonhh at gmail.com  Tue Dec  4 03:35:50 2012
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 03 Dec 2012 22:35:50 -0500
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>
Message-ID: <CCE2D73A.16852%carsonhh@gmail.com>

Bio::DB::Fasta is working for maker now.  The previous issues have been
fixed, but since Florent has gone out of his way to build a number of
improvements into Bio::DB::Fasta over the past few weeks, this seemed like
a useful one as well, so I suggested it.  One of the big uses of
Bio::DB::Fasta is the Bio::PrimarySeq::Fasta features it creates.  They
are great for manipulating the sequence without actually having to ever
keep it in memory.  It's nice because the sequence is made available on
demand, but when you try and pass them between threads, your program falls
apart. There are creative workarounds, but simply adding a serialization
hook to Bio::DB::Fasta to disconnect the database on freezing and then
reconnect on thaw also fixes it, and it makes them extremely useful for
multi-threaded applications without having to go through other kinds of
workarounds (it just makes them work as expected with serialization).
Previously I had created my own module and inherited from Bio::DB::Fasta
so I could implement the Storable hooks.  Because Storable looks for the
hooks in anything it serializes, the Bio::DB::Fasta object can even be
well down inside of a complex object and you don't have to worry about it.
Previously I've used Storable hooks to pass the Bio::PrimarySeq::Fasta
features across the network using MPI, as long as the database is on an
NFS mount it just reconnects on the other node with no issue.  If the
indexed file isn't available after deserialization over a network, you
could just throw an error when the thaw hook is called.  I'll give
Florent's changes a look over soon to give any suggestions.
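
A minimal sketch of the disconnect-on-freeze, reconnect-on-thaw idea
described above, using only core Perl. My::FileBackedDB is a hypothetical
stand-in class, not Bio::DB::Fasta and not the code in Florent's branch; the
only real API used is Storable's STORABLE_freeze/STORABLE_thaw hook pair.

```perl
package My::FileBackedDB;    # hypothetical class, not part of BioPerl
use strict;
use warnings;

sub new {
    my ( $class, $path ) = @_;
    my $self = bless { path => $path }, $class;
    $self->_connect;
    return $self;
}

# Open the backing file; dies if it is not reachable from this process/node.
sub _connect {
    my ($self) = @_;
    open my $fh, '<', $self->{path}
        or die "Cannot open $self->{path}: $!";
    $self->{fh} = $fh;
    return;
}

sub first_line {
    my ($self) = @_;
    seek $self->{fh}, 0, 0;
    my $line = readline( $self->{fh} );
    chomp $line;
    return $line;
}

# Called by Storable when freezing: serialize only the path, never the handle.
sub STORABLE_freeze {
    my ( $self, $cloning ) = @_;
    return $self->{path};
}

# Called by Storable when thawing: restore the path and reopen the file.
sub STORABLE_thaw {
    my ( $self, $cloning, $serialized ) = @_;
    $self->{path} = $serialized;
    $self->_connect;
    return;
}

package main;
use Storable qw(freeze thaw);
use File::Temp qw(tempfile);

my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} ">seq1\nACGT\n";
close $tmp_fh;

my $db   = My::FileBackedDB->new($tmp_path);
my $copy = thaw( freeze($db) );    # round trip, as a thread queue would do
print $copy->first_line, "\n";     # prints ">seq1"
```

Because the die in _connect fires inside the thaw hook, a copy that arrives
somewhere the file does not exist fails loudly at deserialization time.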

Thanks,
Carson


On 12-12-03 10:23 PM, "Fields, Christopher J" <cjfields at illinois.edu>
wrote:

>On Dec 3, 2012, at 6:29 PM, Leon Timmermans
><l.m.timmermans at students.uu.nl> wrote:
>
>> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly <florent.angly at gmail.com>
>>wrote:
>>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>>> threading (for example when using Thread::Queue::Any). I implemented
>>>hooks
>>> that make it transparent to serialize using Storable freeze() and
>>>thaw().
>> 
>> I don't think serializing a magical thingie makes much sense. Storable
>> is commonly used for a lot more things than interthread communication
>> (e.g. network communication), this would often not work under such
>> circumstances.
>> 
>> Leon
>
>Leon, any suggestions on alternatives?  I know this particular bit is a
>sore spot with MAKER at the moment, so any help would be greatly
>appreciated.
>
>chris
>


From jason.r.gallant at gmail.com  Tue Dec  4 20:23:02 2012
From: jason.r.gallant at gmail.com (Jason Gallant)
Date: Tue, 4 Dec 2012 12:23:02 -0800 (PST)
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header
Message-ID: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>

Hello,

I'm trying to retrieve fasta sequences that contain a colon in their 
header.  However, I cannot get my BioPerl script to do this!!

It works as expected when the header does not contain the colon, however 
doesn't return anything when it does.  Weirdly, when I ask it to return the 
parsed IDs (see below), it returns the appropriate IDs, which include the 
colon!  Very confusing, would appreciate any help!!

Many Thanks,
Jason Gallant


use strict;
use Bio::SearchIO; 
use Bio::DB::Fasta;


my ($file,$id,$start,$end) = 
("secondround_merged_expanded.fasta","C7047455:0-100",1,10);


my $db = Bio::DB::Fasta->new($file, -reindex=>1);
my $seq = $db->seq($id,$start,$end);
 
print $db->ids;

print $seq,"\n";


From asjo at koldfront.dk  Tue Dec  4 20:53:08 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 21:53:08 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	(Francesco Musacchia's message of "Wed, 28 Nov 2012 02:27:16 -0800
	(PST)")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
Message-ID: <87y5hdletn.fsf@topper.koldfront.dk>

On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:

> I'm experiencing that when I have to do a lot of accesses on a GFF
> database (with Bio::DB::SeqFeature::Store) the slowness increases until
> my script can stay running for more than a day.

First you'll need to find out what/where exactly it is slow. One way to
do so is using a profiler; this is a good one for Perl:

 * https://metacpan.org/module/Devel::NYTProf

If you want more specific suggestions, you'll probably have to provide
more information.
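
Devel::NYTProf is normally run as `perl -d:NYTProf script.pl` followed by
`nytprofhtml`. Where installing it is not an option, a coarse first look can
be had with core Perl alone. The sketch below is only an illustration of
that fallback: the timed() helper and fake_query() stand-in are
hypothetical, and the real database calls would go inside the code block
passed to timed().

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my %elapsed;    # label => total wall-clock seconds
my %calls;      # label => number of calls

# Run $code and add its wall-clock time to the running total for $label.
sub timed {
    my ( $label, $code ) = @_;
    my $t0     = [gettimeofday];
    my @result = $code->();
    $elapsed{$label} += tv_interval($t0);
    $calls{$label}++;
    return wantarray ? @result : $result[0];
}

# Stand-in for a slow database access (hypothetical).
sub fake_query {
    my ($n) = @_;
    my $sum = 0;
    $sum += $_ for 1 .. $n;
    return $sum;
}

my $total = 0;
$total += timed( query => sub { fake_query(10_000) } ) for 1 .. 5;

# One summary line per label: call count and accumulated time.
printf "%-10s %d calls, %.6f s\n", $_, $calls{$_}, $elapsed{$_}
    for sort keys %elapsed;
```

If the time per call grows as the run progresses, that points at the
accesses themselves rather than at the surrounding script.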


  Good luck!

    Adam

-- 
 "As Knuth pointed out long ago, speed only matters           Adam Sjøgren
  in certain critical bottlenecks. And as many           asjo at koldfront.dk
  programmers have observed since, one is very often
  mistaken about where these bottlenecks are."


From cjfields at illinois.edu  Tue Dec  4 21:10:00 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 21:10:00 +0000
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <87y5hdletn.fsf@topper.koldfront.dk>
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	<87y5hdletn.fsf@topper.koldfront.dk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>


On Dec 4, 2012, at 2:53 PM, Adam Sjøgren <asjo at koldfront.dk>
 wrote:

> On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:
> 
>> I'm experiencing that when I have to do a lot of accesses on a GFF
>> database (with Bio::DB::SeqFeature::Store) the slowness increases until
>> my script can stay running for more than a day.
> 
> First you'll need to find out what/where exactly it is slow. One way to
> do so is using a profiler; this is a good one for Perl:
> 
> * https://metacpan.org/module/Devel::NYTProf
> 
> If you want more specific suggestions, you'll probably have to provide
> more information.
> 
> 
>  Good luck!
> 
>    Adam

If anything, we need more profiling of Bioperl code.  Ah, if we only had infinite time... :)

chris


From asjo at koldfront.dk  Tue Dec  4 21:33:55 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 22:33:55 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
	(Christopher J. Fields's message of "Tue, 4 Dec 2012 21:10:00 +0000")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
	<87y5hdletn.fsf@topper.koldfront.dk>
	<118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
Message-ID: <87txs1jyd8.fsf@topper.koldfront.dk>

On Tue, 4 Dec 2012 21:10:00 +0000, Fields, wrote:

> If anything, we need more profiling of Bioperl code. Ah, if we only
> had infinite time... :)

If we had that, we wouldn't need profiling!


  ;-),

   Adam

-- 
 "On the quiet side. Somewhat peculiar. A good                Adam Sjøgren
  companion, in a weird sort of way."                    asjo at koldfront.dk


From florent.angly at gmail.com  Tue Dec  4 21:52:41 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Wed, 05 Dec 2012 07:52:41 +1000
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta
	Header
In-Reply-To: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>
References: <c996f6bc-458a-461a-bcc5-3a567eb06e85@googlegroups.com>
Message-ID: <50BE70A9.4060404@gmail.com>

Hi Jason,

See the documentation for seq() at 
http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/DB/Fasta.pm#OBJECT_METHODS.

When you call seq() with a single argument, e.g. 
$db->seq('C7047455:0-100'), Bio::DB::Fasta interprets it as a compound 
ID and looks for position 0 to 100 of a sequence called C7047455. This 
is a feature that has been in Bio::DB::Fasta since the dawn of time. In 
this form, seq() expects a colon as part of the compound ID, which is 
problematic because your sequence ID actually contains a colon.
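
To make the ambiguity concrete, here is a small illustration in plain Perl.
The parse_compound_id() function and its regular expression are
hypothetical and written only for this example; they are not
Bio::DB::Fasta's own code, which may parse compound IDs differently.

```perl
use strict;
use warnings;

# Hypothetical parser, for illustration only: split "id:start-end" into
# its three parts, or return the string unchanged as a plain ID.
sub parse_compound_id {
    my ($compound) = @_;
    if ( $compound =~ /^(.+):(\d+)-(\d+)$/ ) {
        return ( $1, $2, $3 );
    }
    return ( $compound, undef, undef );
}

# A sequence that is literally named 'C7047455:0-100' is indistinguishable
# from a request for positions 0 to 100 of a sequence named 'C7047455'.
my ( $id, $start, $end ) = parse_compound_id('C7047455:0-100');
print "id=$id start=$start end=$end\n";
# prints "id=C7047455 start=0 end=100"
```

Any parser of this kind has to pick one reading, so a lookup that skips
compound-ID parsing is the safer route for names like this.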

I think that when you call $db->seq($id,$start,$end), Bio::DB::Fasta 
does not attempt to parse your ID. This is why your code works with this 
form. Note that if you want to get the entirety of a sequence called 
'C7047455:0-100', the easiest approach when your sequence names contain a 
colon is to use $db->get_Seq_by_id('C7047455:0-100'), since get_Seq_by_id() 
only takes a regular ID (not a compound one).

Florent


On 05/12/12 06:23, Jason Gallant wrote:
> Hello,
>
> I'm trying to retrieve fasta sequences that contain a colon in their
> header.  However, I cannot get my BioPerl script to do this!!
>
> It works as expected when the header does not contain the colon, however
> doesn't return anything when it does.  Weirdly, when I ask it to return the
> parsed IDs (see below), it returns the appropriate IDs, which include the
> colon!  Very confusing, would appreciate any help!!
>
> Many Thanks,
> Jason Gallant
>
>
> use strict;
> use Bio::SearchIO;
> use Bio::DB::Fasta;
>
>
> my ($file,$id,$start,$end) =
> ("secondround_merged_expanded.fasta","C7047455:0-100",1,10);
>
>
> my $db = Bio::DB::Fasta->new($file, -reindex=>1);
> my $seq = $db->seq($id,$start,$end);
>   
> print $db->ids;
>
> print $seq,"\n";
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From bosborne11 at verizon.net  Tue Dec  4 22:12:59 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 04 Dec 2012 17:12:59 -0500
Subject: [Bioperl-l] question about bioperl program
In-Reply-To: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>
References: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>
Message-ID: <16BBC477-9935-4C79-A70D-6B18716089FB@verizon.net>

Yong Li,

You want to take a look at this HOWTO:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Those genes you see in the file are features in the genome sequence.
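
A sketch of what that can look like, assuming the genes of interest are
annotated as CDS features and that a locus_tag is a suitable FASTA ID (both
are assumptions about the file, and the write_cds_fasta() helper and the
genes.fasta output name are made up for this example). Note that the
script in the question also reassigns $seqio_obj inside the loop, which
replaces the input stream with the output stream; the sketch keeps the two
apart.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Write every CDS feature of a GenBank file as one FASTA record.
# Returns the number of records written. (Hypothetical helper.)
sub write_cds_fasta {
    my ( $genbank_path, $fasta_path ) = @_;
    my $in = Bio::SeqIO->new( -file => $genbank_path, -format => 'genbank' );

    # Open the output once, under its own name, outside the loop.
    my $out = Bio::SeqIO->new( -file => ">$fasta_path", -format => 'fasta' );

    my $count = 0;
    while ( my $seq_obj = $in->next_seq ) {
        for my $feat ( $seq_obj->get_SeqFeatures ) {
            next unless $feat->primary_tag eq 'CDS';

            # spliced_seq() takes strand and joined locations into account.
            my $gene_seq = $feat->spliced_seq;

            # Assumption: use the locus_tag as ID, else a running number.
            my ($id) = $feat->has_tag('locus_tag')
                ? $feat->get_tag_values('locus_tag')
                : ( 'cds_' . ( $count + 1 ) );
            $gene_seq->display_id($id);

            $out->write_seq($gene_seq);
            $count++;
        }
    }
    return $count;
}

# Usage: perl extract_cds.pl NC_003450.gbk genes.fasta
if ( @ARGV == 2 ) {
    my $n = write_cds_fasta(@ARGV);
    print "Wrote $n CDS sequences to $ARGV[1]\n";
}
```

The HOWTO above covers the feature and tag methods in more detail, for
example if 'gene' features or protein translations are wanted instead.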

Brian O.


On Dec 1, 2012, at 1:10 AM, yongli at yeslab.com wrote:

> Dear Sir or Madam,
> 
> 
> 
> I just begin to learn bioperl and am using a bioperl program to extract a kind of bacterial genes sequences from genome genbank file. I download NC_003450.gbk, ppt, ffn files. NC_003450 is a kind of bacterial genome genbank file. My bioperl code as follows:
> 
> 
> 
> use Bio::Seq;
> 
>  use Bio::SeqIO;
> 
> 
> 
>  $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank');
> 
>  # $seq_obj=$seqio_obj->next_seq;
> 
> 
> 
>  while($seq_obj=$seqio_obj->next_seq)
> 
>  {
> 
>    $display_name=$seq_obj->display_name;
> 
>    $desc=$seq_obj->desc;
> 
>    $seq=$seq_obj->seq;
> 
>  $acc = $seq_obj->accession_number;
> 
>  $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );
> 
>  $seqio_obj->write_seq($seq_obj);
> 
>  }
> 
> 
> 
> After the program runned I just gain a genome complete sequence without a file including every gene sequence one by one. I want to get the genes sequences one by one in a fasta files.  So I write you for help.
> 
> 
> 
> Yong Li
> 
> 


From ankh.egypt.public at googlemail.com  Fri Dec  7 20:24:20 2012
From: ankh.egypt.public at googlemail.com (Adrian Helmchen)
Date: Fri, 07 Dec 2012 21:24:20 +0100
Subject: [Bioperl-l] proteins from an organism
Message-ID: <50C25074.8050703@googlemail.com>

Hello,

I would like to get all proteins from an organism, but not proteins from
chloroplasts, those with crystal structures, or similar entries.

I tried to obtain these proteins by sending the query 'Arabidopsis 
thaliana[organism]' with Bio::DB::GenBank and fetching the GI numbers from
the CDS. But on one PC I get 6000 proteins and on another PC I get 46000
proteins, although Arabidopsis thaliana has 25000 genes.

Thank you for your help.


From nikkie.vanbers at gmail.com  Mon Dec 10 08:07:27 2012
From: nikkie.vanbers at gmail.com (Nikki2)
Date: Mon, 10 Dec 2012 00:07:27 -0800 (PST)
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly
 database
Message-ID: <34761946.post@talk.nabble.com>


Hi,

I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
'Tracheophyta' that are in NCBI's assembly database. However, there are no
DocSums returned for the uid's that match the query. When I try the same
thing using the genome database it works fine.

The script that I used to do the query is at the bottom of this message. The
output I get when running the script is:

Count = 84

--------------------- WARNING ---------------------
MSG: No returned docsums.
---------------------------------------------------

I checked the @ids array and it contains the 84 uids.

My questions are as follows:

1) Is it possible to get DocSums for uids from the NCBI assembly database,
and if yes, how?
2) If not, does anyone have any suggestions how to change my script to get
the species-names that match the uids that are returned?

Thanks a lot!

Nikki


##############################################

#!/bin/perl -w

use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
                                       -db     => 'genome',
                                       -email  => 'my_email at gmail.com',
                                       -term   => 'Tracheophyta[organism]',
                                       -retmax => 5000);

print "Count = ",$factory->get_count,"\n";
my @ids = $factory->get_ids;

my $factory2 = Bio::DB::EUtilities->new(-eutil  => 'esummary',
                                        -email  => 'my_email at gmail.com',
                                        -db     => 'genome',
                                        -id     => \@ids,
                                        -retmax => 5000);
 
while (my $ds = $factory2->next_DocSum) {
    print "ID: ",$ds->get_id,"\n";
    # flattened mode, iterates through all Item objects
    while (my $item = $ds->next_Item('flattened'))  {
        # not all Items have content, so need to check...
        printf("%-20s:%s\n",$item->get_name,$item->get_content)
            if $item->get_content;
   }
    print "\n";
}


-- 
View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From cjfields at illinois.edu  Mon Dec 10 15:59:03 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 10 Dec 2012 15:59:03 +0000
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI
 assembly database
In-Reply-To: <34761946.post@talk.nabble.com>
References: <34761946.post@talk.nabble.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF4C164@CHIMBX5.ad.uillinois.edu>

Nikki, 

This is because a handful of the databases have apparently switched their docsum output completely to the DB-specific DocSum schemata (v2), which have not been implemented in Bio::EUtilities yet.  Parsing these correctly requires quite a bit of revision, since the schema differs per database, so I don't have a timeline for when this will be available; it would likely be implemented incrementally over time.  

See here for the announcement:

    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.Release_Notes

In the meantime, you can get the raw XML output for these by replacing the loop for $factory2 with:

    print $factory2->get_Response->content

chris


On Dec 10, 2012, at 2:07 AM, Nikki2 <nikkie.vanbers at gmail.com> wrote:

> Hi,
> 
> I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
> 'Tracheophyta' that are NCBI's assembly database. However, there are no
> DocSums returned for the uid's that match the query. When I try the same
> thing using the genome database it works fine.
> 
> The script that I used to do the query is at the bottom of this message. The
> output I get when running the script is:
> 
> Count = 84
> 
> --------------------- WARNING ---------------------
> MSG: No returned docsums.
> ---------------------------------------------------
> 
> I checked the @ids array and it contains the 84 uids.
> 
> My questions are as follows:
> 
> 1) Is it possible to get DocSums for uids from the NCBI assembly database,
> and if yes, how?
> 2) If not, does anyone have any suggestions how to change my script to get
> the species-names that match the uids that are returned?
> 
> Thanks a lot!
> 
> Nikki
> 
> 
> 
> 
> 
> 
> 
> ##############################################
> 
> #!/bin/perl -w
> 
> use Bio::DB::EUtilities;
> 
> my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
>                                       -db     => 'genome',
> 				       -email => 'my_email at gmail.com',
>                                       -term   => 'Tracheophyta[organism]',
>                                       -retmax => 5000);
> 
> print "Count = ",$factory->get_count,"\n";
> my @ids = $factory->get_ids;
> 
> my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary',
> 					-email=>'my_email at gmail.com',
> 					-db    => 'genome',
>                                        -id    => \@ids,
> 					ret_max=>5000);
> 
> while (my $ds = $factory2->next_DocSum) {
>    print "ID: ",$ds->get_id,"\n";
>    # flattened mode, iterates through all Item objects
>    while (my $item = $ds->next_Item('flattened'))  {
>        # not all Items have content, so need to check...
>        printf("%-20s:%s\n",$item->get_name,$item->get_content) if
> $item->get_content;
>   }
>    print "\n";
> }
> 
> 


From jason.stajich at gmail.com  Thu Dec 13 04:05:29 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 12 Dec 2012 20:05:29 -0800
Subject: [Bioperl-l] Asking
In-Reply-To: <201212131130153627348@gmail.com>
References: <201212131130153627348@gmail.com>
Message-ID: <7ED416EC-622E-4023-94B7-9A11D29929DC@gmail.com>

You want the reroot function. Have you tried reading the HOWTOs on the website already?
A node is a node in the tree. There are several functions to find a node or iterate through all the ones in the tree.
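
A rough sketch of how find_node() and reroot() can be combined, assuming the
outgroup taxon is a labelled leaf in every tree. The reroot_on_outgroup()
helper and the command-line layout are made up for this example, and it is
worth checking the reroot() documentation for how your BioPerl version
treats a leaf passed as the new root.

```perl
use strict;
use warnings;
use Bio::TreeIO;

# Reroot $tree on the node whose id is $outgroup_id.
# Returns 1 if such a node was found, 0 otherwise. (Hypothetical helper.)
sub reroot_on_outgroup {
    my ( $tree, $outgroup_id ) = @_;
    my ($node) = $tree->find_node( -id => $outgroup_id );
    return 0 unless $node;
    $tree->reroot($node);
    return 1;
}

# Usage: perl reroot.pl OUTGROUP_ID tree1.txt [tree2.txt ...]
if ( @ARGV >= 2 ) {
    my ( $outgroup_id, @filenames ) = @ARGV;
    my $out = Bio::TreeIO->new( -format => 'newick', -fh => \*STDOUT );
    for my $filename (@filenames) {
        my $treeio = Bio::TreeIO->new(
            -format => 'newick',
            -file   => $filename,
        );
        while ( my $tree = $treeio->next_tree ) {
            reroot_on_outgroup( $tree, $outgroup_id )
                or warn "No node '$outgroup_id' in $filename\n";
            $out->write_tree($tree);
        }
    }
}
```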

Sent from my iPhone-please excuse typos

--
Jason Stajich

On Dec 12, 2012, at 7:30 PM, "Xing-Xing Shen" <shenxingxing2010 at gmail.com> wrote:

> Dear Jason,
> I am a beginner with BioPerl, and I have run into a problem with how to define an outgroup for a set of Newick trees.
> 
> My code is below:
> #!/usr/bin/perl
> use Bio::TreeIO;
> use Bio::Tree::NodeI;
> use Bio::Tree::Tree;
> my @filenames = glob("*.txt");
> foreach my $filename (@filenames) {
>    my $treeio = Bio::TreeIO->new('-format' => 'newick', '-file'   => "$filename");
>    while( my $tree = $treeio->next_tree ) {
>       $tree->set_root_node("$node"); # what might $node mean?
>        ..........
>        ..........
>    }
> }
> 
> 
> With best,
> 
> Xing-Xing Shen


From j.abbott at imperial.ac.uk  Thu Dec 13 19:49:15 2012
From: j.abbott at imperial.ac.uk (James Abbott)
Date: Thu, 13 Dec 2012 19:49:15 +0000
Subject: [Bioperl-l] deobfuscator broken....
Message-ID: <50CA313B.9060904@imperial.ac.uk>

Hi All,

Don't know if any admin folk are aware, but the bioperl.org 
deobfuscator is generating internal server errors. I've also been having 
problems with broken documentation links (cpan links producing the 
wrong modules, and pdoc pages missing) but can't seem to replicate that 
problem now....

I am, for now, still obfuscated...

Cheers,
James
-- 
Dr. James Abbott
Lead Bioinformatician
Bioinformatics Support Service
Imperial College, London


From p.j.a.cock at googlemail.com  Thu Dec 13 22:52:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 13 Dec 2012 22:52:44 +0000
Subject: [Bioperl-l] deobfuscator broken....
In-Reply-To: <50CA313B.9060904@imperial.ac.uk>
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>

On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
> Hi All,
>
> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
> is generating internal server errors. I've also been having problems with
> broken documentation links (cpan links producing the wrong modules, and
> pdoc pages missing) but can't seem to replicate that problem now....
>
> I am, for now, still obfuscated...
>
> Cheers,
> James

I would guess this is a side effect of the recent server move;
CC'ing root-l in case anyone on the sys-admin team has an idea.

Peter


From cjfields at illinois.edu  Thu Dec 13 22:51:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 13 Dec 2012 22:51:50 +0000
Subject: [Bioperl-l] deobfuscator broken....
In-Reply-To: <50CA313B.9060904@imperial.ac.uk>
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF545DA@CHIMBX5.ad.uillinois.edu>

This is likely due to the back-end change in servers.  I'm not sure how this was set up but we can inquire about it.

chris

On Dec 13, 2012, at 1:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:

> Hi All,
> 
> Don't know if any admin folk are aware, but the bioperl.org deobfuscator is generating internal server errors. I've also been having problems with broken documentation links (cpan links producing the wrong modules, and pdoc pages missing) but can't seem to replicate that problem now....
> 
> I am, for now, still obfuscated...
> 
> Cheers,
> James
> -- 
> Dr. James Abbott
> Lead Bioinformatician
> Bioinformatics Support Service
> Imperial College, London


From cjfields at illinois.edu  Thu Dec 13 23:13:55 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 13 Dec 2012 23:13:55 +0000
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF546F2@CHIMBX5.ad.uillinois.edu>

On Dec 13, 2012, at 4:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>> Hi All,
>> 
>> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
>> is generating internal server errors. I've also been having problems with
>> broken documentation links (cpan links producing the wrong modules, and
>> pdoc pages missing) but can't seem to replicate that problem now....
>> 
>> I am, for now, still obfuscated...
>> 
>> Cheers,
>> James
> 
> I would guess this is a side effect from the recent server move,
> CC'ing root-l in case anyone of the sys-admin team had an idea.
> 
> Peter

Beat me by four minutes!  

The CGI code is in websites/bioperl.org/cgi/.  I'm checking on the errors now; it may take me a little time to get it back up (it was missing CGI, and now the lib path needs to be extended).

chris


From jason.stajich at gmail.com  Thu Dec 13 23:18:26 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 13 Dec 2012 15:18:26 -0800
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
Message-ID: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>

I think it uses mysql but I don't know if that was reconstituted on the new server. 

On Dec 13, 2012, at 2:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>> Hi All,
>> 
>> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
>> is generating internal server errors. I've also been having problems with
>> broken documentation links (cpan links producing the wrong modules, and
>> pdoc pages missing) but can't seem to replicate that problem now....
>> 
>> I am, for now, still obfuscated...
>> 
>> Cheers,
>> James
> 
> I would guess this is a side effect from the recent server move,
> CC'ing root-l in case anyone of the sys-admin team had an idea.
> 
> Peter
> _______________________________________________
> Root-l mailing list
> Root-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/root-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From nikkie.vanbers at gmail.com  Wed Dec  5 14:04:09 2012
From: nikkie.vanbers at gmail.com (Nikki2)
Date: Wed, 5 Dec 2012 06:04:09 -0800 (PST)
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly
 database
Message-ID: <34761946.post@talk.nabble.com>


Hi,

I'm using 'Bio::DB::EUtilities' to retrieve the names of all 'Tracheophyta'
entries in NCBI's assembly database. However, no DocSums are returned for
the UIDs that match the query. When I try the same thing using the genome
database, it works fine.

The script that I used to do the query is at the bottom of this message. The
output I get when running the script is:

Count = 84

--------------------- WARNING ---------------------
MSG: No returned docsums.
---------------------------------------------------

I checked the @ids array and it contains the 84 uids.

My questions are as follows:

1) Is it possible to get DocSums for uids from the NCBI assembly database,
and if yes, how?
2) If not, does anyone have any suggestions how to change my script to get
the species-names that match the uids that are returned?

Thanks a lot!

Nikki


##############################################

#!/bin/perl -w

use strict;
use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
                                       -db     => 'genome',
                                       -email  => 'my_email at gmail.com',
                                       -term   => 'Tracheophyta[organism]',
                                       -retmax => 5000);

print "Count = ", $factory->get_count, "\n";
my @ids = $factory->get_ids;

my $factory2 = Bio::DB::EUtilities->new(-eutil  => 'esummary',
                                        -email  => 'my_email at gmail.com',
                                        -db     => 'genome',
                                        -id     => \@ids,
                                        -retmax => 5000);

while (my $ds = $factory2->next_DocSum) {
    print "ID: ", $ds->get_id, "\n";
    # flattened mode, iterates through all Item objects
    while (my $item = $ds->next_Item('flattened')) {
        # not all Items have content, so need to check...
        printf("%-20s:%s\n", $item->get_name, $item->get_content)
            if $item->get_content;
    }
    print "\n";
}


-- 
View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From online at davemessina.com  Thu Dec 13 23:41:35 2012
From: online at davemessina.com (Dave Messina)
Date: Thu, 13 Dec 2012 18:41:35 -0500
Subject: [Bioperl-l] [Root-l]  deobfuscator broken....
In-Reply-To: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>
References: <50CA313B.9060904@imperial.ac.uk>
	<CAKVJ-_615d=97ngOK3_XadPsy_SrHS+jMyCj00ji3OTVB088=g@mail.gmail.com>
	<0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>
Message-ID: <A6ECFFCE-8274-4B60-B681-64AA04285059@davemessina.com>

It should be just (shudder) Berkeley DB.


On Dec 13, 2012, at 18:18, Jason Stajich <jason.stajich at gmail.com> wrote:

> I think it uses mysql but I don't know if that was reconstituted on the new server. 
> 
> On Dec 13, 2012, at 2:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
>> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott <j.abbott at imperial.ac.uk> wrote:
>>> Hi All,
>>> 
>>> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
>>> is generating internal server errors. I've also been having problems with
>>> broken documentation links (cpan links producing the wrong modules, and
>>> pdoc pages missing) but can't seem to replicate that problem now....
>>> 
>>> I am, for now, still obfuscated...
>>> 
>>> Cheers,
>>> James
>> 
>> I would guess this is a side effect from the recent server move,
>> CC'ing root-l in case anyone of the sys-admin team had an idea.
>> 
>> Peter
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
> 
> 


From abualiga2 at gmail.com  Tue Dec 18 22:08:51 2012
From: abualiga2 at gmail.com (galeb abu-ali)
Date: Tue, 18 Dec 2012 17:08:51 -0500
Subject: [Bioperl-l] Fwd: how to parse maf file format
In-Reply-To: <CANPzzTtgJJwmN2TyBhDb0h=nEofSh44RED-1v9VEvRG+ZazUvA@mail.gmail.com>
References: <CANPzzTtgJJwmN2TyBhDb0h=nEofSh44RED-1v9VEvRG+ZazUvA@mail.gmail.com>
Message-ID: <CANPzzTueE6TN2vofSkwAQN-RBSTZzm7UuJLVT5kEf6O+7CV2CQ@mail.gmail.com>

Hi,

I am writing a script to parse a multiple genome alignment file in MAF
format, generated by a mugsy alignment of E. coli genomes.  So far, my
script parses SNPs from synteny blocks conserved in all aligned strains,
and it excludes gaps, which is enough for a phylogenetic analysis.  I was
wondering how I can parse the remaining blocks that are not conserved in
all strains, to see what is conserved in n-1, n-2, etc. strains or unique
to each strain.  I guess this is not a BioPerl question, but it's a
Perl-for-biologists question, so I was hoping to get some insight here.  If
there is a more appropriate forum, please let me know.

Below is my code.

many thanks!

galeb

#!/usr/local/bin/perl
use Modern::Perl '2013';
use autodie;
use List::MoreUtils qw/ each_arrayref /;

# gsa 18.12.2012
# parse mugsy multiple genome alignment for SNPs in synteny blocks
# conserved in all aligned strains

=pod

##maf version=1 scoring=mugsy
a score=7891 label=40 mult=4
s O55H7_RM12579.O55H7_RM12579        1596752 7262 + 5263980 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCG
s O55H7_CB9615.O55H7_CB9615        1604426 7262 + 5386352 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_Sakai.O157H7_Sakai        1787303 7068 + 5498450 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_EDL.O157H7_EDL933        1729749 7082 + 5528445 CGGGATGCGGGAATGGGAATGCCTTGGTTGACGGGGTGGCGGAAT

a score=6756 label=41 mult=4
s O55H7_RM12579.O55H7_RM12579        1986265 6749 + 5263980 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGG
s O55H7_CB9615.O55H7_CB9615        1991733 6749 + 5386352 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_Sakai.O157H7_Sakai        3940728 6751 - 5498450 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_EDL.O157H7_EDL933        4260689 4042 - 5528445 ---------------------------------------------

=cut

my $infile = shift or die "Usage: $0 <alignment_file.maf>\n";

my %snps;
my $strains = 0;
my @alignment;
my( $score, $blkLen, $mult );
my $total_snps;
my $syn_len;
my %lengths;

open my $fh, '<', $infile;

while( <$fh> ) {
    next if /^#/;
    chomp;

    if( /^a/ ) {
        ( $score, $blkLen, $mult ) = ( split )[1,2,3];
        # length of alignment block including '-'
        $score =~ s/score\=(\d+)/$1/;
        # alignment block number; numbers ranked on alignment length
        $blkLen =~ s/label\=(\d+)/$1/;
        # number of strains aligned in block
        $mult  =~ s/mult\=(\d+)/$1/;

        # total number of strains in alignment
        $strains = $mult if $mult > $strains;
    }
    elsif( /^s/ ) { push @alignment, $_ }

    elsif( /^$/ || ! length $_ ) {
        my( @strNames, @starts, @strands, @dna_mtrx );
        # if sequence conserved in all strains
        if( $strains == @alignment ) {
            $syn_len += $score; # total aligned sequence in all strains
            for( @alignment ) {
                # name, align start, align length (w/o '-'), direction,
                # align sequence w/ '-'
                my( $name, $start, $len, $strand, $dna ) =
                    ( split /\s+/ )[ 1, 2, 3, 4, 6 ];
                #$name =~ s/.*\.(.*)/$1/; # remove duplicated strain name

                # strains are always in same order when all strains in block.
                push @strNames, $name;
                push @starts, $start;
                push @strands, $strand;
                push @dna_mtrx, [ split '', $dna ];
                # total sequence in each strain w/o '-' that is conserved
                # in all strains
                $lengths{ $name } += $len;
            }

            my $ea = each_arrayref( @dna_mtrx );
            my %gaps;
            my $cnt;
            while( my( @bases ) = $ea->() ) {
                ++$cnt;
                my %temp;
                for( 0 .. $#bases ) { # store gaps if any
                    if( $bases[$_] eq '-' ) {
                        # key is number, corresponds to index of other arrays
                        $gaps{$_}++;
                    }
                }
                # skip gaps '-'; if snp then %temp will have > 1 key
                unless( '-' ~~ @bases ) { $temp{ uc $_}++ for @bases }
                # if SNP exists, get base and position for all strains
                # in alignment
                if( keys %temp > 1 ) {
                    ++$total_snps;
                    my $pos;
                    for( 0 .. $#bases ) {
                        # genome position
                        if( $strands[$_] eq '+' ) {
                            $pos = $starts[$_] + $cnt - ( $gaps{$_} // 0 );
                        }
                        elsif( $strands[$_] eq '-' ) {
                            $pos = $starts[$_] - $cnt - ( $gaps{$_} // 0 );
                        }
                        # HoAoH
                        push @{ $snps{ $strNames[$_] } },
                            { $pos => $bases[$_] };
                    }
                }
            }
        }
        @alignment = ();
    }
}
close $fh;
#print Dumper( \%snps ); use Data::Dumper;
say "Sum length of synteny blocks conserved in all strains, including gaps: "
    . "$syn_len bp";
say "Length of conserved sequence for each strain, excluding gaps:";
for my $strain ( keys %lengths ) {
    say "$strain\t$lengths{ $strain } bp";
}

my $outfile = $infile;
$outfile =~ s/\.maf$/_snps.txt/;
open my $fh2, '>', $outfile;
say {$fh2} map{ $_ . "_base\t", $_ . "_pos\t" } keys %snps;
for my $snp ( 0 .. ( $total_snps - 1 ) ) {
    for my $strain ( keys %snps ){
        for my $href ( keys %{ $snps{ $strain }[ $snp ] } ) {
            print {$fh2} "$snps{ $strain }[ $snp ]->{ $href }\t$href\t";
            }
        }
    print {$fh2} "\n";
}
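
One possible way to get at the blocks that are not shared by every strain, as a rough sketch rather than something tested on real mugsy output: read each "a" block, collect the strain names from its "s" lines, and bin the block by how many distinct strains it contains. The assumptions here are not from the script above: each "s" line carries its sequence on the same line (as in standard MAF), the strain name is the part of the source field before the first '.', the total number of strains is supplied by hand instead of being inferred from a running maximum of mult, and the helper names read_maf_blocks and bin_blocks are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read MAF text from a filehandle and return a list of blocks.
# Each block is { score => ..., strains => { strain_name => ungapped_length } }.
# A block is closed by the next "a" line or by end of file, so a missing
# trailing blank line does not lose the last block.
sub read_maf_blocks {
    my ($fh) = @_;
    my ( @blocks, $block );
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^#/;
        if ( $line =~ /^a\b/ ) {
            push @blocks, $block if $block;
            my ($score) = $line =~ /score=(\S+)/;
            $block = { score => $score, strains => {} };
        }
        elsif ( $line =~ /^s\s/ && $block ) {
            # fields: s, source, start, size, strand, srcSize, sequence
            my ( undef, $src, undef, $size ) = split ' ', $line;
            ( my $strain = $src ) =~ s/\..*//;    # assumed "strain.contig" naming
            $block->{strains}{$strain} += $size;
        }
    }
    push @blocks, $block if $block;
    return @blocks;
}

# Group blocks by the number of distinct strains they contain.
# Returns a hash reference: { strain_count => [ block, ... ] }.
sub bin_blocks {
    my (@blocks) = @_;
    my %bins;
    for my $blk (@blocks) {
        my $n = scalar keys %{ $blk->{strains} };
        push @{ $bins{$n} }, $blk;
    }
    return \%bins;
}

# Toy input (hypothetical strains A, B, C), read from an in-memory filehandle.
my $maf = <<'MAF';
##maf version=1 scoring=mugsy
a score=100 label=1 mult=3
s A.A 0 4 + 100 ACGT
s B.B 0 4 + 100 ACGT
s C.C 0 4 + 100 ACTT

a score=50 label=2 mult=2
s A.A 10 3 + 100 ACG
s C.C 20 3 - 100 ACG

a score=10 label=3 mult=1
s B.B 50 5 + 100 ACGTA
MAF

open my $fh, '<', \$maf or die "Cannot open in-memory MAF: $!";
my %bins = %{ bin_blocks( read_maf_blocks($fh) ) };
close $fh;

my $total = 3;    # total number of strains, supplied by hand
for my $n ( sort { $b <=> $a } keys %bins ) {
    my $label = $n == $total ? 'core'
              : $n == 1      ? 'unique'
              :                'n-' . ( $total - $n );
    for my $blk ( @{ $bins{$n} } ) {
        printf "%s\t%s\n", $label, join ',', sort keys %{ $blk->{strains} };
    }
}
```

Each bin can then be handed to the same per-column SNP loop used for the fully conserved blocks, or simply summed per strain to see how much sequence is shared by n-1, n-2, ... strains or is unique to one strain.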


From sanketd at isquareit.ac.in  Mon Dec 31 06:46:41 2012
From: sanketd at isquareit.ac.in (Sanket Desai)
Date: Mon, 31 Dec 2012 12:16:41 +0530 (IST)
Subject: [Bioperl-l] Help in getting organism names of the nucleotide
	entries.
Message-ID: <26019826.10871.1356936401744.JavaMail.root@mail.isquareit.ac.in>

Hello,

With respect to the post:
http://bio.perl.org/pipermail/bioperl-l/2009-December/031831.html

When the script from that post is used with the nucleotide database, it gives the following warnings and error:

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: No linksets returned
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: NCBI esummary fatal error: Empty id list - nothing todo
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::Tools::EUtilities::parse_data /usr/share/perl5/Bio/Tools/EUtilities.pm:382
STACK: Bio::Tools::EUtilities::next_DocSum /usr/share/perl5/Bio/Tools/EUtilities.pm:964
STACK: Bio::DB::EUtilities::next_DocSum /usr/share/perl5/Bio/DB/EUtilities.pm:914
STACK: getOrgNameFrmAccession.pl:29
-----------------------------------------------------------

Please suggest the relevant changes in the above script to make it work for the nucleotide entries also.

Thanks in advance,
Regards,
Sanket


From fcyucn at gmail.com  Tue Dec 18 01:37:45 2012
From: fcyucn at gmail.com (Fengchao Yu)
Date: Tue, 18 Dec 2012 01:37:45 -0000
Subject: [Bioperl-l] Is there any module for the protein digestion?
Message-ID: <7b719317-57a3-46ef-927c-6b0508e1e62d@googlegroups.com>

I notice that Bio::Restriction::Enzyme<http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Restriction/Enzyme.pm> is 
for DNA digestion. Is there any module for protein digestion?

Thanks
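
For the common case of trypsin, the usual textbook rule is to cleave after K or R unless the next residue is P. Below is a minimal pure-Perl sketch of that rule only. The function name digest_trypsin and its optional missed-cleavage argument are illustrative and not part of any BioPerl API, and other proteases would need different patterns.

```perl
use strict;
use warnings;

# Split a protein sequence at tryptic sites: after K or R, except before P.
# With $missed > 0, also return peptides spanning up to $missed missed cleavages.
sub digest_trypsin {
    my ( $seq, $missed ) = @_;
    $missed //= 0;
    $seq = uc $seq;
    $seq =~ s/\s+//g;

    # Zero-width split: lookbehind for K/R, negative lookahead for P.
    my @frags = split /(?<=[KR])(?!P)/, $seq;

    my @peptides;
    for my $i ( 0 .. $#frags ) {
        for my $m ( 0 .. $missed ) {
            last if $i + $m > $#frags;
            push @peptides, join '', @frags[ $i .. $i + $m ];
        }
    }
    return @peptides;
}

# Example sequence made up for illustration.
print "$_\n" for digest_trypsin('MKWVTFISLLLLFSSAYSRGVFRRDTHKPSE');
```

Note that the K in "HKP" is not cleaved because it is followed by P, so the last peptide keeps it.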