From florent.angly at gmail.com Sun Dec 2 21:36:28 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Mon, 03 Dec 2012 12:36:28 +1000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
Message-ID: <50BC102C.7080902@gmail.com>

Hi all,

This is in response to Carson Holt's report that Bio::DB::Fasta does not
play well with threads: https://redmine.open-bio.org/issues/3397

The first issue is the serialization of Bio::DB::IndexedBase-inheriting
(e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
threading (for example when using Thread::Queue::Any). I implemented hooks
that make it transparent to serialize using Storable freeze() and thaw().

Another issue was the lack of communication between different
Bio::DB::IndexedBase instances, which means that an instance could easily
be writing or deleting the database that another instance is working on.
To fix this, I needed some form of locking. Some Bio::DB::IndexedBase
database backends (DB_File) have some support for locking, but
Bio::DB::IndexedBase also supports other database backends for which there
is no native locking mechanism. So, I had to come up with a more general
solution: a lock file.

I noticed that Bio::DB::SeqFeature::Store::berkeleydb has a locking
mechanism, based on flock(), which means that it does not work with
NFS-mounted filesystems. All the Bioperl-based scripts I (and most likely
many others) write run on servers that use NFS, so this support is
important. I have found only one way to do the NFS locking safely, using
File::SharedNFSLock.
It has a few downsides though: 1/ it is an external dependency, 2/ it does
not work on FAT filesystems (which should be mostly restricted to USB
sticks nowadays), where the lock is never acquired, and 3/ at the moment,
it requires a patch to work in threaded context
(https://rt.cpan.org/Public/Bug/Display.html?id=81597)

Note that while I have now added basic support for threads in
Bio::DB::IndexedBase, I still get segfaults in specific cases, for example
when returning a database or sequence object. This might be related to
this issue: https://rt.perl.org/rt3/Ticket/Display.html?id=115972.

Beyond this, the new code seems to work nicely. See the branch
https://github.com/bioperl/bioperl-live/tree/storable_db if you want to
test it yourself. For example, one can now run multiple threads, each of
them creating a Bio::DB::Fasta database from the same FASTA file: the
first thread performs the indexing while the others wait nicely for the
indexing to be finished before querying the database.

Comments welcome. Regards,

Florent

From l.m.timmermans at students.uu.nl Mon Dec 3 19:29:59 2012
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 4 Dec 2012 01:29:59 +0100
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <50BC102C.7080902@gmail.com>
References: <50BC102C.7080902@gmail.com>
Message-ID:

On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly wrote:
> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
> threading (for example when using Thread::Queue::Any). I implemented hooks
> that make it transparent to serialize using Storable freeze() and thaw().

I don't think serializing a magical thingie makes much sense. Storable
is commonly used for a lot more things than interthread communication
(e.g. network communication); this would often not work under such
circumstances.
Leon

From cjfields at illinois.edu Mon Dec 3 22:23:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 03:23:50 +0000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To:
References: <50BC102C.7080902@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>

On Dec 3, 2012, at 6:29 PM, Leon Timmermans wrote:

> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly wrote:
>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>> threading (for example when using Thread::Queue::Any). I implemented hooks
>> that make it transparent to serialize using Storable freeze() and thaw().
>
> I don't think serializing a magical thingie makes much sense. Storable
> is commonly used for a lot more things than interthread communication
> (e.g. network communication), this would often not work under such
> circumstances.
>
> Leon

Leon, any suggestions on alternatives? I know this particular bit is a
sore spot with MAKER at the moment, so any help would be greatly
appreciated.

chris

From yongli at yeslab.com Sat Dec 1 01:10:15 2012
From: yongli at yeslab.com (yongli at yeslab.com)
Date: Sat, 1 Dec 2012 14:10:15 +0800 (CST)
Subject: [Bioperl-l] question about bioperl program
Message-ID: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>

Dear Sir or Madam,

I have just begun to learn BioPerl and am using a BioPerl program to
extract the gene sequences of a bacterium from a genome GenBank file. I
downloaded the NC_003450.gbk, ppt, and ffn files. NC_003450 is a bacterial
genome GenBank file.
My BioPerl code is as follows:

use Bio::Seq;
use Bio::SeqIO;

$seqio_obj = Bio::SeqIO->new(-file => 'NC_003450.gbk', -format => 'genbank');
# $seq_obj = $seqio_obj->next_seq;

while ($seq_obj = $seqio_obj->next_seq) {
    $display_name = $seq_obj->display_name;
    $desc         = $seq_obj->desc;
    $seq          = $seq_obj->seq;
    $acc          = $seq_obj->accession_number;
    $seqio_obj    = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta');
    $seqio_obj->write_seq($seq_obj);
}

After the program ran, I just got the complete genome sequence, not a file
containing every gene sequence one by one. I want to get the gene
sequences one by one in a FASTA file. So I am writing to you for help.

Yong Li

From carsonhh at gmail.com Mon Dec 3 22:35:50 2012
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 03 Dec 2012 22:35:50 -0500
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>
Message-ID:

Bio::DB::Fasta is working for MAKER now. The previous issues have been
fixed, but since Florent has gone out of his way to build a number of
improvements into Bio::DB::Fasta over the past few weeks, this seemed like
a useful one as well, so I suggested it.

One of the big uses of Bio::DB::Fasta is the Bio::PrimarySeq::Fasta
features it creates. They are great for manipulating the sequence without
actually having to ever keep it in memory. It's nice because the sequence
is made available on demand, but when you try to pass them between
threads, your program falls apart. There are creative workarounds, but
simply adding a serialization hook to Bio::DB::Fasta to disconnect the
database on freezing and then reconnect on thaw also fixes it, and it
makes them extremely useful for multi-threaded applications without having
to go through other kinds of workarounds (it just makes them work as
expected with serialization). Previously I had created my own module and
inherited from Bio::DB::Fasta so I could implement the Storable hooks.
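
[Editorial sketch] The "disconnect on freeze, reconnect on thaw" idea
described above can be illustrated with Storable's documented
STORABLE_freeze and STORABLE_thaw hooks. The class My::IndexedHandle below
is hypothetical and only wraps a plain file handle; it is not the
Bio::DB::Fasta or Bio::DB::IndexedBase implementation, just a minimal
sketch of the same pattern under that assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(freeze thaw);
use File::Temp qw(tempfile);

# Hypothetical stand-in for an object that wraps an open file/database
# handle. It is NOT the Bio::DB::Fasta code, only the same idea in miniature.
package My::IndexedHandle;

sub new {
    my ($class, $path) = @_;
    my $self = bless { path => $path }, $class;
    $self->_connect;
    return $self;
}

# (Re)open the underlying file; dies if it is not reachable from this process.
sub _connect {
    my ($self) = @_;
    open my $fh, '<', $self->{path}
        or die "Cannot open '$self->{path}': $!";
    $self->{fh} = $fh;
    return;
}

# Read the first line of the file through the live handle.
sub first_line {
    my ($self) = @_;
    seek $self->{fh}, 0, 0;
    my $line = readline $self->{fh};
    chomp $line if defined $line;
    return $line;
}

# Called by Storable when freezing: keep only the path, drop the live handle.
sub STORABLE_freeze {
    my ($self, $cloning) = @_;
    return $self->{path};
}

# Called by Storable when thawing: $self is an empty blessed hash here.
sub STORABLE_thaw {
    my ($self, $cloning, $serialized) = @_;
    $self->{path} = $serialized;
    $self->_connect;    # throws if the file is missing after deserialization
    return;
}

package main;

# Demo with a throwaway FASTA-like file.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} ">seq1\nACGT\n";
close $out;

my $original = My::IndexedHandle->new($path);
my $copy     = thaw(freeze($original));   # round trip through Storable
print $copy->first_line, "\n";            # prints ">seq1"
```

Because the thaw hook reopens the file by path, the round trip only works
where that path is visible (same host, or a shared mount); otherwise the
die in _connect fires at thaw time, which is the failure mode discussed in
this thread.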
Because Storable looks for the hooks in anything it serializes, the
Bio::DB::Fasta object can even be well down inside of a complex object and
you don't have to worry about it. Previously I've used Storable hooks to
pass the Bio::PrimarySeq::Fasta features across the network using MPI; as
long as the database is on an NFS mount it just reconnects on the other
node with no issue. If the indexed file isn't available after
deserialization over a network, you could just throw an error when the
thaw hook is called.

I'll give Florent's changes a look over soon to give any suggestions.

Thanks,
Carson

On 12-12-03 10:23 PM, "Fields, Christopher J" wrote:

>On Dec 3, 2012, at 6:29 PM, Leon Timmermans wrote:
>
>> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly wrote:
>>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>>> threading (for example when using Thread::Queue::Any). I implemented
>>> hooks that make it transparent to serialize using Storable freeze() and
>>> thaw().
>>
>> I don't think serializing a magical thingie makes much sense. Storable
>> is commonly used for a lot more things than interthread communication
>> (e.g. network communication), this would often not work under such
>> circumstances.
>>
>> Leon
>
>Leon, any suggestions on alternatives? I know this particular bit is a
>sore spot with MAKER at the moment, so any help would be greatly
>appreciated.
>
>chris
>

From jason.r.gallant at gmail.com Tue Dec 4 15:23:02 2012
From: jason.r.gallant at gmail.com (Jason Gallant)
Date: Tue, 4 Dec 2012 12:23:02 -0800 (PST)
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header
Message-ID:

Hello,

I'm trying to retrieve fasta sequences that contain a colon in their
header. However, I cannot get my BioPerl script to do this!!

It works as expected when the header does not contain the colon, however
doesn't return anything when it does.
Weirdly, when I ask it to return the parsed IDs (see below), it returns
the appropriate IDs, which include the colon! Very confusing, would
appreciate any help!!

Many Thanks,
Jason Gallant


use strict;
use Bio::SearchIO;
use Bio::DB::Fasta;

my ($file, $id, $start, $end) =
    ("secondround_merged_expanded.fasta", "C7047455:0-100", 1, 10);

my $db  = Bio::DB::Fasta->new($file, -reindex => 1);
my $seq = $db->seq($id, $start, $end);

print $db->ids;

print $seq, "\n";

From asjo at koldfront.dk Tue Dec 4 15:53:08 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 21:53:08 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> (Francesco Musacchia's message of "Wed, 28 Nov 2012 02:27:16 -0800 (PST)")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
Message-ID: <87y5hdletn.fsf@topper.koldfront.dk>

On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:

> I'm experiencing that when I have to do a lot of accesses on a GFF
> database (with Bio::DB::SeqFeature::Store) the slowness increase until
> my script can stay running for more than a day.

First you'll need to find out what/where exactly it is slow. One way to
do so is using a profiler; this is a good one for Perl:

 * https://metacpan.org/module/Devel::NYTProf

If you want more specific suggestions, you'll probably have to provide
more information.


  Good luck!

   Adam

-- 
 "As Knuth pointed out long ago, speed only matters          Adam Sjøgren
  in certain critical bottlenecks. And as many          asjo at koldfront.dk
  programmers have observed since, one is very often
  mistaken about where these bottlenecks are."
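
[Editorial sketch] Devel::NYTProf, recommended above, is normally run as
"perl -d:NYTProf script.pl" followed by "nytprofhtml" to build a report.
When installing a profiler is not an option, a coarser substitute (a
different technique, not NYTProf itself) is to wrap suspected hot spots in
wall-clock timers using the core module Time::HiRes. The helper timed()
and the workload slow_lookup() below are hypothetical names invented for
this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

my %elapsed;   # label => accumulated wall-clock seconds
my %calls;     # label => number of timed calls

# Run a code ref, charge its wall-clock time to $label, pass the result on.
sub timed {
    my ($label, $code) = @_;
    my $t0     = time;
    my @result = $code->();
    $elapsed{$label} += time - $t0;
    $calls{$label}++;
    return wantarray ? @result : $result[0];
}

# Hypothetical stand-in for an expensive operation such as a database query.
sub slow_lookup {
    my ($n) = @_;
    my $sum = 0;
    $sum += $_ for 1 .. $n;
    return $sum;
}

for my $i (1 .. 100) {
    my $value = timed('lookup', sub { slow_lookup(10_000) });
    my $text  = timed('render', sub { sprintf '%05d:%d', $i, $value });
}

# Report the sections that consumed the most time first.
for my $label (sort { $elapsed{$b} <=> $elapsed{$a} } keys %elapsed) {
    printf "%-10s %4d calls  %.4f s\n", $label, $calls{$label}, $elapsed{$label};
}
```

This only measures the sections you chose to wrap, so it can confirm or
rule out a suspect but will not find an unexpected bottleneck the way a
real profiler does.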
From cjfields at illinois.edu Tue Dec 4 16:10:00 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 21:10:00 +0000
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <87y5hdletn.fsf@topper.koldfront.dk>
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> <87y5hdletn.fsf@topper.koldfront.dk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>

On Dec 4, 2012, at 2:53 PM, Adam Sjøgren wrote:

> On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:
>
>> I'm experiencing that when I have to do a lot of accesses on a GFF
>> database (with Bio::DB::SeqFeature::Store) the slowness increase until
>> my script can stay running for more than a day.
>
> First you'll need to find out what/where exactly it is slow. One way to
> do so is using a profiler; this is a good one for Perl:
>
> * https://metacpan.org/module/Devel::NYTProf
>
> If you want more specific suggestions, you'll probably have to provide
> more information.
>
>
> Good luck!
>
> Adam

If anything, we need more profiling of Bioperl code. Ah, if we only had
infinite time... :)

chris

From asjo at koldfront.dk Tue Dec 4 16:33:55 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 22:33:55 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu> (Christopher J. Fields's message of "Tue, 4 Dec 2012 21:10:00 +0000")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> <87y5hdletn.fsf@topper.koldfront.dk> <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
Message-ID: <87txs1jyd8.fsf@topper.koldfront.dk>

On Tue, 4 Dec 2012 21:10:00 +0000, Fields, wrote:

> If anything, we need more profiling of Bioperl code. Ah, if we only
> had infinite time... :)

If we had that, we wouldn't need profiling! ;-),

   Adam

-- 
 "On the quiet side. Somewhat peculiar.
  A good companion, in a weird sort of way."
                                                             Adam Sjøgren
                                                        asjo at koldfront.dk

From florent.angly at gmail.com Tue Dec 4 16:52:41 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Wed, 05 Dec 2012 07:52:41 +1000
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header
In-Reply-To:
References:
Message-ID: <50BE70A9.4060404@gmail.com>

Hi Jason,

See the documentation for seq() at
http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/DB/Fasta.pm#OBJECT_METHODS .

When you call seq() with a single argument, e.g.
$db->seq('C7047455:0-100'), Bio::DB::Fasta interprets it as a compound ID
and looks for positions 0 to 100 of a sequence called C7047455. This is a
feature that has been in Bio::DB::Fasta since the dawn of time. In this
form, seq() expects a colon as part of the compound ID, which is
problematic because your sequence ID actually contains a colon.

I think that when you call $db->seq($id,$start,$end), Bio::DB::Fasta does
not attempt to parse your ID. This is why your code works with this form.

Note that if you want to get the entirety of a sequence called
'C7047455:0-100', the easiest approach when your sequence names contain
colons is to use $db->get_Seq_by_id('C7047455:0-100'), since
get_Seq_by_id() only takes a regular ID (not a compound one).

Florent

On 05/12/12 06:23, Jason Gallant wrote:
> Hello,
>
> I'm trying to retrieve fasta sequences that contain a colon in their
> header. However, I cannot get my BioPerl script to do this!!
>
> It works as expected when the header does not contain the colon, however
> doesn't return anything when it does. Weirdly, when I ask it to return the
> parsed IDs (see below), it returns the appropriate IDs, which include the
> colon! Very confusing, would appreciate any help!!
>
> Many Thanks,
> Jason Gallant
>
>
> use strict;
> use Bio::SearchIO;
> use Bio::DB::Fasta;
>
>
> my ($file,$id,$start,$end) =
> ("secondround_merged_expanded.fasta","C7047455:0-100",1,10);
>
>
> my $db = Bio::DB::Fasta->new($file, -reindex=>1);
> my $seq = $db->seq($id,$start,$end);
>
> print $db->ids;
>
> print $seq,"\n";
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From bosborne11 at verizon.net Tue Dec 4 17:12:59 2012
From: bosborne11 at verizon.net (Brian Osborne)
Date: Tue, 04 Dec 2012 17:12:59 -0500
Subject: [Bioperl-l] question about bioperl program
In-Reply-To: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>
References: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>
Message-ID: <16BBC477-9935-4C79-A70D-6B18716089FB@verizon.net>

Yong Li,

You want to take a look at this HOWTO:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Those genes you see in the file are features in the genome sequence.

Brian O.

On Dec 1, 2012, at 1:10 AM, yongli at yeslab.com wrote:

> Dear Sir or Madam,
>
> I just begin to learn bioperl and am using a bioperl program to extract
> a kind of bacterial genes sequences from genome genbank file. I download
> NC_003450.gbk, ppt, ffn files. NC_003450 is a kind of bacterial genome
> genbank file.
> My bioperl code as follows:
>
> use Bio::Seq;
> use Bio::SeqIO;
>
> $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank');
> # $seq_obj=$seqio_obj->next_seq;
>
> while($seq_obj=$seqio_obj->next_seq)
> {
> $display_name=$seq_obj->display_name;
> $desc=$seq_obj->desc;
> $seq=$seq_obj->seq;
> $acc = $seq_obj->accession_number;
> $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );
> $seqio_obj->write_seq($seq_obj);
> }
>
> After the program runned I just gain a genome complete sequence without
> a file including every gene sequence one by one. I want to get the genes
> sequences one by one in a fasta files. So I write you for help.
>
> Yong Li

From ankh.egypt.public at googlemail.com Fri Dec 7 15:24:20 2012
From: ankh.egypt.public at googlemail.com (Adrian Helmchen)
Date: Fri, 07 Dec 2012 21:24:20 +0100
Subject: [Bioperl-l] proteins from an organism
Message-ID: <50C25074.8050703@googlemail.com>

Hello,

I would like to get all proteins from an organism, but not proteins from
chloroplasts or with crystal structures or anything else. I tried to
obtain these proteins by sending the query 'Arabidopsis
thaliana[organism]' with Bio::DB::GenBank and fetching the GI numbers from
the CDS. But on one PC I get 6000 proteins and on another PC I get 46000
proteins, although Arabidopsis thaliana has 25000 genes.

Thank you for your help.

From nikkie.vanbers at gmail.com Mon Dec 10 03:07:27 2012
From: nikkie.vanbers at gmail.com (Nikki2)
Date: Mon, 10 Dec 2012 00:07:27 -0800 (PST)
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly database
Message-ID: <34761946.post@talk.nabble.com>

Hi,

I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
'Tracheophyta' that are in NCBI's assembly database.
However, there are no DocSums returned for the uids that match the query.
When I try the same thing using the genome database it works fine.

The script that I used to do the query is at the bottom of this message.
The output I get when running the script is:

Count = 84

--------------------- WARNING ---------------------
MSG: No returned docsums.
---------------------------------------------------

I checked the @ids array and it contains the 84 uids.

My questions are as follows:

1) Is it possible to get DocSums for uids from the NCBI assembly database,
and if yes, how?
2) If not, does anyone have any suggestions how to change my script to get
the species names that match the uids that are returned?

Thanks a lot!

Nikki

##############################################

#!/bin/perl -w

use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
                                       -db     => 'genome',
                                       -email  => 'my_email at gmail.com',
                                       -term   => 'Tracheophyta[organism]',
                                       -retmax => 5000);

print "Count = ",$factory->get_count,"\n";
my @ids = $factory->get_ids;

my $factory2 = Bio::DB::EUtilities->new(-eutil  => 'esummary',
                                        -email  => 'my_email at gmail.com',
                                        -db     => 'genome',
                                        -id     => \@ids,
                                        -retmax => 5000);

while (my $ds = $factory2->next_DocSum) {
    print "ID: ",$ds->get_id,"\n";
    # flattened mode, iterates through all Item objects
    while (my $item = $ds->next_Item('flattened')) {
        # not all Items have content, so need to check...
        printf("%-20s:%s\n",$item->get_name,$item->get_content)
            if $item->get_content;
    }
    print "\n";
}

--
View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
From cjfields at illinois.edu Mon Dec 10 10:59:03 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 10 Dec 2012 15:59:03 +0000
Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly database
In-Reply-To: <34761946.post@talk.nabble.com>
References: <34761946.post@talk.nabble.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF4C164@CHIMBX5.ad.uillinois.edu>

Nikki,

This is because a handful of the databases have apparently switched their
docsum output completely to the DB-specific DocSum schemata (v2), which
have not been implemented in Bio::EUtilities as of yet. Parsing these
correctly requires quite a bit of revision, as the schema is per database,
so I don't have a timeline for when this will be available; it would
likely be implemented incrementally over time. See here for the
announcement:

http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.Release_Notes

In the meantime, you can get the raw XML output for these by replacing the
loop for $factory2 with:

print $factory2->get_Response->content

chris

On Dec 10, 2012, at 2:07 AM, Nikki2 wrote:

> Hi,
>
> I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from
> 'Tracheophyta' that are NCBI's assembly database. However, there are no
> DocSums returned for the uid's that match the query. When I try the same
> thing using the genome database it works fine.
>
> The script that I used to do the query is at the bottom of this message. The
> output I get when running the script is:
>
> Count = 84
>
> --------------------- WARNING ---------------------
> MSG: No returned docsums.
> ---------------------------------------------------
>
> I checked the @ids array and it contains the 84 uids.
>
> My questions are as follows:
>
> 1) Is it possible to get DocSums for uids from the NCBI assembly database,
> and if yes, how?
> 2) If not, does anyone have any suggestions how to change my script to get
> the species-names that match the uids that are returned?
>
> Thanks a lot!
>
> Nikki
>
> ##############################################
>
> #!/bin/perl -w
>
> use Bio::DB::EUtilities;
>
> my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
> -db => 'genome',
> -email => 'my_email at gmail.com',
> -term => 'Tracheophyta[organism]',
> -retmax => 5000);
>
> print "Count = ",$factory->get_count,"\n";
> my @ids = $factory->get_ids;
>
> my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary',
> -email=>'my_email at gmail.com',
> -db => 'genome',
> -id => \@ids,
> -retmax => 5000);
>
> while (my $ds = $factory2->next_DocSum) {
> print "ID: ",$ds->get_id,"\n";
> # flattened mode, iterates through all Item objects
> while (my $item = $ds->next_Item('flattened')) {
> # not all Items have content, so need to check...
> printf("%-20s:%s\n",$item->get_name,$item->get_content) if
> $item->get_content;
> }
> print "\n";
> }
>
>
> --
> View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From jason.stajich at gmail.com Wed Dec 12 23:05:29 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 12 Dec 2012 20:05:29 -0800
Subject: [Bioperl-l] Asking
In-Reply-To: <201212131130153627348@gmail.com>
References: <201212131130153627348@gmail.com>
Message-ID: <7ED416EC-622E-4023-94B7-9A11D29929DC@gmail.com>

You want the reroot function. Have you tried reading the howtos on the
website already? Node is a node in the tree. There are several functions
to find a node or iterate through all the ones in the tree.

Sent from my iPhone-please excuse typos

--
Jason Stajich

On Dec 12, 2012, at 7:30 PM, "Xing-Xing Shen" wrote:

> Dear Jason
> I am a green hand in learning Bioperl.
> Now, I met a problem about how to define outgroup for a set of newick
> trees.
>
> My codes below:
> #!/usr/bin/perl
> use Bio::TreeIO;
> use Bio::Tree::NodeI;
> use Bio::Tree::Tree;
> my @filenames = glob("*.txt");
> foreach my $filename (@filenames) {
>     my $treeio = Bio::TreeIO->new('-format' => 'newick', '-file' => "$filename");
>     while( my $tree = $treeio->next_tree ) {
>         $tree->set_root_node("$node"); # what might $node mean?
>         ..........
>         ..........
>     }
> }
>
>
> With best,
>
> Xing-Xing Shen

From j.abbott at imperial.ac.uk Thu Dec 13 14:49:15 2012
From: j.abbott at imperial.ac.uk (James Abbott)
Date: Thu, 13 Dec 2012 19:49:15 +0000
Subject: [Bioperl-l] deobfuscator broken....
Message-ID: <50CA313B.9060904@imperial.ac.uk>

Hi All,

Don't know if any admin folk are aware, but the bioperl.org deobfuscator
is generating internal server errors. I've also been having problems with
broken documentation links (cpan links producing the wrong modules, and
pdoc pages missing) but can't seem to replicate that problem now....

I am, for now, still obfuscated...

Cheers,
James
--
Dr. James Abbott
Lead Bioinformatician
Bioinformatics Support Service
Imperial College, London

From p.j.a.cock at googlemail.com Thu Dec 13 17:52:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 13 Dec 2012 22:52:44 +0000
Subject: [Bioperl-l] deobfuscator broken....
In-Reply-To: <50CA313B.9060904@imperial.ac.uk>
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID:

On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote:
> Hi All,
>
> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
> is generating internal server errors. I've also been having problems with
> broken documentation links (cpan links producing the wrong modules, and
> pdoc pages missing) but can't seem to replicate that problem now....
>
> I am, for now, still obfuscated...
>
> Cheers,
> James

I would guess this is a side effect of the recent server move. CC'ing
root-l in case anyone on the sys-admin team has an idea.

Peter

From cjfields at illinois.edu Thu Dec 13 17:51:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 13 Dec 2012 22:51:50 +0000
Subject: [Bioperl-l] deobfuscator broken....
In-Reply-To: <50CA313B.9060904@imperial.ac.uk>
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF545DA@CHIMBX5.ad.uillinois.edu>

This is likely due to the back-end change in servers. I'm not sure how
this was set up but we can inquire about it.

chris

On Dec 13, 2012, at 1:49 PM, James Abbott wrote:

> Hi All,
>
> Don't know if any admin folk are aware, but the bioperl.org
> deobfuscator is generating internal server errors. I've also been having
> problems with broken documentation links (cpan links producing the wrong
> modules, and pdoc pages missing) but can't seem to replicate that problem
> now....
>
> I am, for now, still obfuscated...
>
> Cheers,
> James
> --
> Dr. James Abbott
> Lead Bioinformatician
> Bioinformatics Support Service
> Imperial College, London

From cjfields at illinois.edu Thu Dec 13 18:13:55 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 13 Dec 2012 23:13:55 +0000
Subject: [Bioperl-l] [Root-l] deobfuscator broken....
In-Reply-To:
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF546F2@CHIMBX5.ad.uillinois.edu>

On Dec 13, 2012, at 4:52 PM, Peter Cock wrote:

> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote:
>> Hi All,
>>
>> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
>> is generating internal server errors.
>> I've also been having problems with
>> broken documentation links (cpan links producing the wrong modules, and
>> pdoc pages missing) but can't seem to replicate that problem now....
>>
>> I am, for now, still obfuscated...
>>
>> Cheers,
>> James
>
> I would guess this is a side effect from the recent server move,
> CC'ing root-l in case anyone of the sys-admin team had an idea.
>
> Peter

Beat me by four minutes!

The CGI code is in websites/bioperl.org/cgi/. I'm checking on the errors
now; it may take me a little time to get it back up (it was missing CGI,
and now needs to have the lib path extended).

chris

From jason.stajich at gmail.com Thu Dec 13 18:18:26 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 13 Dec 2012 15:18:26 -0800
Subject: [Bioperl-l] [Root-l] deobfuscator broken....
In-Reply-To:
References: <50CA313B.9060904@imperial.ac.uk>
Message-ID: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>

I think it uses mysql but I don't know if that was reconstituted on the
new server.

On Dec 13, 2012, at 2:52 PM, Peter Cock wrote:

> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote:
>> Hi All,
>>
>> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
>> is generating internal server errors. I've also been having problems with
>> broken documentation links (cpan links producing the wrong modules, and
>> pdoc pages missing) but can't seem to replicate that problem now....
>>
>> I am, for now, still obfuscated...
>>
>> Cheers,
>> James
>
> I would guess this is a side effect from the recent server move,
> CC'ing root-l in case anyone of the sys-admin team had an idea.
>
> Peter
> _______________________________________________
> Root-l mailing list
> Root-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/root-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org

From online at davemessina.com Thu Dec 13 18:41:35 2012
From: online at davemessina.com (Dave Messina)
Date: Thu, 13 Dec 2012 18:41:35 -0500
Subject: [Bioperl-l] [Root-l] deobfuscator broken....
In-Reply-To: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>
References: <50CA313B.9060904@imperial.ac.uk> <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com>
Message-ID:

It should be just (shudder) Berkeley DB.

On Dec 13, 2012, at 18:18, Jason Stajich wrote:

> I think it uses mysql but I don't know if that was reconstituted on the
> new server.
>
> On Dec 13, 2012, at 2:52 PM, Peter Cock wrote:
>
>> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote:
>>> Hi All,
>>>
>>> Don't know if any admin folk are aware, but the bioperl.org deobfuscator
>>> is generating internal server errors. I've also been having problems with
>>> broken documentation links (cpan links producing the wrong modules, and
>>> pdoc pages missing) but can't seem to replicate that problem now....
>>> >>> I am, for now, still obfuscated... >>> >>> Cheers, >>> James >> >> I would guess this is a side effect from the recent server move, >> CC'ing root-l in case anyone of the sys-admin team had an idea. >> >> Peter >> _______________________________________________ >> Root-l mailing list >> Root-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/root-l > > Jason Stajich > jason.stajich at gmail.com > jason at bioperl.org > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From abualiga2 at gmail.com Tue Dec 18 17:08:51 2012 From: abualiga2 at gmail.com (galeb abu-ali) Date: Tue, 18 Dec 2012 17:08:51 -0500 Subject: [Bioperl-l] Fwd: how to parse maf file format In-Reply-To: References: Message-ID: Hi, I am writing a script to parse a multiple genome alignment file in maf format, generated with mugsy alignment of e.coli genomes. So far, my script parses SNPs from synteny blocks conserved in all aligned strains, and it excludes gaps, which is enough for a phylogenetic analyses. I was wondering how can I parse the remaining blocks that are not conserved in all strains, to see what is conserved in n-1, n-2, etc. strains or unique to each strain. I guess this is not a BioPerl question, but it's a Perl for biologists question so I was hoping to get some insight here. If there is a more appropriate forum, please let me know. Below is my code. many thanks! 
galeb

#!/usr/local/bin/perl
use Modern::Perl '2013';
use autodie;
use List::MoreUtils qw/ each_arrayref /;

# gsa 18.12.2012
# parse mugsy multiple genome alignment for SNPs in synteny blocks conserved in all aligned strains

=head
##maf version=1 scoring=mugsy
a score=7891 label=40 mult=4
s O55H7_RM12579.O55H7_RM12579 1596752 7262 + 5263980 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCG
s O55H7_CB9615.O55H7_CB9615 1604426 7262 + 5386352 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_Sakai.O157H7_Sakai 1787303 7068 + 5498450 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_EDL.O157H7_EDL933 1729749 7082 + 5528445 CGGGATGCGGGAATGGGAATGCCTTGGTTGACGGGGTGGCGGAAT

a score=6756 label=41 mult=4
s O55H7_RM12579.O55H7_RM12579 1986265 6749 + 5263980 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGG
s O55H7_CB9615.O55H7_CB9615 1991733 6749 + 5386352 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_Sakai.O157H7_Sakai 3940728 6751 - 5498450 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_EDL.O157H7_EDL933 4260689 4042 - 5528445 ---------------------------------------------
=cut

my $infile = shift or die "Usage: $0 \n";
my %snps;
my $strains = 0;
my @alignment;
my( $score, $blkLen, $mult );
my $total_snps;
my $syn_len;
my %lengths;

open my $fh, '<', $infile;
while( <$fh> ) {
    next if /^#/;
    chomp;
    if( /^a/ ) {
        ( $score, $blkLen, $mult ) = ( split )[1,2,3];
        $score =~ s/score\=(\d+)/$1/; # length of alignment block including '-'
        $blkLen =~ s/label\=(\d+)/$1/; # alignment block number; numbers ranked on alignment length
        $mult =~ s/mult\=(\d+)/$1/; # number of strains aligned in block
        $strains = $mult if $mult > $strains; # total number of strains in alignment
    }
    elsif( /^s/ ) {
        push @alignment, $_
    }
    elsif( /^$/ || ! length $_ ) {
        my( @strNames, @starts, @strands, @dna_mtrx );
        # if sequence conserved in all strains
        if( $strains == @alignment ) {
            $syn_len += $score; # total aligned sequence in all strains
            for( @alignment ) {
                # name, align start, align length (w/o '-'), direction, align sequence w/ '-'
                my( $name, $start, $len, $strand, $dna ) = ( split /\s+/ )[ 1, 2, 3, 4, 6 ];
                #$name =~ s/.*\.(.*)/$1/; # remove duplicated strain name
                # strains are always in same order when all strains in block.
                push @strNames, $name;
                push @starts, $start;
                push @strands, $strand;
                push @dna_mtrx, [ split '', $dna ];
                # total sequence in each strain w/o '-' that is conserved in all strains
                $lengths{ $name } += $len;
            }
            my $ea = each_arrayref( @dna_mtrx );
            my %gaps;
            my $cnt;
            while( my( @bases ) = $ea->() ) {
                ++$cnt;
                my %temp;
                for( 0 .. $#bases ) {
                    # store gaps if any
                    if( $bases[$_] eq '-' ) {
                        $gaps{$_}++; # key is number, corresponds to index of other arrays
                    }
                }
                # skip gaps '-'
                unless( '-' ~~ @bases ) {
                    $temp{ uc $_}++ for @bases
                }
                # if snp then %temp will have > 1 key
                if( keys %temp > 1 ) {
                    # if SNP exists, get base and position for all strains in alignment
                    ++$total_snps;
                    my $pos;
                    for( 0 .. $#bases ) {
                        # genome position
                        if( $strands[$_] eq '+' ) {
                            $pos = $starts[$_] + $cnt - ( $gaps{$_} // 0 )
                        }
                        elsif( $strands[$_] eq '-' ) {
                            $pos = $starts[$_] - $cnt - ( $gaps{$_} // 0 )
                        }
                        # HoAoH
                        push @{ $snps{ $strNames[$_] } }, { $pos => $bases[$_] };
                    }
                }
            }
        }
        @alignment = ();
    }
}
close $fh;

#print Dumper( \%snps ); use Data::Dumper;

say "Sum length of synteny blocks conserved in all strains, including gaps: $syn_len bp";
say "Length of conserved sequence for each strain, excluding gaps:";
for my $strain ( keys %lengths ) {
    say "$strain\t$lengths{ $strain } bp";
}

my $outfile = $infile;
$outfile =~ s/\.maf$/_snps.txt/;
open my $fh2, '>', $outfile;
say {$fh2} map{ $_ . "_base\t", $_ . "_pos\t" } keys %snps;
for my $snp ( 0 .. ( $total_snps - 1 ) ) {
    for my $strain ( keys %snps ){
        for my $href ( keys %{ $snps{ $strain }[ $snp ] } ) {
            print {$fh2} "$snps{ $strain }[ $snp ]->{ $href }\t$href\t";
        }
    }
    print {$fh2} "\n";
}

From sanketd at isquareit.ac.in Mon Dec 31 01:46:41 2012
From: sanketd at isquareit.ac.in (Sanket Desai)
Date: Mon, 31 Dec 2012 12:16:41 +0530 (IST)
Subject: [Bioperl-l] Help in getting organism names of the nucleotide entries.
Message-ID: <26019826.10871.1356936401744.JavaMail.root@mail.isquareit.ac.in>

Hello,

With respect to the post: http://bio.perl.org/pipermail/bioperl-l/2009-December/031831.html

When the script is used with the nucleotide database, it gives the following errors:

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: No linksets returned
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: NCBI esummary fatal error: Empty id list - nothing todo
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::Tools::EUtilities::parse_data /usr/share/perl5/Bio/Tools/EUtilities.pm:382
STACK: Bio::Tools::EUtilities::next_DocSum /usr/share/perl5/Bio/Tools/EUtilities.pm:964
STACK: Bio::DB::EUtilities::next_DocSum /usr/share/perl5/Bio/DB/EUtilities.pm:914
STACK: getOrgNameFrmAccession.pl:29
-----------------------------------------------------------

Please suggest the relevant changes to the above script to make it work for the nucleotide entries as well.
Thanks in advance, Regards, Sanket

From fcyucn at gmail.com Mon Dec 17 20:37:45 2012
From: fcyucn at gmail.com (Fengchao Yu)
Date: Tue, 18 Dec 2012 01:37:45 -0000
Subject: [Bioperl-l] Is there any module for the protein digestion?
Message-ID: <7b719317-57a3-46ef-927c-6b0508e1e62d@googlegroups.com>

I notice that Bio::Restriction::Enzyme is for DNA digest? I wonder if there is any module for protein digestion? Thanks

From cjfields at illinois.edu Mon Dec 3 22:23:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 03:23:50 +0000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: References: <50BC102C.7080902@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>

On Dec 3, 2012, at 6:29 PM, Leon Timmermans wrote:
> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly wrote:
>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>> threading (for example when using Thread::Queue::Any). I implemented hooks
>> that make it transparent to serialize using Storable freeze() and thaw().
>
> I don't think serializing a magical thingie makes much sense. Storable
> is commonly used for a lot more things than interthread communication
> (e.g. network communication), this would often not work under such
> circumstances.
>
> Leon

Leon, any suggestions on alternatives? I know this particular bit is a sore spot with MAKER at the moment, so any help would be greatly appreciated.

chris

From yongli at yeslab.com Sat Dec 1 01:10:15 2012
From: yongli at yeslab.com (yongli at yeslab.com)
Date: Sat, 1 Dec 2012 14:10:15 +0800 (CST)
Subject: [Bioperl-l] question about bioperl program
Message-ID: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>

Dear Sir or Madam, I just begin to learn bioperl and am using a bioperl program to extract a kind of bacterial genes sequences from genome genbank file. I download NC_003450.gbk, ppt, ffn files. NC_003450 is a kind of bacterial genome genbank file.
My bioperl code as follows: use Bio::Seq; use Bio::SeqIO; $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank'); # $seq_obj=$seqio_obj->next_seq; while($seq_obj=$seqio_obj->next_seq) { $display_name=$seq_obj->display_name; $desc=$seq_obj->desc; $seq=$seq_obj->seq; $acc = $seq_obj->accession_number; $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' ); $seqio_obj->write_seq($seq_obj); } After the program runned I just gain a genome complete sequence without a file including every gene sequence one by one. I want to get the genes sequences one by one in a fasta files. So I write you for help. Yong Li From carsonhh at gmail.com Mon Dec 3 22:35:50 2012 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 03 Dec 2012 22:35:50 -0500 Subject: [Bioperl-l] Bio::DB::Fasta and threads In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu> Message-ID: Bio::DB::Fasta is working for maker now. The previous issues have been fixed, but being as Florent has gone out of his way to build a number of improvements into Bio::DB::Fasta over the past few weeks, this seemed like a useful one as well, so I suggested it. One of the big uses of Bio::DB::Fasta is the Bio::PrimarySeq::Fasta features it creates. They are great for manipulating the sequence without actually having to ever keep it in memory. It's nice because the sequence is made available on demand, but when you try and pass them between threads, your program falls apart. There are creative work arounds, but simply adding a serialization hook to Bio::DB::Fasta to disconnect the database on freezing and then reconnect on thaw also fixes it, and it makes them extremely useful for multi-threaded applications without having to go through other kinds of work arounds (it just makes them work as expected with serialization). Previously I had created my own module and inherited from Bio::DB::Fasta so I could implement the Storable hooks. 
Because Storable looks for the hooks in anything it serializes, the Bio::DB::Fasta object can even be well down inside of a complex object and you don't have worry about it. Previously I've used Storable hooks to pass the Bio::PrimarySeq::Fasta features across the network using MPI, as long as the database is on an NFS mount it just reconnects on the other node with no issue. If the indexed file isn't available after deserialization over a network, you could just throw an error when the thaw hook is called. I'll give Florent's changes a look over soon to give any suggestions. Thanks, Carson On 12-12-03 10:23 PM, "Fields, Christopher J" wrote: >On Dec 3, 2012, at 6:29 PM, Leon Timmermans > wrote: > >> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly >>wrote: >>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting >>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for >>> threading (for example when using Thread::Queue::Any). I implemented >>>hooks >>> that make it transparent to serialize using Storable freeze() and >>>thaw(). >> >> I don't think serializing a magical thingie makes much sense. Storable >> is commonly used for a lot more things than interthread communication >> (e.g. network communication), this would often not work under such >> circumstances. >> >> Leon > >Leon, any suggestions on alternatives? I know this particular bit is a >sore spot with MAKER at the moment, so any help would be greatly >appreciated. > >chris > From jason.r.gallant at gmail.com Tue Dec 4 15:23:02 2012 From: jason.r.gallant at gmail.com (Jason Gallant) Date: Tue, 4 Dec 2012 12:23:02 -0800 (PST) Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header Message-ID: Hello, I'm trying to retreive fasta sequences that contain a colon in their header. However, I cannot get my BioPerl script to do this!! It works as expected when the header does not contain the colon, however doesn't return anything when it does. 
Weirdly, when I ask it to return the parsed IDs (see below), it returns the appropriate IDs, which include the colon! Very confusing, would appreciate any help!!

Many Thanks,
Jason Gallant

use strict;
use Bio::SearchIO;
use Bio::DB::Fasta;

my ($file,$id,$start,$end) = ("secondround_merged_expanded.fasta","C7047455:0-100",1,10);

my $db = Bio::DB::Fasta->new($file, -reindex=>1);
my $seq = $db->seq($id,$start,$end);

print $db->ids;

print $seq,"\n";

From asjo at koldfront.dk Tue Dec 4 15:53:08 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 21:53:08 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> (Francesco Musacchia's message of "Wed, 28 Nov 2012 02:27:16 -0800 (PST)")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
Message-ID: <87y5hdletn.fsf@topper.koldfront.dk>

On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:

> I'm experiencing that when I have to do a lot of accessess on a GFF
> database (with Bio:DB::SeqFeature::Store) the slowness increase until
> my script can stay running for more than a day.

First you'll need to find out what/where exactly it is slow. One way to do so is using a profiler; this is a good one for Perl:

 * https://metacpan.org/module/Devel::NYTProf

If you want more specific suggestions, you'll probably have to provide more information.

Good luck!

Adam

--
 "As Knuth pointed out long ago, speed only matters in certain critical bottlenecks. And as many programmers have observed since, one is very often mistaken about where these bottlenecks are."
 Adam Sjøgren, asjo at koldfront.dk

From cjfields at illinois.edu Tue Dec 4 16:10:00 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 21:10:00 +0000
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <87y5hdletn.fsf@topper.koldfront.dk>
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> <87y5hdletn.fsf@topper.koldfront.dk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>

On Dec 4, 2012, at 2:53 PM, Adam Sjøgren wrote:
> On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:
>
>> I'm experiencing that when I have to do a lot of accessess on a GFF
>> database (with Bio:DB::SeqFeature::Store) the slowness increase until
>> my script can stay running for more than a day.
>
> First you'll need to find out what/where exactly it is slow. One way to
> do so is using a profiler; this is a good one for Perl:
>
> * https://metacpan.org/module/Devel::NYTProf
>
> If you want more specific suggestions, you'll probably have to provide
> more information.
>
>
> Good luck!
>
> Adam

If anything, we need more profiling of Bioperl code. Ah, if we only had infinite time... :)

chris

From asjo at koldfront.dk Tue Dec 4 16:33:55 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 22:33:55 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu> (Christopher J. Fields's message of "Tue, 4 Dec 2012 21:10:00 +0000")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> <87y5hdletn.fsf@topper.koldfront.dk> <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
Message-ID: <87txs1jyd8.fsf@topper.koldfront.dk>

On Tue, 4 Dec 2012 21:10:00 +0000, Fields, wrote:

> If anything, we need more profiling of Bioperl code. Ah, if we only
> had infinite time... :)

If we had that, we didn't need profiling!

;-),

Adam

--
 "On the quiet side. Somewhat peculiar. A good companion, in a weird sort of way."
 Adam Sjøgren, asjo at koldfront.dk

From florent.angly at gmail.com Tue Dec 4 16:52:41 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Wed, 05 Dec 2012 07:52:41 +1000
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header
In-Reply-To: References: Message-ID: <50BE70A9.4060404@gmail.com>

Hi Jason,

See the documentation for seq() at http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/DB/Fasta.pm#OBJECT_METHODS .

When you call seq() with a single argument, e.g. $db->seq('C7047455:0-100'), Bio::DB::Fasta interprets it as a compound ID and looks for positions 0 to 100 of a sequence called C7047455. This is a feature that has been in Bio::DB::Fasta since the dawn of time. In this form, seq() expects a colon as part of the compound ID, which is problematic because your sequence ID actually contains a colon.

I think that when you call $db->seq($id,$start,$end), Bio::DB::Fasta does not attempt to parse your ID. This is why your code works with this form.

Note that if you want to get the entirety of a sequence called 'C7047455:0-100', the easiest approach when your sequence names contain colons is to use $db->get_Seq_by_id('C7047455:0-100'), since get_Seq_by_id() only takes a regular ID (not a compound one).

Florent

On 05/12/12 06:23, Jason Gallant wrote:
> Hello,
>
> I'm trying to retreive fasta sequences that contain a colon in their
> header. However, I cannot get my BioPerl script to do this!!
>
> It works as expected when the header does not contain the colon, however
> doesn't return anything when it does. Weirdly, when I ask it to return the
> parsed IDs (see below), it returns the appropriate IDs, which include the
> colon! Very confusing, would appreciate any help!!
> > Many Thanks, > Jason Gallant > > > use strict; > use Bio::SearchIO; > use Bio::DB::Fasta; > > > my ($file,$id,$start,$end) = > ("secondround_merged_expanded.fasta","C7047455:0-100",1,10); > > > my $db = Bio::DB::Fasta->new($file, -reindex=>1); > my $seq = $db->seq($id,$start,$end); > > print $db->ids; > > print $seq,"\n"; > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bosborne11 at verizon.net Tue Dec 4 17:12:59 2012 From: bosborne11 at verizon.net (Brian Osborne) Date: Tue, 04 Dec 2012 17:12:59 -0500 Subject: [Bioperl-l] question about bioperl program In-Reply-To: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com> References: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com> Message-ID: <16BBC477-9935-4C79-A70D-6B18716089FB@verizon.net> Yong Li, You want to take a look at this HOWTO: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation Those genes you see in the file are features in the genome sequence. Brian O. On Dec 1, 2012, at 1:10 AM, yongli at yeslab.com wrote: > Dear Sir or Madam, > > > > I just begin to learn bioperl and am using a bioperl program to extract a kind of bacterial genes sequences from genome genbank file. I download NC_003450.gbk, ppt, ffn files. NC_003450 is a kind of bacterial genome genbank file. 
My bioperl code as follows: > > > > use Bio::Seq; > > use Bio::SeqIO; > > > > $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank'); > > # $seq_obj=$seqio_obj->next_seq; > > > > while($seq_obj=$seqio_obj->next_seq) > > { > > $display_name=$seq_obj->display_name; > > $desc=$seq_obj->desc; > > $seq=$seq_obj->seq; > > $acc = $seq_obj->accession_number; > > $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' ); > > $seqio_obj->write_seq($seq_obj); > > } > > > > After the program runned I just gain a genome complete sequence without a file including every gene sequence one by one. I want to get the genes sequences one by one in a fasta files. So I write you for help. > > > > Yong Li > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From ankh.egypt.public at googlemail.com Fri Dec 7 15:24:20 2012 From: ankh.egypt.public at googlemail.com (Adrian Helmchen) Date: Fri, 07 Dec 2012 21:24:20 +0100 Subject: [Bioperl-l] proteins from an organism Message-ID: <50C25074.8050703@googlemail.com> Hello, I would like to get all proteins from an organism but proteins from cholorplasts or with chrystal structures or something else. I tried to obtain these proteins by send a query 'Arabidopsis thaliana[organism]' with Bio::DB::GenBank and fetch the gi numbers from the cds. But on the one pc I get 6000 proteins and on another pc I get 46000 proteins although Arabidopsis thaliana has 25000 genes. Thank you for your help. From nikkie.vanbers at gmail.com Mon Dec 10 03:07:27 2012 From: nikkie.vanbers at gmail.com (Nikki2) Date: Mon, 10 Dec 2012 00:07:27 -0800 (PST) Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly database Message-ID: <34761946.post@talk.nabble.com> Hi, I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from 'Tracheophyta' that are NCBI's assembly database. 
However, there are no DocSums returned for the uid's that match the query. When I try the same thing using the genome database it works fine. The script that I used to do the query is at the bottom of this message. The output I get when running the script is: Count = 84 --------------------- WARNING --------------------- MSG: No returned docsums. --------------------------------------------------- I checked the @ids array and it contains the 84 uids. My questions are as follows: 1) Is it possible to get DocSums for uids from the NCBI assembly database, and if yes, how? 2) If not, does anyone have any suggestions how to change my script to get the species-names that match the uids that are returned? Thanks a lot! Nikki ############################################## #!/bin/perl -w use Bio::DB::EUtilities; my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', -db => 'genome', -email => 'my_email at gmail.com', -term => 'Tracheophyta[organism]', -retmax => 5000); print "Count = ",$factory->get_count,"\n"; my @ids = $factory->get_ids; my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary', -email=>'my_email at gmail.com', -db => 'genome', -id => \@ids, ret_max=>5000); while (my $ds = $factory2->next_DocSum) { print "ID: ",$ds->get_id,"\n"; # flattened mode, iterates through all Item objects while (my $item = $ds->next_Item('flattened')) { # not all Items have content, so need to check... printf("%-20s:%s\n",$item->get_name,$item->get_content) if $item->get_content; } print "\n"; } -- View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
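For the second question in the message above, one possible fallback is to pull the species names out of the raw esummary XML text instead of relying on DocSum objects. The following is only a minimal sketch under stated assumptions: the element and attribute names (DocumentSummary, uid, SpeciesName), the sample XML string, the uids 1001 and 1002, and the helper species_by_uid() are all hypothetical and chosen for illustration; they are not confirmed output of the assembly database. In a real script the text would come from the esummary response, and a proper XML parser would be more robust than regular expressions.

```perl
use strict;
use warnings;

# Hypothetical sample of raw esummary XML. The element and attribute
# names (DocumentSummary, uid, SpeciesName) and the uids are assumptions
# made for illustration only.
my $xml = <<'XML';
<eSummaryResult>
  <DocumentSummarySet status="OK">
    <DocumentSummary uid="1001">
      <SpeciesName>Arabidopsis thaliana</SpeciesName>
    </DocumentSummary>
    <DocumentSummary uid="1002">
      <SpeciesName>Oryza sativa</SpeciesName>
    </DocumentSummary>
  </DocumentSummarySet>
</eSummaryResult>
XML

# Hypothetical helper: return a hash mapping each uid to the species
# name found inside its summary block.
sub species_by_uid {
    my ($text) = @_;
    my %name_for;
    # Walk each <DocumentSummary uid="..."> ... </DocumentSummary> block.
    while ( $text =~ m{<DocumentSummary\s+uid="(\d+)">(.*?)</DocumentSummary>}gs ) {
        my ( $uid, $body ) = ( $1, $2 );
        # Record the uid only if the block carries a species name.
        if ( $body =~ m{<SpeciesName>([^<]*)</SpeciesName>} ) {
            $name_for{$uid} = $1;
        }
    }
    return %name_for;
}

my %name_for = species_by_uid($xml);
print "$_\t$name_for{$_}\n" for sort keys %name_for;
```

Keying the result by uid keeps it easy to line the names up against the @ids array returned by the esearch step.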
From cjfields at illinois.edu Mon Dec 10 10:59:03 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 10 Dec 2012 15:59:03 +0000 Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly database In-Reply-To: <34761946.post@talk.nabble.com> References: <34761946.post@talk.nabble.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF4C164@CHIMBX5.ad.uillinois.edu> Nikki, This is b/c a handful of the databases apparently have switched docsum output completely to the DB-specific DocSum schemata (v2), which have not been implemented in Bio::EUtilities as of yet. This requires quite a bit of revision to parse correctly as it's per database, so I don't have a timeline on when this would be available and would likely be incrementally implemented over time. See here for the announcement: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.Release_Notes In the meantime, you can get the raw XML output for these by replacing the loop for $factory2 with: print $factory2->get_Response->content chris On Dec 10, 2012, at 2:07 AM, Nikki2 wrote: > Hi, > > I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from > 'Tracheophyta' that are NCBI's assembly database. However, there are no > DocSums returned for the uid's that match the query. When I try the same > thing using the genome database it works fine. > > The script that I used to do the query is at the bottom of this message. The > output I get when running the script is: > > Count = 84 > > --------------------- WARNING --------------------- > MSG: No returned docsums. > --------------------------------------------------- > > I checked the @ids array and it contains the 84 uids. > > My questions are as follows: > > 1) Is it possible to get DocSums for uids from the NCBI assembly database, > and if yes, how? > 2) If not, does anyone have any suggestions how to change my script to get > the species-names that match the uids that are returned? > > Thanks a lot! 
> > Nikki > > > > > > > > ############################################## > > #!/bin/perl -w > > use Bio::DB::EUtilities; > > my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', > -db => 'genome', > -email => 'my_email at gmail.com', > -term => 'Tracheophyta[organism]', > -retmax => 5000); > > print "Count = ",$factory->get_count,"\n"; > my @ids = $factory->get_ids; > > my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary', > -email=>'my_email at gmail.com', > -db => 'genome', > -id => \@ids, > ret_max=>5000); > > while (my $ds = $factory2->next_DocSum) { > print "ID: ",$ds->get_id,"\n"; > # flattened mode, iterates through all Item objects > while (my $item = $ds->next_Item('flattened')) { > # not all Items have content, so need to check... > printf("%-20s:%s\n",$item->get_name,$item->get_content) if > $item->get_content; > } > print "\n"; > } > > > -- > View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jason.stajich at gmail.com Wed Dec 12 23:05:29 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 12 Dec 2012 20:05:29 -0800 Subject: [Bioperl-l] Asking In-Reply-To: <201212131130153627348@gmail.com> References: <201212131130153627348@gmail.com> Message-ID: <7ED416EC-622E-4023-94B7-9A11D29929DC@gmail.com> You want the reroot function. Have you tried reading the howtos on the website already. Node is a node in the tree. There are several functions to find a node or iterate through all the ones in the tree Sent from my iPhone-please excuse typos -- Jason Stajich On Dec 12, 2012, at 7:30 PM, "Xing-Xing Shen" wrote: > Drear Jason > I am a green hand in learning Bioperl. 
Now, I met a problem about how to define outgroup for a set of newick trees. > > My codes below: > #!/usr/bin/perl > use Bio::TreeIO; > use Bio::Tree::NodeI; > use Bio::Tree::Tree; > my @filenames = glob("*.txt"); > foreach my $filename (@filenames) { > my $treeio = Bio::TreeIO->new('-format' => 'newick', '-file' => "$filename"); > while( my $tree = $treeio->next_tree ) { > $tree->set_root_node("$node"); # what might $node mean? > .......... > .......... > } > } > > > With best, > > Xing-Xing Shen From j.abbott at imperial.ac.uk Thu Dec 13 14:49:15 2012 From: j.abbott at imperial.ac.uk (James Abbott) Date: Thu, 13 Dec 2012 19:49:15 +0000 Subject: [Bioperl-l] deobfuscator broken.... Message-ID: <50CA313B.9060904@imperial.ac.uk> Hi All, Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator is generating internal server errors. I've also been having problems with broken documentation links (cpan links producning the wrong modules, and pdoc pages missing) but can't seem to replicate that problem now.... I am, for now, still obfuscated... Cheers, James -- Dr. James Abbott Lead Bioinformatician Bioinformatics Support Service Imperial College, London From p.j.a.cock at googlemail.com Thu Dec 13 17:52:44 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 22:52:44 +0000 Subject: [Bioperl-l] deobfuscator broken.... In-Reply-To: <50CA313B.9060904@imperial.ac.uk> References: <50CA313B.9060904@imperial.ac.uk> Message-ID: On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: > Hi All, > > Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator > is generating internal server errors. I've also been having problems with > broken documentation links (cpan links producning the wrong modules, and > pdoc pages missing) but can't seem to replicate that problem now.... > > I am, for now, still obfuscated... 
> > Cheers, > James I would guess this is a side effect from the recent server move, CC'ing root-l in case anyone of the sys-admin team had an idea. Peter From cjfields at illinois.edu Thu Dec 13 17:51:50 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 13 Dec 2012 22:51:50 +0000 Subject: [Bioperl-l] deobfuscator broken.... In-Reply-To: <50CA313B.9060904@imperial.ac.uk> References: <50CA313B.9060904@imperial.ac.uk> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF545DA@CHIMBX5.ad.uillinois.edu> This is likely due to the back-end change in servers. I'm not sure how this was set up but we can inquire about it. chris On Dec 13, 2012, at 1:49 PM, James Abbott wrote: > Hi All, > > Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator is generating internal server errors. I've also been having problems with broken documentation links (cpan links producning the wrong modules, and pdoc pages missing) but can't seem to replicate that problem now.... > > I am, for now, still obfuscated... > > Cheers, > James > -- > Dr. James Abbott > Lead Bioinformatician > Bioinformatics Support Service > Imperial College, London > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu Dec 13 18:13:55 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 13 Dec 2012 23:13:55 +0000 Subject: [Bioperl-l] [Root-l] deobfuscator broken.... In-Reply-To: References: <50CA313B.9060904@imperial.ac.uk> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF546F2@CHIMBX5.ad.uillinois.edu> On Dec 13, 2012, at 4:52 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: >> Hi All, >> >> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator >> is generating internal server errors. 
I've also been having problems with >> broken documentation links (cpan links producning the wrong modules, and >> pdoc pages missing) but can't seem to replicate that problem now.... >> >> I am, for now, still obfuscated... >> >> Cheers, >> James > > I would guess this is a side effect from the recent server move, > CC'ing root-l in case anyone of the sys-admin team had an idea. > > Peter Beat me by four minutes! The CGI code is in websites/bioperl.org/cgi/. I'm checking on the errors now, may take me a little time to get it back up (was missing CGI, now needs to have the lib path extended). chris From jason.stajich at gmail.com Thu Dec 13 18:18:26 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Thu, 13 Dec 2012 15:18:26 -0800 Subject: [Bioperl-l] [Root-l] deobfuscator broken.... In-Reply-To: References: <50CA313B.9060904@imperial.ac.uk> Message-ID: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com> I think it uses mysql but I don't know if that was reconstituted on the new server. On Dec 13, 2012, at 2:52 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: >> Hi All, >> >> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator >> is generating internal server errors. I've also been having problems with >> broken documentation links (cpan links producning the wrong modules, and >> pdoc pages missing) but can't seem to replicate that problem now.... >> >> I am, for now, still obfuscated... >> >> Cheers, >> James > > I would guess this is a side effect from the recent server move, > CC'ing root-l in case anyone of the sys-admin team had an idea. 
> > Peter > _______________________________________________ > Root-l mailing list > Root-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/root-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From nikkie.vanbers at gmail.com Wed Dec 5 09:04:09 2012 From: nikkie.vanbers at gmail.com (Nikki2) Date: Wed, 5 Dec 2012 06:04:09 -0800 (PST) Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly database Message-ID: <34761946.post@talk.nabble.com> Hi, I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from 'Tracheophyta' that are NCBI's assembly database. However, there are no DocSums returned for the uid's that match the query. When I try the same thing using the genome database it works fine. The script that I used to do the query is at the bottom of this message. The output I get when running the script is: Count = 84 --------------------- WARNING --------------------- MSG: No returned docsums. --------------------------------------------------- I checked the @ids array and it contains the 84 uids. My questions are as follows: 1) Is it possible to get DocSums for uids from the NCBI assembly database, and if yes, how? 2) If not, does anyone have any suggestions how to change my script to get the species-names that match the uids that are returned? Thanks a lot! 
Nikki ############################################## #!/bin/perl -w use Bio::DB::EUtilities; my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', -db => 'genome', -email => 'my_email at gmail.com', -term => 'Tracheophyta[organism]', -retmax => 5000); print "Count = ",$factory->get_count,"\n"; my @ids = $factory->get_ids; my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary', -email=>'my_email at gmail.com', -db => 'genome', -id => \@ids, ret_max=>5000); while (my $ds = $factory2->next_DocSum) { print "ID: ",$ds->get_id,"\n"; # flattened mode, iterates through all Item objects while (my $item = $ds->next_Item('flattened')) { # not all Items have content, so need to check... printf("%-20s:%s\n",$item->get_name,$item->get_content) if $item->get_content; } print "\n"; } -- View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From online at davemessina.com Thu Dec 13 18:41:35 2012 From: online at davemessina.com (Dave Messina) Date: Thu, 13 Dec 2012 18:41:35 -0500 Subject: [Bioperl-l] [Root-l] deobfuscator broken.... In-Reply-To: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com> References: <50CA313B.9060904@imperial.ac.uk> <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com> Message-ID: It should be just (shudder) Berkeley DB. On Dec 13, 2012, at 18:18, Jason Stajich wrote: > I think it uses mysql but I don't know if that was reconstituted on the new server. > > On Dec 13, 2012, at 2:52 PM, Peter Cock wrote: > >> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: >>> Hi All, >>> >>> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator >>> is generating internal server errors. I've also been having problems with >>> broken documentation links (cpan links producning the wrong modules, and >>> pdoc pages missing) but can't seem to replicate that problem now.... 
>>> >>> I am, for now, still obfuscated... >>> >>> Cheers, >>> James >> >> I would guess this is a side effect from the recent server move, >> CC'ing root-l in case anyone of the sys-admin team had an idea. >> >> Peter >> _______________________________________________ >> Root-l mailing list >> Root-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/root-l > > Jason Stajich > jason.stajich at gmail.com > jason at bioperl.org > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From abualiga2 at gmail.com Tue Dec 18 17:08:51 2012 From: abualiga2 at gmail.com (galeb abu-ali) Date: Tue, 18 Dec 2012 17:08:51 -0500 Subject: [Bioperl-l] Fwd: how to parse maf file format In-Reply-To: References: Message-ID: Hi, I am writing a script to parse a multiple genome alignment file in maf format, generated with mugsy alignment of e.coli genomes. So far, my script parses SNPs from synteny blocks conserved in all aligned strains, and it excludes gaps, which is enough for a phylogenetic analyses. I was wondering how can I parse the remaining blocks that are not conserved in all strains, to see what is conserved in n-1, n-2, etc. strains or unique to each strain. I guess this is not a BioPerl question, but it's a Perl for biologists question so I was hoping to get some insight here. If there is a more appropriate forum, please let me know. Below is my code. many thanks! 
galeb

#!/usr/local/bin/perl
use Modern::Perl '2013';
use autodie;
use List::MoreUtils qw/ each_arrayref /;

# gsa 18.12.2012
# parse mugsy multiple genome alignment for SNPs in synteny blocks conserved in all aligned strains

=head

##maf version=1 scoring=mugsy
a score=7891 label=40 mult=4
s O55H7_RM12579.O55H7_RM12579 1596752 7262 + 5263980 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCG
s O55H7_CB9615.O55H7_CB9615 1604426 7262 + 5386352 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_Sakai.O157H7_Sakai 1787303 7068 + 5498450 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_EDL.O157H7_EDL933 1729749 7082 + 5528445 CGGGATGCGGGAATGGGAATGCCTTGGTTGACGGGGTGGCGGAAT

a score=6756 label=41 mult=4
s O55H7_RM12579.O55H7_RM12579 1986265 6749 + 5263980 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGG
s O55H7_CB9615.O55H7_CB9615 1991733 6749 + 5386352 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_Sakai.O157H7_Sakai 3940728 6751 - 5498450 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_EDL.O157H7_EDL933 4260689 4042 - 5528445 ---------------------------------------------

=cut

my $infile = shift or die "Usage: $0 \n";
my %snps;
my $strains = 0;
my @alignment;
my( $score, $blkLen, $mult );
my $total_snps;
my $syn_len;
my %lengths;

open my $fh, '<', $infile;
while( <$fh> ) {
    next if /^#/;
    chomp;
    if( /^a/ ) {
        ( $score, $blkLen, $mult ) = ( split )[1,2,3];
        $score =~ s/score\=(\d+)/$1/; # length of alignment block including '-'
        $blkLen =~ s/label\=(\d+)/$1/; # alignment block number; numbers ranked on alignment length
        $mult =~ s/mult\=(\d+)/$1/; # number of strains aligned in block
        $strains = $mult if $mult > $strains; # total number of strains in alignment
    }
    elsif( /^s/ ) {
        push @alignment, $_
    }
    elsif( /^$/ || ! length $_ ) {
        my( @strNames, @starts, @strands, @dna_mtrx );
        # if sequence conserved in all strains
        if( $strains == @alignment ) {
            $syn_len += $score; # total aligned sequence in all strains
            for( @alignment ) {
                # name, align start, align length (w/o '-'), direction, align sequence w/ '-'
                my( $name, $start, $len, $strand, $dna ) = ( split /\s+/ )[ 1, 2, 3, 4, 6 ];
                #$name =~ s/.*\.(.*)/$1/; # remove duplicated strain name
                # strains are always in same order when all strains in block.
                push @strNames, $name;
                push @starts, $start;
                push @strands, $strand;
                push @dna_mtrx, [ split '', $dna ];
                # total sequence in each strain w/o '-' that is conserved in all strains
                $lengths{ $name } += $len;
            }
            my $ea = each_arrayref( @dna_mtrx );
            my %gaps;
            my $cnt;
            while( my( @bases ) = $ea->() ) {
                ++$cnt;
                my %temp;
                for( 0 .. $#bases ) {
                    # store gaps if any
                    if( $bases[$_] eq '-' ) {
                        $gaps{$_}++; # key is number, corresponds to index of other arrays
                    }
                }
                # skip gaps '-'
                unless( '-' ~~ @bases ) { $temp{ uc $_}++ for @bases }
                # if snp then %temp will have > 1 key
                if( keys %temp > 1 ) {
                    # if SNP exists, get base and position for all strains in alignment
                    ++$total_snps;
                    my $pos;
                    for( 0 .. $#bases ) {
                        if( $strands[$_] eq '+' ) { $pos = $starts[$_] + $cnt - ( $gaps{$_} // 0 ) } # genome position
                        elsif( $strands[$_] eq '-' ) { $pos = $starts[$_] - $cnt - ( $gaps{$_} // 0 ) }
                        # HoAoH
                        push @{ $snps{ $strNames[$_] } }, { $pos => $bases[$_] };
                    }
                }
            }
        }
        @alignment = ();
    }
}
close $fh;

#print Dumper( \%snps );
use Data::Dumper;
say "Sum length of synteny blocks conserved in all strains, including gaps: $syn_len bp";
say "Length of conserved sequence for each strain, excluding gaps:";
for my $strain ( keys %lengths ) {
    say "$strain\t$lengths{ $strain } bp";
}

my $outfile = $infile;
$outfile =~ s/\.maf$/_snps.txt/;
open my $fh2, '>', $outfile;
say {$fh2} map{ $_ . "_base\t", $_ . "_pos\t" } keys %snps;
for my $snp ( 0 .. ( $total_snps - 1 ) ) {
    for my $strain ( keys %snps ){
        for my $href ( keys %{ $snps{ $strain }[ $snp ] } ) {
            print {$fh2} "$snps{ $strain }[ $snp ]->{ $href }\t$href\t";
        }
    }
    print {$fh2} "\n";
}

From sanketd at isquareit.ac.in Mon Dec 31 01:46:41 2012
From: sanketd at isquareit.ac.in (Sanket Desai)
Date: Mon, 31 Dec 2012 12:16:41 +0530 (IST)
Subject: [Bioperl-l] Help in getting organism names of the nucleotide entries.
Message-ID: <26019826.10871.1356936401744.JavaMail.root@mail.isquareit.ac.in>

Hello,

With respect to the post: http://bio.perl.org/pipermail/bioperl-l/2009-December/031831.html

When used for the nucleotide database it gives the following error:

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: No linksets returned
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: NCBI esummary fatal error: Empty id list - nothing todo
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::Tools::EUtilities::parse_data /usr/share/perl5/Bio/Tools/EUtilities.pm:382
STACK: Bio::Tools::EUtilities::next_DocSum /usr/share/perl5/Bio/Tools/EUtilities.pm:964
STACK: Bio::DB::EUtilities::next_DocSum /usr/share/perl5/Bio/DB/EUtilities.pm:914
STACK: getOrgNameFrmAccession.pl:29
-----------------------------------------------------------

Please suggest the relevant changes in the above script to make it work for the nucleotide entries also.
Thanks in advance,
Regards,
Sanket

From fcyucn at gmail.com Mon Dec 17 20:37:45 2012
From: fcyucn at gmail.com (Fengchao Yu)
Date: Tue, 18 Dec 2012 01:37:45 -0000
Subject: [Bioperl-l] Is there any module for the protein digestion?
Message-ID: <7b719317-57a3-46ef-927c-6b0508e1e62d@googlegroups.com>

I notice that Bio::Restriction::Enzyme is for DNA digest? I wonder if there is any module for protein digestion?

Thanks

From cjfields at illinois.edu Mon Dec 3 22:23:50 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 03:23:50 +0000
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To:
References: <50BC102C.7080902@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>

On Dec 3, 2012, at 6:29 PM, Leon Timmermans wrote:

> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly wrote:
>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting
>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for
>> threading (for example when using Thread::Queue::Any). I implemented hooks
>> that make it transparent to serialize using Storable freeze() and thaw().
>
> I don't think serializing a magical thingie makes much sense. Storable
> is commonly used for a lot more things than interthread communication
> (e.g. network communication), this would often not work under such
> circumstances.
>
> Leon

Leon, any suggestions on alternatives? I know this particular bit is a sore spot with MAKER at the moment, so any help would be greatly appreciated.

chris

From yongli at yeslab.com Sat Dec 1 01:10:15 2012
From: yongli at yeslab.com (yongli at yeslab.com)
Date: Sat, 1 Dec 2012 14:10:15 +0800 (CST)
Subject: [Bioperl-l] question about bioperl program
Message-ID: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com>

Dear Sir or Madam,

I just begin to learn bioperl and am using a bioperl program to extract a kind of bacterial genes sequences from genome genbank file. I download NC_003450.gbk, ppt, ffn files. NC_003450 is a kind of bacterial genome genbank file.
My bioperl code as follows:

use Bio::Seq;
use Bio::SeqIO;

$seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank');
# $seq_obj=$seqio_obj->next_seq;

while($seq_obj=$seqio_obj->next_seq)
{
    $display_name=$seq_obj->display_name;
    $desc=$seq_obj->desc;
    $seq=$seq_obj->seq;
    $acc = $seq_obj->accession_number;
    $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' );
    $seqio_obj->write_seq($seq_obj);
}

After the program ran I just gain a genome complete sequence without a file including every gene sequence one by one. I want to get the genes sequences one by one in a fasta file. So I write you for help.

Yong Li

From carsonhh at gmail.com Mon Dec 3 22:35:50 2012
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 03 Dec 2012 22:35:50 -0500
Subject: [Bioperl-l] Bio::DB::Fasta and threads
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu>
Message-ID:

Bio::DB::Fasta is working for maker now. The previous issues have been fixed, but being as Florent has gone out of his way to build a number of improvements into Bio::DB::Fasta over the past few weeks, this seemed like a useful one as well, so I suggested it.

One of the big uses of Bio::DB::Fasta is the Bio::PrimarySeq::Fasta features it creates. They are great for manipulating the sequence without actually having to ever keep it in memory. It's nice because the sequence is made available on demand, but when you try and pass them between threads, your program falls apart. There are creative workarounds, but simply adding a serialization hook to Bio::DB::Fasta to disconnect the database on freezing and then reconnect on thaw also fixes it, and it makes them extremely useful for multi-threaded applications without having to go through other kinds of workarounds (it just makes them work as expected with serialization).

Previously I had created my own module and inherited from Bio::DB::Fasta so I could implement the Storable hooks.
Because Storable looks for the hooks in anything it serializes, the Bio::DB::Fasta object can even be well down inside of a complex object and you don't have worry about it. Previously I've used Storable hooks to pass the Bio::PrimarySeq::Fasta features across the network using MPI, as long as the database is on an NFS mount it just reconnects on the other node with no issue. If the indexed file isn't available after deserialization over a network, you could just throw an error when the thaw hook is called. I'll give Florent's changes a look over soon to give any suggestions. Thanks, Carson On 12-12-03 10:23 PM, "Fields, Christopher J" wrote: >On Dec 3, 2012, at 6:29 PM, Leon Timmermans > wrote: > >> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly >>wrote: >>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting >>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for >>> threading (for example when using Thread::Queue::Any). I implemented >>>hooks >>> that make it transparent to serialize using Storable freeze() and >>>thaw(). >> >> I don't think serializing a magical thingie makes much sense. Storable >> is commonly used for a lot more things than interthread communication >> (e.g. network communication), this would often not work under such >> circumstances. >> >> Leon > >Leon, any suggestions on alternatives? I know this particular bit is a >sore spot with MAKER at the moment, so any help would be greatly >appreciated. > >chris > From jason.r.gallant at gmail.com Tue Dec 4 15:23:02 2012 From: jason.r.gallant at gmail.com (Jason Gallant) Date: Tue, 4 Dec 2012 12:23:02 -0800 (PST) Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header Message-ID: Hello, I'm trying to retreive fasta sequences that contain a colon in their header. However, I cannot get my BioPerl script to do this!! It works as expected when the header does not contain the colon, however doesn't return anything when it does. 
Weirdly, when I ask it to return the parsed IDs (see below), it returns the appropriate IDs, which include the colon! Very confusing, would appreciate any help!!

Many Thanks,
Jason Gallant

use strict;
use Bio::SearchIO;
use Bio::DB::Fasta;

my ($file,$id,$start,$end) = ("secondround_merged_expanded.fasta","C7047455:0-100",1,10);

my $db = Bio::DB::Fasta->new($file, -reindex=>1);
my $seq = $db->seq($id,$start,$end);

print $db->ids;

print $seq,"\n";

From asjo at koldfront.dk Tue Dec 4 15:53:08 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 21:53:08 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> (Francesco Musacchia's message of "Wed, 28 Nov 2012 02:27:16 -0800 (PST)")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com>
Message-ID: <87y5hdletn.fsf@topper.koldfront.dk>

On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:

> I'm experiencing that when I have to do a lot of accessess on a GFF
> database (with Bio:DB::SeqFeature::Store) the slowness increase until
> my script can stay running for more than a day.

First you'll need to find out what/where exactly it is slow. One way to do so is using a profiler; this is a good one for Perl:

 * https://metacpan.org/module/Devel::NYTProf

If you want more specific suggestions, you'll probably have to provide more information.

Good luck!

  Adam

--
 "As Knuth pointed out long ago, speed only matters          Adam Sjøgren
  in certain critical bottlenecks. And as many          asjo at koldfront.dk
  programmers have observed since, one is very often
  mistaken about where these bottlenecks are."
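
As a coarse complement to the profiler suggestion above: before (or alongside) a full Devel::NYTProf run, wrapping the suspect calls in a wall-clock timer can help confirm which step dominates. The sketch below uses only the core Time::HiRes module. Note that slow_lookup() and timed() are hypothetical names introduced here for illustration, and the loop body is merely a stand-in for whatever database access is being investigated; it is not BioPerl code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical stand-in for one database access; replace the body with
# the real call under investigation (for example a feature lookup).
sub slow_lookup {
    my ($n) = @_;
    my $sum = 0;
    $sum += $_ for 1 .. $n;
    return $sum;
}

# Cumulative wall-clock time and call count, keyed by a free-form label.
my %elapsed;
my %calls;

# Run a code reference, add its wall-clock time to the given label,
# and pass the code's return value through to the caller.
sub timed {
    my ($label, $code) = @_;
    my $t0     = [gettimeofday];
    my @result = $code->();
    $elapsed{$label} += tv_interval($t0);
    $calls{$label}++;
    return wantarray ? @result : $result[0];
}

# Time 50 calls to the stand-in and accumulate their results.
my $total = 0;
$total += timed('lookup', sub { slow_lookup(1000) }) for 1 .. 50;

# Report cumulative time per label, slowest first.
for my $label (sort { $elapsed{$b} <=> $elapsed{$a} } keys %elapsed) {
    printf "%-10s %4d calls %.4f s total\n",
        $label, $calls{$label}, $elapsed{$label};
}
```

Once the slow step is narrowed down this way, statement-level detail is usually obtained by running the script as `perl -d:NYTProf script.pl` and then building the report with `nytprofhtml`.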
From cjfields at illinois.edu Tue Dec 4 16:10:00 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 4 Dec 2012 21:10:00 +0000
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <87y5hdletn.fsf@topper.koldfront.dk>
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> <87y5hdletn.fsf@topper.koldfront.dk>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>

On Dec 4, 2012, at 2:53 PM, Adam Sjøgren wrote:

> On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote:
>
>> I'm experiencing that when I have to do a lot of accessess on a GFF
>> database (with Bio:DB::SeqFeature::Store) the slowness increase until
>> my script can stay running for more than a day.
>
> First you'll need to find out what/where exactly it is slow. One way to
> do so is using a a profiler; this is a good one for Perl:
>
> * https://metacpan.org/module/Devel::NYTProf
>
> If you want more specific suggestions, you'll probably have to provide
> more information.
>
>
> Good luck!
>
> Adam

If anything, we need more profiling of Bioperl code. Ah, if we only had infinite time... :)

chris

From asjo at koldfront.dk Tue Dec 4 16:33:55 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Tue, 04 Dec 2012 22:33:55 +0100
Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu> (Christopher J. Fields's message of "Tue, 4 Dec 2012 21:10:00 +0000")
References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> <87y5hdletn.fsf@topper.koldfront.dk> <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu>
Message-ID: <87txs1jyd8.fsf@topper.koldfront.dk>

On Tue, 4 Dec 2012 21:10:00 +0000, Fields, wrote:

> If anything, we need more profiling of Bioperl code. Ah, if we only
> had infinite time... :)

If we had that, we wouldn't need profiling!

 ;-),

  Adam

--
 "On the quiet side. Somewhat peculiar. A good               Adam Sjøgren
  companion, in a weird sort of way."                   asjo at koldfront.dk

From florent.angly at gmail.com Tue Dec 4 16:52:41 2012
From: florent.angly at gmail.com (Florent Angly)
Date: Wed, 05 Dec 2012 07:52:41 +1000
Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header
In-Reply-To:
References:
Message-ID: <50BE70A9.4060404@gmail.com>

Hi Jason,

See the documentation for seq() at http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/DB/Fasta.pm#OBJECT_METHODS .

When you call seq() with a single argument, e.g. $db->seq('C7047455:0-100'), Bio::DB::Fasta interprets it as a compound ID and looks for position 0 to 100 of a sequence called C7047455. This is a feature that has been in Bio::DB::Fasta since the dawn of time. In this form, seq() expects a colon as part of the compound ID, which is problematic because your sequence ID actually contains a colon.

I think that when you call $db->seq($id,$start,$end), Bio::DB::Fasta does not attempt to parse your ID. This is why your code works with this form.

Note that if you want to get the entirety of a sequence called 'C7047455:0-100', the easiest approach when your sequence names contain a colon is to use $db->get_Seq_by_id('C7047455:0-100'), since get_Seq_by_id() only takes a regular ID (not a compound one).

Florent

On 05/12/12 06:23, Jason Gallant wrote:
> Hello,
>
> I'm trying to retreive fasta sequences that contain a colon in their
> header. However, I cannot get my BioPerl script to do this!!
>
> It works as expected when the header does not contain the colon, however
> doesn't return anything when it does. Weirdly, when I ask it to return the
> parsed IDs (see below), it returns the appropriate IDs, which include the
> colon! Very confusing, would appreciate any help!!
> > Many Thanks, > Jason Gallant > > > use strict; > use Bio::SearchIO; > use Bio::DB::Fasta; > > > my ($file,$id,$start,$end) = > ("secondround_merged_expanded.fasta","C7047455:0-100",1,10); > > > my $db = Bio::DB::Fasta->new($file, -reindex=>1); > my $seq = $db->seq($id,$start,$end); > > print $db->ids; > > print $seq,"\n"; > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bosborne11 at verizon.net Tue Dec 4 17:12:59 2012 From: bosborne11 at verizon.net (Brian Osborne) Date: Tue, 04 Dec 2012 17:12:59 -0500 Subject: [Bioperl-l] question about bioperl program In-Reply-To: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com> References: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com> Message-ID: <16BBC477-9935-4C79-A70D-6B18716089FB@verizon.net> Yong Li, You want to take a look at this HOWTO: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation Those genes you see in the file are features in the genome sequence. Brian O. On Dec 1, 2012, at 1:10 AM, yongli at yeslab.com wrote: > Dear Sir or Madam, > > > > I just begin to learn bioperl and am using a bioperl program to extract a kind of bacterial genes sequences from genome genbank file. I download NC_003450.gbk, ppt, ffn files. NC_003450 is a kind of bacterial genome genbank file. 
My bioperl code as follows: > > > > use Bio::Seq; > > use Bio::SeqIO; > > > > $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank'); > > # $seq_obj=$seqio_obj->next_seq; > > > > while($seq_obj=$seqio_obj->next_seq) > > { > > $display_name=$seq_obj->display_name; > > $desc=$seq_obj->desc; > > $seq=$seq_obj->seq; > > $acc = $seq_obj->accession_number; > > $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' ); > > $seqio_obj->write_seq($seq_obj); > > } > > > > After the program runned I just gain a genome complete sequence without a file including every gene sequence one by one. I want to get the genes sequences one by one in a fasta files. So I write you for help. > > > > Yong Li > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From ankh.egypt.public at googlemail.com Fri Dec 7 15:24:20 2012 From: ankh.egypt.public at googlemail.com (Adrian Helmchen) Date: Fri, 07 Dec 2012 21:24:20 +0100 Subject: [Bioperl-l] proteins from an organism Message-ID: <50C25074.8050703@googlemail.com> Hello, I would like to get all proteins from an organism but proteins from cholorplasts or with chrystal structures or something else. I tried to obtain these proteins by send a query 'Arabidopsis thaliana[organism]' with Bio::DB::GenBank and fetch the gi numbers from the cds. But on the one pc I get 6000 proteins and on another pc I get 46000 proteins although Arabidopsis thaliana has 25000 genes. Thank you for your help. From nikkie.vanbers at gmail.com Mon Dec 10 03:07:27 2012 From: nikkie.vanbers at gmail.com (Nikki2) Date: Mon, 10 Dec 2012 00:07:27 -0800 (PST) Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly database Message-ID: <34761946.post@talk.nabble.com> Hi, I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from 'Tracheophyta' that are NCBI's assembly database. 
However, there are no DocSums returned for the uid's that match the query. When I try the same thing using the genome database it works fine. The script that I used to do the query is at the bottom of this message. The output I get when running the script is: Count = 84 --------------------- WARNING --------------------- MSG: No returned docsums. --------------------------------------------------- I checked the @ids array and it contains the 84 uids. My questions are as follows: 1) Is it possible to get DocSums for uids from the NCBI assembly database, and if yes, how? 2) If not, does anyone have any suggestions how to change my script to get the species-names that match the uids that are returned? Thanks a lot! Nikki ############################################## #!/bin/perl -w use Bio::DB::EUtilities; my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', -db => 'genome', -email => 'my_email at gmail.com', -term => 'Tracheophyta[organism]', -retmax => 5000); print "Count = ",$factory->get_count,"\n"; my @ids = $factory->get_ids; my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary', -email=>'my_email at gmail.com', -db => 'genome', -id => \@ids, ret_max=>5000); while (my $ds = $factory2->next_DocSum) { print "ID: ",$ds->get_id,"\n"; # flattened mode, iterates through all Item objects while (my $item = $ds->next_Item('flattened')) { # not all Items have content, so need to check... printf("%-20s:%s\n",$item->get_name,$item->get_content) if $item->get_content; } print "\n"; } -- View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
From cjfields at illinois.edu Mon Dec 10 10:59:03 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 10 Dec 2012 15:59:03 +0000 Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly database In-Reply-To: <34761946.post@talk.nabble.com> References: <34761946.post@talk.nabble.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF4C164@CHIMBX5.ad.uillinois.edu> Nikki, This is b/c a handful of the databases apparently have switched docsum output completely to the DB-specific DocSum schemata (v2), which have not been implemented in Bio::EUtilities as of yet. This requires quite a bit of revision to parse correctly as it's per database, so I don't have a timeline on when this would be available and would likely be incrementally implemented over time. See here for the announcement: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.Release_Notes In the meantime, you can get the raw XML output for these by replacing the loop for $factory2 with: print $factory2->get_Response->content chris On Dec 10, 2012, at 2:07 AM, Nikki2 wrote: > Hi, > > I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from > 'Tracheophyta' that are NCBI's assembly database. However, there are no > DocSums returned for the uid's that match the query. When I try the same > thing using the genome database it works fine. > > The script that I used to do the query is at the bottom of this message. The > output I get when running the script is: > > Count = 84 > > --------------------- WARNING --------------------- > MSG: No returned docsums. > --------------------------------------------------- > > I checked the @ids array and it contains the 84 uids. > > My questions are as follows: > > 1) Is it possible to get DocSums for uids from the NCBI assembly database, > and if yes, how? > 2) If not, does anyone have any suggestions how to change my script to get > the species-names that match the uids that are returned? > > Thanks a lot! 
> > Nikki > > > > > > > > ############################################## > > #!/bin/perl -w > > use Bio::DB::EUtilities; > > my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', > -db => 'genome', > -email => 'my_email at gmail.com', > -term => 'Tracheophyta[organism]', > -retmax => 5000); > > print "Count = ",$factory->get_count,"\n"; > my @ids = $factory->get_ids; > > my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary', > -email=>'my_email at gmail.com', > -db => 'genome', > -id => \@ids, > ret_max=>5000); > > while (my $ds = $factory2->next_DocSum) { > print "ID: ",$ds->get_id,"\n"; > # flattened mode, iterates through all Item objects > while (my $item = $ds->next_Item('flattened')) { > # not all Items have content, so need to check... > printf("%-20s:%s\n",$item->get_name,$item->get_content) if > $item->get_content; > } > print "\n"; > } > > > -- > View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jason.stajich at gmail.com Wed Dec 12 23:05:29 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 12 Dec 2012 20:05:29 -0800 Subject: [Bioperl-l] Asking In-Reply-To: <201212131130153627348@gmail.com> References: <201212131130153627348@gmail.com> Message-ID: <7ED416EC-622E-4023-94B7-9A11D29929DC@gmail.com> You want the reroot function. Have you tried reading the howtos on the website already. Node is a node in the tree. There are several functions to find a node or iterate through all the ones in the tree Sent from my iPhone-please excuse typos -- Jason Stajich On Dec 12, 2012, at 7:30 PM, "Xing-Xing Shen" wrote: > Drear Jason > I am a green hand in learning Bioperl. 
Now, I met a problem about how to define outgroup for a set of newick trees. > > My codes below: > #!/usr/bin/perl > use Bio::TreeIO; > use Bio::Tree::NodeI; > use Bio::Tree::Tree; > my @filenames = glob("*.txt"); > foreach my $filename (@filenames) { > my $treeio = Bio::TreeIO->new('-format' => 'newick', '-file' => "$filename"); > while( my $tree = $treeio->next_tree ) { > $tree->set_root_node("$node"); # what might $node mean? > .......... > .......... > } > } > > > With best, > > Xing-Xing Shen From j.abbott at imperial.ac.uk Thu Dec 13 14:49:15 2012 From: j.abbott at imperial.ac.uk (James Abbott) Date: Thu, 13 Dec 2012 19:49:15 +0000 Subject: [Bioperl-l] deobfuscator broken.... Message-ID: <50CA313B.9060904@imperial.ac.uk> Hi All, Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator is generating internal server errors. I've also been having problems with broken documentation links (cpan links producning the wrong modules, and pdoc pages missing) but can't seem to replicate that problem now.... I am, for now, still obfuscated... Cheers, James -- Dr. James Abbott Lead Bioinformatician Bioinformatics Support Service Imperial College, London From p.j.a.cock at googlemail.com Thu Dec 13 17:52:44 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 22:52:44 +0000 Subject: [Bioperl-l] deobfuscator broken.... In-Reply-To: <50CA313B.9060904@imperial.ac.uk> References: <50CA313B.9060904@imperial.ac.uk> Message-ID: On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: > Hi All, > > Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator > is generating internal server errors. I've also been having problems with > broken documentation links (cpan links producning the wrong modules, and > pdoc pages missing) but can't seem to replicate that problem now.... > > I am, for now, still obfuscated... 
> > Cheers, > James I would guess this is a side effect from the recent server move, CC'ing root-l in case anyone of the sys-admin team had an idea. Peter From cjfields at illinois.edu Thu Dec 13 17:51:50 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 13 Dec 2012 22:51:50 +0000 Subject: [Bioperl-l] deobfuscator broken.... In-Reply-To: <50CA313B.9060904@imperial.ac.uk> References: <50CA313B.9060904@imperial.ac.uk> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF545DA@CHIMBX5.ad.uillinois.edu> This is likely due to the back-end change in servers. I'm not sure how this was set up but we can inquire about it. chris On Dec 13, 2012, at 1:49 PM, James Abbott wrote: > Hi All, > > Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator is generating internal server errors. I've also been having problems with broken documentation links (cpan links producning the wrong modules, and pdoc pages missing) but can't seem to replicate that problem now.... > > I am, for now, still obfuscated... > > Cheers, > James > -- > Dr. James Abbott > Lead Bioinformatician > Bioinformatics Support Service > Imperial College, London > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu Dec 13 18:13:55 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 13 Dec 2012 23:13:55 +0000 Subject: [Bioperl-l] [Root-l] deobfuscator broken.... In-Reply-To: References: <50CA313B.9060904@imperial.ac.uk> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF546F2@CHIMBX5.ad.uillinois.edu> On Dec 13, 2012, at 4:52 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: >> Hi All, >> >> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator >> is generating internal server errors. 
I've also been having problems with >> broken documentation links (cpan links producning the wrong modules, and >> pdoc pages missing) but can't seem to replicate that problem now.... >> >> I am, for now, still obfuscated... >> >> Cheers, >> James > > I would guess this is a side effect from the recent server move, > CC'ing root-l in case anyone of the sys-admin team had an idea. > > Peter Beat me by four minutes! The CGI code is in websites/bioperl.org/cgi/. I'm checking on the errors now, may take me a little time to get it back up (was missing CGI, now needs to have the lib path extended). chris From jason.stajich at gmail.com Thu Dec 13 18:18:26 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Thu, 13 Dec 2012 15:18:26 -0800 Subject: [Bioperl-l] [Root-l] deobfuscator broken.... In-Reply-To: References: <50CA313B.9060904@imperial.ac.uk> Message-ID: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com> I think it uses mysql but I don't know if that was reconstituted on the new server. On Dec 13, 2012, at 2:52 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: >> Hi All, >> >> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator >> is generating internal server errors. I've also been having problems with >> broken documentation links (cpan links producning the wrong modules, and >> pdoc pages missing) but can't seem to replicate that problem now.... >> >> I am, for now, still obfuscated... >> >> Cheers, >> James > > I would guess this is a side effect from the recent server move, > CC'ing root-l in case anyone of the sys-admin team had an idea. 
> > Peter > _______________________________________________ > Root-l mailing list > Root-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/root-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org
From online at davemessina.com Thu Dec 13 18:41:35 2012 From: online at davemessina.com (Dave Messina) Date: Thu, 13 Dec 2012 18:41:35 -0500 Subject: [Bioperl-l] [Root-l] deobfuscator broken.... In-Reply-To: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com> References: <50CA313B.9060904@imperial.ac.uk> <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com> Message-ID: It should be just (shudder) Berkeley DB. On Dec 13, 2012, at 18:18, Jason Stajich wrote: > I think it uses mysql but I don't know if that was reconstituted on the new server. > > On Dec 13, 2012, at 2:52 PM, Peter Cock wrote: > >> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: >>> Hi All, >>> >>> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator >>> is generating internal server errors. I've also been having problems with >>> broken documentation links (cpan links producning the wrong modules, and >>> pdoc pages missing) but can't seem to replicate that problem now....
>>> >>> I am, for now, still obfuscated... >>> >>> Cheers, >>> James >> >> I would guess this is a side effect from the recent server move, >> CC'ing root-l in case anyone of the sys-admin team had an idea. >> >> Peter >> _______________________________________________ >> Root-l mailing list >> Root-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/root-l > > Jason Stajich > jason.stajich at gmail.com > jason at bioperl.org > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From abualiga2 at gmail.com Tue Dec 18 17:08:51 2012 From: abualiga2 at gmail.com (galeb abu-ali) Date: Tue, 18 Dec 2012 17:08:51 -0500 Subject: [Bioperl-l] Fwd: how to parse maf file format In-Reply-To: References: Message-ID: Hi, I am writing a script to parse a multiple genome alignment file in maf format, generated with mugsy alignment of e.coli genomes. So far, my script parses SNPs from synteny blocks conserved in all aligned strains, and it excludes gaps, which is enough for a phylogenetic analyses. I was wondering how can I parse the remaining blocks that are not conserved in all strains, to see what is conserved in n-1, n-2, etc. strains or unique to each strain. I guess this is not a BioPerl question, but it's a Perl for biologists question so I was hoping to get some insight here. If there is a more appropriate forum, please let me know. Below is my code. many thanks! 
galeb

#!/usr/local/bin/perl
use Modern::Perl '2013';
use autodie;
use List::MoreUtils qw/ each_arrayref /;

# gsa 18.12.2012
# parse mugsy multiple genome alignment for SNPs in synteny blocks conserved in all aligned strains

=head

##maf version=1 scoring=mugsy
a score=7891 label=40 mult=4
s O55H7_RM12579.O55H7_RM12579 1596752 7262 + 5263980 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCG
s O55H7_CB9615.O55H7_CB9615 1604426 7262 + 5386352 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_Sakai.O157H7_Sakai 1787303 7068 + 5498450 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT
s O157H7_EDL.O157H7_EDL933 1729749 7082 + 5528445 CGGGATGCGGGAATGGGAATGCCTTGGTTGACGGGGTGGCGGAAT

a score=6756 label=41 mult=4
s O55H7_RM12579.O55H7_RM12579 1986265 6749 + 5263980 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGG
s O55H7_CB9615.O55H7_CB9615 1991733 6749 + 5386352 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_Sakai.O157H7_Sakai 3940728 6751 - 5498450 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC
s O157H7_EDL.O157H7_EDL933 4260689 4042 - 5528445 ---------------------------------------------

=cut

my $infile = shift or die "Usage: $0 \n";
my %snps;
my $strains = 0;
my @alignment;
my( $score, $blkLen, $mult );
my $total_snps;
my $syn_len;
my %lengths;

open my $fh, '<', $infile;
while( <$fh> ) {
    next if /^#/;
    chomp;
    if( /^a/ ) {
        ( $score, $blkLen, $mult ) = ( split )[1,2,3];
        $score =~ s/score\=(\d+)/$1/;   # length of alignment block including '-'
        $blkLen =~ s/label\=(\d+)/$1/;  # alignment block number; numbers ranked on alignment length
        $mult =~ s/mult\=(\d+)/$1/;     # number of strains aligned in block
        $strains = $mult if $mult > $strains; # total number of strains in alignment
    }
    elsif( /^s/ ) { push @alignment, $_ }
    elsif( /^$/ || ! length $_ ) {
        my( @strNames, @starts, @strands, @dna_mtrx );
        # if sequence conserved in all strains
        if( $strains == @alignment ) {
            $syn_len += $score; # total aligned sequence in all strains
            for( @alignment ) {
                # name, align start, align length (w/o '-'), direction, align sequence w/ '-'
                my( $name, $start, $len, $strand, $dna ) = ( split /\s+/ )[ 1, 2, 3, 4, 6 ];
                #$name =~ s/.*\.(.*)/$1/; # remove duplicated strain name
                # strains are always in same order when all strains in block.
                push @strNames, $name;
                push @starts, $start;
                push @strands, $strand;
                push @dna_mtrx, [ split '', $dna ];
                # total sequence in each strain w/o '-' that is conserved in all strains
                $lengths{ $name } += $len;
            }
            my $ea = each_arrayref( @dna_mtrx );
            my %gaps;
            my $cnt;
            while( my( @bases ) = $ea->() ) {
                ++$cnt;
                my %temp;
                for( 0 .. $#bases ) {
                    # store gaps if any
                    if( $bases[$_] eq '-' ) {
                        $gaps{$_}++; # key is number, corresponds to index of other arrays
                    }
                }
                # skip gaps '-'
                unless( '-' ~~ @bases ) { $temp{ uc $_}++ for @bases }
                # if snp then %temp will have > 1 key
                if( keys %temp > 1 ) {
                    # if SNP exists, get base and position for all strains in alignment
                    ++$total_snps;
                    my $pos;
                    for( 0 .. $#bases ) {
                        if( $strands[$_] eq '+' ) { $pos = $starts[$_] + $cnt - ( $gaps{$_} // 0 ) } # genome position
                        elsif( $strands[$_] eq '-' ) { $pos = $starts[$_] - $cnt - ( $gaps{$_} // 0 ) }
                        # HoAoH
                        push @{ $snps{ $strNames[$_] } }, { $pos => $bases[$_] };
                    }
                }
            }
        }
        @alignment = ();
    }
}
close $fh;
#print Dumper( \%snps ); use Data::Dumper;

say "Sum length of synteny blocks conserved in all strains, including gaps: $syn_len bp";
say "Length of conserved sequence for each strain, excluding gaps:";
for my $strain ( keys %lengths ) {
    say "$strain\t$lengths{ $strain } bp";
}

my $outfile = $infile;
$outfile =~ s/\.maf$/_snps.txt/;
open my $fh2, '>', $outfile;
say {$fh2} map{ $_ . "_base\t", $_ . "_pos\t" } keys %snps;
for my $snp ( 0 .. ( $total_snps - 1 ) ) {
    for my $strain ( keys %snps ){
        for my $href ( keys %{ $snps{ $strain }[ $snp ] } ) {
            print {$fh2} "$snps{ $strain }[ $snp ]->{ $href }\t$href\t";
        }
    }
    print {$fh2} "\n";
}

From sanketd at isquareit.ac.in  Mon Dec 31 01:46:41 2012
From: sanketd at isquareit.ac.in (Sanket Desai)
Date: Mon, 31 Dec 2012 12:16:41 +0530 (IST)
Subject: [Bioperl-l] Help in getting organism names of the nucleotide entries.
Message-ID: <26019826.10871.1356936401744.JavaMail.root@mail.isquareit.ac.in>

Hello,

With respect to the post:
http://bio.perl.org/pipermail/bioperl-l/2009-December/031831.html

When used for the nucleotide database it gives the following error:

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: No linksets returned
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: The -email parameter is now required, per NCBI E-utilities policy
---------------------------------------------------

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: NCBI esummary fatal error: Empty id list - nothing todo
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::Tools::EUtilities::parse_data /usr/share/perl5/Bio/Tools/EUtilities.pm:382
STACK: Bio::Tools::EUtilities::next_DocSum /usr/share/perl5/Bio/Tools/EUtilities.pm:964
STACK: Bio::DB::EUtilities::next_DocSum /usr/share/perl5/Bio/DB/EUtilities.pm:914
STACK: getOrgNameFrmAccession.pl:29
-----------------------------------------------------------

Please suggest the relevant changes in the above script to make it work for the nucleotide entries also.
Thanks in advance, Regards, Sanket From fcyucn at gmail.com Mon Dec 17 20:37:45 2012 From: fcyucn at gmail.com (Fengchao Yu) Date: Tue, 18 Dec 2012 01:37:45 -0000 Subject: [Bioperl-l] Is there any module for the protein digestion? Message-ID: <7b719317-57a3-46ef-927c-6b0508e1e62d@googlegroups.com> I notice that Bio::Restriction::Enzyme is for DNA digest? I wonder if there is any module for protein digestion? Thanks
From cjfields at illinois.edu Tue Dec 4 03:23:50 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 4 Dec 2012 03:23:50 +0000 Subject: [Bioperl-l] Bio::DB::Fasta and threads In-Reply-To: References: <50BC102C.7080902@gmail.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu> On Dec 3, 2012, at 6:29 PM, Leon Timmermans wrote: > On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly wrote: >> The first issue is the serialization of Bio::DB::IndexedBase-inheriting >> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for >> threading (for example when using Thread::Queue::Any). I implemented hooks >> that make it transparent to serialize using Storable freeze() and thaw(). > > I don't think serializing a magical thingie makes much sense. Storable > is commonly used for a lot more things than interthread communication > (e.g. network communication), this would often not work under such > circumstances. > > Leon Leon, any suggestions on alternatives? I know this particular bit is a sore spot with MAKER at the moment, so any help would be greatly appreciated. chris
From yongli at yeslab.com Sat Dec 1 06:10:15 2012 From: yongli at yeslab.com (yongli at yeslab.com) Date: Sat, 1 Dec 2012 14:10:15 +0800 (CST) Subject: [Bioperl-l] question about bioperl program Message-ID: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com> Dear Sir or Madam, I just begin to learn bioperl and am using a bioperl program to extract a kind of bacterial genes sequences from genome genbank file. I download NC_003450.gbk, ppt, ffn files. NC_003450 is a kind of bacterial genome genbank file.
My bioperl code as follows: use Bio::Seq; use Bio::SeqIO; $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank'); # $seq_obj=$seqio_obj->next_seq; while($seq_obj=$seqio_obj->next_seq) { $display_name=$seq_obj->display_name; $desc=$seq_obj->desc; $seq=$seq_obj->seq; $acc = $seq_obj->accession_number; $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' ); $seqio_obj->write_seq($seq_obj); } After the program runned I just gain a genome complete sequence without a file including every gene sequence one by one. I want to get the genes sequences one by one in a fasta files. So I write you for help. Yong Li From carsonhh at gmail.com Tue Dec 4 03:35:50 2012 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 03 Dec 2012 22:35:50 -0500 Subject: [Bioperl-l] Bio::DB::Fasta and threads In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF45670@CHIMBX5.ad.uillinois.edu> Message-ID: Bio::DB::Fasta is working for maker now. The previous issues have been fixed, but being as Florent has gone out of his way to build a number of improvements into Bio::DB::Fasta over the past few weeks, this seemed like a useful one as well, so I suggested it. One of the big uses of Bio::DB::Fasta is the Bio::PrimarySeq::Fasta features it creates. They are great for manipulating the sequence without actually having to ever keep it in memory. It's nice because the sequence is made available on demand, but when you try and pass them between threads, your program falls apart. There are creative work arounds, but simply adding a serialization hook to Bio::DB::Fasta to disconnect the database on freezing and then reconnect on thaw also fixes it, and it makes them extremely useful for multi-threaded applications without having to go through other kinds of work arounds (it just makes them work as expected with serialization). Previously I had created my own module and inherited from Bio::DB::Fasta so I could implement the Storable hooks. 
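A minimal sketch of the freeze/thaw hook idea described above, using only the core Storable module. My::IndexedHandle and its _connect method are hypothetical stand-ins and not part of BioPerl, and the "handle" here is a plain string rather than a real DBM connection, so the actual hooks in Bio::DB::IndexedBase may well look different.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for an index-backed object such as Bio::DB::Fasta.
# This is NOT BioPerl code; it only illustrates the Storable hook mechanism.
package My::IndexedHandle;

use Storable ();

sub new {
    my ($class, $path) = @_;
    my $self = bless { path => $path }, $class;
    $self->_connect;
    return $self;
}

# Pretend to open the index. A real class would tie a DBM file here.
sub _connect {
    my ($self) = @_;
    $self->{handle} = "open:$self->{path}";
    return;
}

# Called by Storable when the object is frozen: serialize only the path
# and leave the live handle behind.
sub STORABLE_freeze {
    my ($self, $cloning) = @_;
    return $self->{path};
}

# Called by Storable on thaw with an empty blessed object: restore the
# path and reconnect. A real class could throw an error at this point if
# the indexed file is no longer reachable.
sub STORABLE_thaw {
    my ($self, $cloning, $serialized) = @_;
    $self->{path} = $serialized;
    $self->_connect;
    return;
}

package main;

# Round-trip the object while it is nested inside another structure.
my $db   = My::IndexedHandle->new('genome.fasta');
my $copy = Storable::thaw( Storable::freeze( { nested => [$db] } ) );
print $copy->{nested}[0]{handle}, "\n";
```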
Because Storable looks for the hooks in anything it serializes, the Bio::DB::Fasta object can even be well down inside of a complex object and you don't have worry about it. Previously I've used Storable hooks to pass the Bio::PrimarySeq::Fasta features across the network using MPI, as long as the database is on an NFS mount it just reconnects on the other node with no issue. If the indexed file isn't available after deserialization over a network, you could just throw an error when the thaw hook is called. I'll give Florent's changes a look over soon to give any suggestions. Thanks, Carson On 12-12-03 10:23 PM, "Fields, Christopher J" wrote: >On Dec 3, 2012, at 6:29 PM, Leon Timmermans > wrote: > >> On Mon, Dec 3, 2012 at 3:36 AM, Florent Angly >>wrote: >>> The first issue is the serialization of Bio::DB::IndexedBase-inheriting >>> (e.g. Bio::DB::Fasta and Bio::DB::Qual) objects, which is needed for >>> threading (for example when using Thread::Queue::Any). I implemented >>>hooks >>> that make it transparent to serialize using Storable freeze() and >>>thaw(). >> >> I don't think serializing a magical thingie makes much sense. Storable >> is commonly used for a lot more things than interthread communication >> (e.g. network communication), this would often not work under such >> circumstances. >> >> Leon > >Leon, any suggestions on alternatives? I know this particular bit is a >sore spot with MAKER at the moment, so any help would be greatly >appreciated. > >chris > From jason.r.gallant at gmail.com Tue Dec 4 20:23:02 2012 From: jason.r.gallant at gmail.com (Jason Gallant) Date: Tue, 4 Dec 2012 12:23:02 -0800 (PST) Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header Message-ID: Hello, I'm trying to retreive fasta sequences that contain a colon in their header. However, I cannot get my BioPerl script to do this!! It works as expected when the header does not contain the colon, however doesn't return anything when it does. 
Weirdly, when I ask it to return the parsed IDs (see below), it returns the appropriate IDs, which include the colon! Very confusing, would appreciate any help!! Many Thanks, Jason Gallant use strict; use Bio::SearchIO; use Bio::DB::Fasta; my ($file,$id,$start,$end) = ("secondround_merged_expanded.fasta","C7047455:0-100",1,10); my $db = Bio::DB::Fasta->new($file, -reindex=>1); my $seq = $db->seq($id,$start,$end); print $db->ids; print $seq,"\n"; From asjo at koldfront.dk Tue Dec 4 20:53:08 2012 From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=) Date: Tue, 04 Dec 2012 21:53:08 +0100 Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access In-Reply-To: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> (Francesco Musacchia's message of "Wed, 28 Nov 2012 02:27:16 -0800 (PST)") References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> Message-ID: <87y5hdletn.fsf@topper.koldfront.dk> On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote: > I'm experiencing that when I have to do a lot of accessess on a GFF > database (with Bio:DB::SeqFeature::Store) the slowness increase until > my script can stay running for more than a day. First you'll need to find out what/where exactly it is slow. One way to do so is using a a profiler; this is a good one for Perl: * https://metacpan.org/module/Devel::NYTProf If you want more specific suggestions, you'll probably have to provide more information. Good luck! Adam -- "As Knuth pointed out long ago, speed only matters Adam Sj?gren in certain critical bottlenecks. And as many asjo at koldfront.dk programmers have observed since, one is very often mistaken about where these bottlenecks are." 
From cjfields at illinois.edu Tue Dec 4 21:10:00 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 4 Dec 2012 21:10:00 +0000 Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access In-Reply-To: <87y5hdletn.fsf@topper.koldfront.dk> References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> <87y5hdletn.fsf@topper.koldfront.dk> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu> On Dec 4, 2012, at 2:53 PM, Adam Sjøgren wrote: > On Wed, 28 Nov 2012 02:27:16 -0800 (PST), Francesco wrote: > >> I'm experiencing that when I have to do a lot of accessess on a GFF >> database (with Bio:DB::SeqFeature::Store) the slowness increase until >> my script can stay running for more than a day. > > First you'll need to find out what/where exactly it is slow. One way to > do so is using a a profiler; this is a good one for Perl: > > * https://metacpan.org/module/Devel::NYTProf > > If you want more specific suggestions, you'll probably have to provide > more information. > > > Good luck! > > Adam If anything, we need more profiling of Bioperl code. Ah, if we only had infinite time... :) chris From asjo at koldfront.dk Tue Dec 4 21:33:55 2012 From: asjo at koldfront.dk (Adam Sjøgren) Date: Tue, 04 Dec 2012 22:33:55 +0100 Subject: [Bioperl-l] Slowness of Bioperl with GFF3 database access In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu> (Christopher J. Fields's message of "Tue, 4 Dec 2012 21:10:00 +0000") References: <183c0b5d-248f-4166-936f-cecd8bc00da8@googlegroups.com> <87y5hdletn.fsf@topper.koldfront.dk> <118F034CF4C3EF48A96F86CE585B94BF4CF46281@CHIMBX5.ad.uillinois.edu> Message-ID: <87txs1jyd8.fsf@topper.koldfront.dk> On Tue, 4 Dec 2012 21:10:00 +0000, Fields, wrote: > If anything, we need more profiling of Bioperl code. Ah, if we only > had infinite time... :) If we had that, we wouldn't need profiling! ;-), Adam -- "On the quiet side. Somewhat peculiar. A good companion, in a weird sort of way." Adam Sjøgren asjo at koldfront.dk
From florent.angly at gmail.com Tue Dec 4 21:52:41 2012 From: florent.angly at gmail.com (Florent Angly) Date: Wed, 05 Dec 2012 07:52:41 +1000 Subject: [Bioperl-l] Problem with BIO::DB::FASTA and Colon in Fasta Header In-Reply-To: References: Message-ID: <50BE70A9.4060404@gmail.com> Hi Jason, See the documentation for seq() at http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/DB/Fasta.pm#OBJECT_METHODS . When you call seq() with a single argument, e.g. $db->seq('C7047455:0-100'), Bio::DB::Fasta interprets it as a compound ID and looks for positions 0 to 100 of a sequence called C7047455. This is a feature that has been in Bio::DB::Fasta since the dawn of time. In this form, seq() expects a colon as part of the compound ID, which is problematic because your sequence ID actually contains a colon. I think that when you call $db->seq($id,$start,$end), Bio::DB::Fasta does not attempt to parse your ID. This is why your code works with this form. Note that if you want to get the entirety of a sequence called 'C7047455:0-100', the easiest approach when your sequence names contain colons is to use $db->get_Seq_by_id('C7047455:0-100'), since get_Seq_by_id() only takes a regular ID (not a compound one). Florent On 05/12/12 06:23, Jason Gallant wrote: > Hello, > > I'm trying to retreive fasta sequences that contain a colon in their > header. However, I cannot get my BioPerl script to do this!! > > It works as expected when the header does not contain the colon, however > doesn't return anything when it does. Weirdly, when I ask it to return the > parsed IDs (see below), it returns the appropriate IDs, which include the > colon! Very confusing, would appreciate any help!!
> > Many Thanks, > Jason Gallant > > > use strict; > use Bio::SearchIO; > use Bio::DB::Fasta; > > > my ($file,$id,$start,$end) = > ("secondround_merged_expanded.fasta","C7047455:0-100",1,10); > > > my $db = Bio::DB::Fasta->new($file, -reindex=>1); > my $seq = $db->seq($id,$start,$end); > > print $db->ids; > > print $seq,"\n"; > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bosborne11 at verizon.net Tue Dec 4 22:12:59 2012 From: bosborne11 at verizon.net (Brian Osborne) Date: Tue, 04 Dec 2012 17:12:59 -0500 Subject: [Bioperl-l] question about bioperl program In-Reply-To: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com> References: <57225.114.91.212.55.1354342215.bossmail@mail.yeslab.com> Message-ID: <16BBC477-9935-4C79-A70D-6B18716089FB@verizon.net> Yong Li, You want to take a look at this HOWTO: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation Those genes you see in the file are features in the genome sequence. Brian O. On Dec 1, 2012, at 1:10 AM, yongli at yeslab.com wrote: > Dear Sir or Madam, > > > > I just begin to learn bioperl and am using a bioperl program to extract a kind of bacterial genes sequences from genome genbank file. I download NC_003450.gbk, ppt, ffn files. NC_003450 is a kind of bacterial genome genbank file. 
My bioperl code as follows: > > > > use Bio::Seq; > > use Bio::SeqIO; > > > > $seqio_obj=Bio::SeqIO->new (-file=>'NC_003450.gbk',-format=>'genbank'); > > # $seq_obj=$seqio_obj->next_seq; > > > > while($seq_obj=$seqio_obj->next_seq) > > { > > $display_name=$seq_obj->display_name; > > $desc=$seq_obj->desc; > > $seq=$seq_obj->seq; > > $acc = $seq_obj->accession_number; > > $seqio_obj = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'fasta' ); > > $seqio_obj->write_seq($seq_obj); > > } > > > > After the program runned I just gain a genome complete sequence without a file including every gene sequence one by one. I want to get the genes sequences one by one in a fasta files. So I write you for help. > > > > Yong Li From ankh.egypt.public at googlemail.com Fri Dec 7 20:24:20 2012 From: ankh.egypt.public at googlemail.com (Adrian Helmchen) Date: Fri, 07 Dec 2012 21:24:20 +0100 Subject: [Bioperl-l] proteins from an organism Message-ID: <50C25074.8050703@googlemail.com> Hello, I would like to get all proteins from an organism, but not proteins from chloroplasts or with crystal structures or something else. I tried to obtain these proteins by sending the query 'Arabidopsis thaliana[organism]' with Bio::DB::GenBank and fetching the GI numbers from the CDS. But on one PC I get 6000 proteins and on another PC I get 46000 proteins, although Arabidopsis thaliana has 25000 genes. Thank you for your help. From nikkie.vanbers at gmail.com Mon Dec 10 08:07:27 2012 From: nikkie.vanbers at gmail.com (Nikki2) Date: Mon, 10 Dec 2012 00:07:27 -0800 (PST) Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly database Message-ID: <34761946.post@talk.nabble.com> Hi, I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from 'Tracheophyta' that are in NCBI's assembly database.
However, there are no DocSums returned for the UIDs that match the query. When I try the same thing using the genome database it works fine. The script that I used to do the query is at the bottom of this message. The output I get when running the script is: Count = 84 --------------------- WARNING --------------------- MSG: No returned docsums. --------------------------------------------------- I checked the @ids array and it contains the 84 UIDs. My questions are as follows: 1) Is it possible to get DocSums for UIDs from the NCBI assembly database, and if yes, how? 2) If not, does anyone have any suggestions how to change my script to get the species names that match the UIDs that are returned? Thanks a lot! Nikki ############################################## #!/bin/perl -w use Bio::DB::EUtilities; my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', -db => 'genome', -email => 'my_email at gmail.com', -term => 'Tracheophyta[organism]', -retmax => 5000); print "Count = ",$factory->get_count,"\n"; my @ids = $factory->get_ids; my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary', -email => 'my_email at gmail.com', -db => 'genome', -id => \@ids, -retmax => 5000); while (my $ds = $factory2->next_DocSum) { print "ID: ",$ds->get_id,"\n"; # flattened mode, iterates through all Item objects while (my $item = $ds->next_Item('flattened')) { # not all Items have content, so need to check... printf("%-20s:%s\n",$item->get_name,$item->get_content) if $item->get_content; } print "\n"; } -- View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
From cjfields at illinois.edu Mon Dec 10 15:59:03 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 10 Dec 2012 15:59:03 +0000 Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly database In-Reply-To: <34761946.post@talk.nabble.com> References: <34761946.post@talk.nabble.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF4C164@CHIMBX5.ad.uillinois.edu> Nikki, This is b/c a handful of the databases apparently have switched docsum output completely to the DB-specific DocSum schemata (v2), which have not been implemented in Bio::EUtilities as of yet. This requires quite a bit of revision to parse correctly as it's per database, so I don't have a timeline on when this would be available and would likely be incrementally implemented over time. See here for the announcement: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.Release_Notes In the meantime, you can get the raw XML output for these by replacing the loop for $factory2 with: print $factory2->get_Response->content chris On Dec 10, 2012, at 2:07 AM, Nikki2 wrote: > Hi, > > I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from > 'Tracheophyta' that are NCBI's assembly database. However, there are no > DocSums returned for the uid's that match the query. When I try the same > thing using the genome database it works fine. > > The script that I used to do the query is at the bottom of this message. The > output I get when running the script is: > > Count = 84 > > --------------------- WARNING --------------------- > MSG: No returned docsums. > --------------------------------------------------- > > I checked the @ids array and it contains the 84 uids. > > My questions are as follows: > > 1) Is it possible to get DocSums for uids from the NCBI assembly database, > and if yes, how? > 2) If not, does anyone have any suggestions how to change my script to get > the species-names that match the uids that are returned? > > Thanks a lot! 
> > Nikki > > > > > > > > ############################################## > > #!/bin/perl -w > > use Bio::DB::EUtilities; > > my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', > -db => 'genome', > -email => 'my_email at gmail.com', > -term => 'Tracheophyta[organism]', > -retmax => 5000); > > print "Count = ",$factory->get_count,"\n"; > my @ids = $factory->get_ids; > > my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary', > -email=>'my_email at gmail.com', > -db => 'genome', > -id => \@ids, > ret_max=>5000); > > while (my $ds = $factory2->next_DocSum) { > print "ID: ",$ds->get_id,"\n"; > # flattened mode, iterates through all Item objects > while (my $item = $ds->next_Item('flattened')) { > # not all Items have content, so need to check... > printf("%-20s:%s\n",$item->get_name,$item->get_content) if > $item->get_content; > } > print "\n"; > } > > > -- > View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jason.stajich at gmail.com Thu Dec 13 04:05:29 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 12 Dec 2012 20:05:29 -0800 Subject: [Bioperl-l] Asking In-Reply-To: <201212131130153627348@gmail.com> References: <201212131130153627348@gmail.com> Message-ID: <7ED416EC-622E-4023-94B7-9A11D29929DC@gmail.com> You want the reroot function. Have you tried reading the howtos on the website already. Node is a node in the tree. There are several functions to find a node or iterate through all the ones in the tree Sent from my iPhone-please excuse typos -- Jason Stajich On Dec 12, 2012, at 7:30 PM, "Xing-Xing Shen" wrote: > Drear Jason > I am a green hand in learning Bioperl. 
Now, I met a problem about how to define outgroup for a set of newick trees. > > My codes below: > #!/usr/bin/perl > use Bio::TreeIO; > use Bio::Tree::NodeI; > use Bio::Tree::Tree; > my @filenames = glob("*.txt"); > foreach my $filename (@filenames) { > my $treeio = Bio::TreeIO->new('-format' => 'newick', '-file' => "$filename"); > while( my $tree = $treeio->next_tree ) { > $tree->set_root_node("$node"); # what might $node mean? > .......... > .......... > } > } > > > With best, > > Xing-Xing Shen From j.abbott at imperial.ac.uk Thu Dec 13 19:49:15 2012 From: j.abbott at imperial.ac.uk (James Abbott) Date: Thu, 13 Dec 2012 19:49:15 +0000 Subject: [Bioperl-l] deobfuscator broken.... Message-ID: <50CA313B.9060904@imperial.ac.uk> Hi All, Don't know if any of the admin folk are aware, but the bioperl.org deobfuscator is generating internal server errors. I've also been having problems with broken documentation links (CPAN links producing the wrong modules, and pdoc pages missing) but can't seem to replicate that problem now.... I am, for now, still obfuscated... Cheers, James -- Dr. James Abbott Lead Bioinformatician Bioinformatics Support Service Imperial College, London From p.j.a.cock at googlemail.com Thu Dec 13 22:52:44 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 22:52:44 +0000 Subject: [Bioperl-l] deobfuscator broken.... In-Reply-To: <50CA313B.9060904@imperial.ac.uk> References: <50CA313B.9060904@imperial.ac.uk> Message-ID: On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: > Hi All, > > Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator > is generating internal server errors. I've also been having problems with > broken documentation links (cpan links producning the wrong modules, and > pdoc pages missing) but can't seem to replicate that problem now.... > > I am, for now, still obfuscated...
> > Cheers, > James I would guess this is a side effect from the recent server move, CC'ing root-l in case anyone of the sys-admin team had an idea. Peter From cjfields at illinois.edu Thu Dec 13 22:51:50 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 13 Dec 2012 22:51:50 +0000 Subject: [Bioperl-l] deobfuscator broken.... In-Reply-To: <50CA313B.9060904@imperial.ac.uk> References: <50CA313B.9060904@imperial.ac.uk> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF545DA@CHIMBX5.ad.uillinois.edu> This is likely due to the back-end change in servers. I'm not sure how this was set up but we can inquire about it. chris On Dec 13, 2012, at 1:49 PM, James Abbott wrote: > Hi All, > > Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator is generating internal server errors. I've also been having problems with broken documentation links (cpan links producning the wrong modules, and pdoc pages missing) but can't seem to replicate that problem now.... > > I am, for now, still obfuscated... > > Cheers, > James > -- > Dr. James Abbott > Lead Bioinformatician > Bioinformatics Support Service > Imperial College, London > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu Dec 13 23:13:55 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 13 Dec 2012 23:13:55 +0000 Subject: [Bioperl-l] [Root-l] deobfuscator broken.... In-Reply-To: References: <50CA313B.9060904@imperial.ac.uk> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF4CF546F2@CHIMBX5.ad.uillinois.edu> On Dec 13, 2012, at 4:52 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: >> Hi All, >> >> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator >> is generating internal server errors. 
I've also been having problems with >> broken documentation links (cpan links producning the wrong modules, and >> pdoc pages missing) but can't seem to replicate that problem now.... >> >> I am, for now, still obfuscated... >> >> Cheers, >> James > > I would guess this is a side effect from the recent server move, > CC'ing root-l in case anyone of the sys-admin team had an idea. > > Peter Beat me by four minutes! The CGI code is in websites/bioperl.org/cgi/. I'm checking on the errors now, may take me a little time to get it back up (was missing CGI, now needs to have the lib path extended). chris From jason.stajich at gmail.com Thu Dec 13 23:18:26 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Thu, 13 Dec 2012 15:18:26 -0800 Subject: [Bioperl-l] [Root-l] deobfuscator broken.... In-Reply-To: References: <50CA313B.9060904@imperial.ac.uk> Message-ID: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com> I think it uses mysql but I don't know if that was reconstituted on the new server. On Dec 13, 2012, at 2:52 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: >> Hi All, >> >> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator >> is generating internal server errors. I've also been having problems with >> broken documentation links (cpan links producning the wrong modules, and >> pdoc pages missing) but can't seem to replicate that problem now.... >> >> I am, for now, still obfuscated... >> >> Cheers, >> James > > I would guess this is a side effect from the recent server move, > CC'ing root-l in case anyone of the sys-admin team had an idea. 
> > Peter > _______________________________________________ > Root-l mailing list > Root-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/root-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From nikkie.vanbers at gmail.com Wed Dec 5 14:04:09 2012 From: nikkie.vanbers at gmail.com (Nikki2) Date: Wed, 5 Dec 2012 06:04:09 -0800 (PST) Subject: [Bioperl-l] Eutilities and no DocSums returned from NCBI assembly database Message-ID: <34761946.post@talk.nabble.com> Hi, I'm using 'Bio::DB::EUtilities' in order to retrieve all the names from 'Tracheophyta' that are NCBI's assembly database. However, there are no DocSums returned for the uid's that match the query. When I try the same thing using the genome database it works fine. The script that I used to do the query is at the bottom of this message. The output I get when running the script is: Count = 84 --------------------- WARNING --------------------- MSG: No returned docsums. --------------------------------------------------- I checked the @ids array and it contains the 84 uids. My questions are as follows: 1) Is it possible to get DocSums for uids from the NCBI assembly database, and if yes, how? 2) If not, does anyone have any suggestions how to change my script to get the species-names that match the uids that are returned? Thanks a lot! 
Nikki ############################################## #!/bin/perl -w use Bio::DB::EUtilities; my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', -db => 'genome', -email => 'my_email at gmail.com', -term => 'Tracheophyta[organism]', -retmax => 5000); print "Count = ",$factory->get_count,"\n"; my @ids = $factory->get_ids; my $factory2 = Bio::DB::EUtilities->new(-eutil => 'esummary', -email=>'my_email at gmail.com', -db => 'genome', -id => \@ids, ret_max=>5000); while (my $ds = $factory2->next_DocSum) { print "ID: ",$ds->get_id,"\n"; # flattened mode, iterates through all Item objects while (my $item = $ds->next_Item('flattened')) { # not all Items have content, so need to check... printf("%-20s:%s\n",$item->get_name,$item->get_content) if $item->get_content; } print "\n"; } -- View this message in context: http://old.nabble.com/Eutilities-and-no-DocSums-returned-from-NCBI-assembly-database-tp34761946p34761946.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From online at davemessina.com Thu Dec 13 23:41:35 2012 From: online at davemessina.com (Dave Messina) Date: Thu, 13 Dec 2012 18:41:35 -0500 Subject: [Bioperl-l] [Root-l] deobfuscator broken.... In-Reply-To: <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com> References: <50CA313B.9060904@imperial.ac.uk> <0F706FA4-68B8-4E50-BDEC-BF8632929BB6@gmail.com> Message-ID: It should be just (shudder) Berkeley DB. On Dec 13, 2012, at 18:18, Jason Stajich wrote: > I think it uses mysql but I don't know if that was reconstituted on the new server. > > On Dec 13, 2012, at 2:52 PM, Peter Cock wrote: > >> On Thu, Dec 13, 2012 at 7:49 PM, James Abbott wrote: >>> Hi All, >>> >>> Don't know if anyone admin folk are aware, but the bioperl.org deobfuscator >>> is generating internal server errors. I've also been having problems with >>> broken documentation links (cpan links producning the wrong modules, and >>> pdoc pages missing) but can't seem to replicate that problem now.... 
>>> >>> I am, for now, still obfuscated... >>> >>> Cheers, >>> James >> >> I would guess this is a side effect from the recent server move, >> CC'ing root-l in case anyone of the sys-admin team had an idea. >> >> Peter >> _______________________________________________ >> Root-l mailing list >> Root-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/root-l > > Jason Stajich > jason.stajich at gmail.com > jason at bioperl.org > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From abualiga2 at gmail.com Tue Dec 18 22:08:51 2012 From: abualiga2 at gmail.com (galeb abu-ali) Date: Tue, 18 Dec 2012 17:08:51 -0500 Subject: [Bioperl-l] Fwd: how to parse maf file format In-Reply-To: References: Message-ID: Hi, I am writing a script to parse a multiple genome alignment file in maf format, generated with mugsy alignment of e.coli genomes. So far, my script parses SNPs from synteny blocks conserved in all aligned strains, and it excludes gaps, which is enough for a phylogenetic analyses. I was wondering how can I parse the remaining blocks that are not conserved in all strains, to see what is conserved in n-1, n-2, etc. strains or unique to each strain. I guess this is not a BioPerl question, but it's a Perl for biologists question so I was hoping to get some insight here. If there is a more appropriate forum, please let me know. Below is my code. many thanks! 
galeb #!/usr/local/bin/perl use Modern::Perl '2013'; use autodie; use List::MoreUtils qw/ each_arrayref /; # gsa 18.12.2012 # parse mugsy multiple genome alignment for SNPs in synteny blocks conserved in all aligned strains =head ##maf version=1 scoring=mugsy a score=7891 label=40 mult=4 s O55H7_RM12579.O55H7_RM12579 1596752 7262 + 5263980 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCG s O55H7_CB9615.O55H7_CB9615 1604426 7262 + 5386352 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT s O157H7_Sakai.O157H7_Sakai 1787303 7068 + 5498450 CGGGATGCGGGGATGGGAATGCC-TGGTTGACGGGGTGGCGG-AT s O157H7_EDL.O157H7_EDL933 1729749 7082 + 5528445 CGGGATGCGGGAATGGGAATGCCTTGGTTGACGGGGTGGCGGAAT a score=6756 label=41 mult=4 s O55H7_RM12579.O55H7_RM12579 1986265 6749 + 5263980 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGG s O55H7_CB9615.O55H7_CB9615 1991733 6749 + 5386352 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC s O157H7_Sakai.O157H7_Sakai 3940728 6751 - 5498450 CAGGAGGGGCATCAGCTCACACCGACAGCCCCTGCGTATGGTTAC s O157H7_EDL.O157H7_EDL933 4260689 4042 - 5528445 --------------------------------------------- =cut my $infile = shift or die "Usage: $0 \n"; my %snps; my $strains = 0; my @alignment; my( $score, $blkLen, $mult ); my $total_snps; my $syn_len; my %lengths; open my $fh, '<', $infile; while( <$fh> ) { next if /^#/; chomp; if( /^a/ ) { ( $score, $blkLen, $mult ) = ( split )[1,2,3]; $score =~ s/score\=(\d+)/$1/; # length of alignment block including '-' $blkLen =~ s/label\=(\d+)/$1/; # alignment block number; numbers ranked on alignment length $mult =~ s/mult\=(\d+)/$1/;# number of strains aligned in block $strains = $mult if $mult > $strains; # total number of strains in alignment } elsif( /^s/ ) { push @alignment, $_ } elsif( /^$/ || ! 
length $_ ) { my( @strNames, @starts, @strands, @dna_mtrx ); # if sequence conserved in all strains if( $strains == @alignment ) { $syn_len += $score; # total aligned sequence in all strains for( @alignment ) { # name, align start, align length (w/o '-'), direction, align sequence w/ '-' my( $name, $start, $len, $strand, $dna ) = ( split /\s+/ )[ 1, 2, 3, 4, 6 ]; #$name =~ s/.*\.(.*)/$1/; # remove duplicated strain name # strains are always in same order when all strains in block. push @strNames, $name; push @starts, $start; push @strands, $strand; push @dna_mtrx, [ split '', $dna ]; # total seqeunce in each strain w/o '-' that is conserved in all strains $lengths{ $name } += $len; } my $ea = each_arrayref( @dna_mtrx ); my %gaps; my $cnt; while( my( @bases ) = $ea->() ) { ++$cnt; my %temp; for( 0 .. $#bases ) { # store gaps if any if( $bases[$_] eq '-' ) { $gaps{$_}++; # key is number, corresponds to index of other arrays } } # skip gaps '-' unless( '-' ~~ @bases ) { $temp{ uc $_}++ for @bases } # if snp then %temp will have > 1 key if( keys %temp > 1 ) { # if SNP exists, get base and position for all strains in alignment ++$total_snps; my $pos; for( 0 .. $#bases ) { if( $strands[$_] eq '+' ) { $pos = $starts[$_] + $cnt - ( $gaps{$_} // 0 ) } # genome positn elsif( $strands[$_] eq '-' ) { $pos = $starts[$_] - $cnt - ( $gaps{$_} // 0 ) } # HoAoH push @{ $snps{ $strNames[$_] } }, { $pos => $bases[$_] }; } } } } @alignment = (); } } close $fh; #print Dumper( \%snps ); use Data::Dumper; say "Sum length of synteny blocks conserved in all strains, including gaps: $syn_len bp"; say "Length of conserved sequence for each strain, excluding gaps:"; for my $strain ( keys %lengths ) { say "$strain\t$lengths{ $strain } bp"; } my $outfile = $infile; $outfile =~ s/\.maf$/_snps.txt/; open my $fh2, '>', $outfile; say {$fh2} map{ $_ . "_base\t", $_ . "_pos\t" } keys %snps; for my $snp ( 0 .. 
( $total_snps - 1 ) ) { for my $strain ( keys %snps ){ for my $href ( keys %{ $snps{ $strain }[ $snp ] } ) { print {$fh2} "$snps{ $strain }[ $snp ]->{ $href }\t$href\t"; } } print {$fh2} "\n"; } From sanketd at isquareit.ac.in Mon Dec 31 06:46:41 2012 From: sanketd at isquareit.ac.in (Sanket Desai) Date: Mon, 31 Dec 2012 12:16:41 +0530 (IST) Subject: [Bioperl-l] Help in getting organism names of the nucleotide entries. Message-ID: <26019826.10871.1356936401744.JavaMail.root@mail.isquareit.ac.in> Hello, With respect to the post: http://bio.perl.org/pipermail/bioperl-l/2009-December/031831.html When used for nucleotide database it gives the following error: --------------------- WARNING --------------------- MSG: The -email parameter is now required, per NCBI E-utilities policy --------------------------------------------------- --------------------- WARNING --------------------- MSG: No linksets returned --------------------------------------------------- --------------------- WARNING --------------------- MSG: The -email parameter is now required, per NCBI E-utilities policy --------------------------------------------------- ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: NCBI esummary fatal error: Empty id list - nothing todo STACK: Error::throw STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472 STACK: Bio::Tools::EUtilities::parse_data /usr/share/perl5/Bio/Tools/EUtilities.pm:382 STACK: Bio::Tools::EUtilities::next_DocSum /usr/share/perl5/Bio/Tools/EUtilities.pm:964 STACK: Bio::DB::EUtilities::next_DocSum /usr/share/perl5/Bio/DB/EUtilities.pm:914 STACK: getOrgNameFrmAccession.pl:29 ----------------------------------------------------------- Please suggest the relevant changes in the above script to make it work for the nucleotide entries also. 
Thanks in advance, Regards, Sanket From fcyucn at gmail.com Tue Dec 18 01:37:45 2012 From: fcyucn at gmail.com (Fengchao Yu) Date: Tue, 18 Dec 2012 01:37:45 -0000 Subject: [Bioperl-l] Is there any module for the protein digestion? Message-ID: <7b719317-57a3-46ef-927c-6b0508e1e62d@googlegroups.com> I notice that Bio::Restriction::Enzyme is for DNA digestion. Is there any module for protein digestion? Thanks
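For simple cases, an in-silico protein digest can be done with a regular expression rather than a dedicated module. The sketch below assumes the commonly cited trypsin rule (cleave after K or R unless the next residue is P) and ignores missed cleavages. `digest_trypsin` is a hypothetical helper, not a BioPerl function, and the input sequence is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# digest_trypsin() is a hypothetical helper, not part of BioPerl.
# Assumed rule: cleave on the C-terminal side of K or R,
# except when the following residue is P.
sub digest_trypsin {
    my ($protein) = @_;
    # Zero-width split point: preceded by K or R, not followed by P.
    return split /(?<=[KR])(?!P)/, uc $protein;
}

# Made-up example sequence.
my @peptides = digest_trypsin('MKWVTFISLLRPGAKDEF');
print "$_\n" for @peptides;
# prints:
#   MK
#   WVTFISLLRPGAK
#   DEF
```

Other proteases would need a different pattern, so the regular expression is the only part that encodes the enzyme's specificity.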