From tarakaramji at gmail.com Wed May 2 09:14:47 2012
From: tarakaramji at gmail.com (tarakaramji M)
Date: Wed, 2 May 2012 18:44:47 +0530
Subject: [Bioperl-l] to retrieve Upstream and downstream genes -Reg
Message-ID:

Dear all,
I am new to using BioPerl modules. Could you please guide me in retrieving the
upstream and downstream genes of an operon if a set of sequences or
accession numbers is given?
Thanks in advance

--
M. Taraka Ramji
Int. PhD
Department of Biological Sciences
Indian Institute of Science Education and Research Kolkata

From carandraug+dev at gmail.com Wed May 2 10:20:49 2012
From: carandraug+dev at gmail.com (Carnë Draug)
Date: Wed, 2 May 2012 15:20:49 +0100
Subject: [Bioperl-l] to retrieve Upstream and downstream genes -Reg
In-Reply-To:
References:
Message-ID:

On 2 May 2012 14:14, tarakaramji M wrote:
> Dear all,
> I am new to using BioPerl modules. Could you please guide me in retrieving the
> upstream and downstream genes of an operon if a set of sequences or
> accession numbers is given?
> Thanks in advance

Hi Tarak,

Look into the bp_genbank_ref_extractor script, which is already in BioPerl:
https://github.com/bioperl/bioperl-live/blob/master/scripts/Bio-DB-EUtilities/bp_genbank_ref_extractor.pl
You can configure how many downstream and upstream base pairs to get; just
read the documentation.

From hnorpois at googlemail.com Wed May 2 13:42:05 2012
From: hnorpois at googlemail.com (Hermann Norpois)
Date: Wed, 2 May 2012 19:42:05 +0200
Subject: [Bioperl-l] get geneID for gene names
Message-ID:

Hello,

I wish to get gene IDs for gene names (e.g. bdnf, copg). I thought it was a
good idea to use Bio::DB::EUtilities (see below) and addressed UNISTS as the
database because there it was quite easy to find the gene ID. So far I have
been unable to retrieve the gene ID from UNISTS. Could anybody give me a hint
how to proceed? The cookbook ... Yes, I was trying.
#!/bin/perl -w use Bio::DB::EUtilities; my $name = "Copg"; my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch', -db => 'unists', -term => '$name AND mouse [ORGN]', -email => 'hnorpois at mpipsykl.mpg.de' ) Thank you Hermann Norpois From cjfields at illinois.edu Wed May 2 13:55:29 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 2 May 2012 17:55:29 +0000 Subject: [Bioperl-l] get geneID for gene names In-Reply-To: References: Message-ID: <9D9805AD-8B53-4E51-81B9-00CB65F891AE@illinois.edu> Hermann, The below works for me (note I'm using esearch, not efetch). To actually get the records you will use efetch and the IDs obtained below. chris ------------------------------ my $name = "Copg"; my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', -db => 'unists', -term => '$name AND mouse [ORGN]', -email => '', ); print join(',',$factory->get_ids)."\n"; On May 2, 2012, at 12:42 PM, Hermann Norpois wrote: > Hello, > > I wish to get gene IDs for gene names (e.g. bdnf, copg). I thought it was a > good idea to use Bio::DB::EUtilities (see below) and addressed UNISTS as > database because there it was quite easy to find the gene ID. So far I was > unable to retrieve the gene ID from UNISTS. Could anybody give me a hint > how to proceed? The cookbook ... Yes, I was trying. 
> > #!/bin/perl -w > > use Bio::DB::EUtilities; > > my $name = "Copg"; > my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch', > -db => 'unists', > -term => '$name AND mouse [ORGN]', > -email => 'hnorpois at mpipsykl.mpg.de' > ) > > > Thank you > Hermann Norpois > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Wed May 2 14:03:34 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 2 May 2012 18:03:34 +0000 Subject: [Bioperl-l] get geneID for gene names In-Reply-To: <9D9805AD-8B53-4E51-81B9-00CB65F891AE@illinois.edu> References: <9D9805AD-8B53-4E51-81B9-00CB65F891AE@illinois.edu> Message-ID: Also, a small but very significant bug is in the below. Can you spot it? The '-term' value is in single quotes, these need to be double-quotes to interpolate $name. Otherwise, it is literally looking for '$name'. chris On May 2, 2012, at 12:55 PM, Christopher Fields wrote: > Hermann, > > The below works for me (note I'm using esearch, not efetch). To actually get the records you will use efetch and the IDs obtained below. > > chris > > ------------------------------ > my $name = "Copg"; > my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', > -db => 'unists', > -term => '$name AND mouse [ORGN]', > -email => '', > ); > > print join(',',$factory->get_ids)."\n"; > > > On May 2, 2012, at 12:42 PM, Hermann Norpois wrote: > >> Hello, >> >> I wish to get gene IDs for gene names (e.g. bdnf, copg). I thought it was a >> good idea to use Bio::DB::EUtilities (see below) and addressed UNISTS as >> database because there it was quite easy to find the gene ID. So far I was >> unable to retrieve the gene ID from UNISTS. Could anybody give me a hint >> how to proceed? The cookbook ... Yes, I was trying. 
>> >> #!/bin/perl -w >> >> use Bio::DB::EUtilities; >> >> my $name = "Copg"; >> my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch', >> -db => 'unists', >> -term => '$name AND mouse [ORGN]', >> -email => 'hnorpois at mpipsykl.mpg.de' >> ) >> >> >> Thank you >> Hermann Norpois >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > From hnorpois at googlemail.com Wed May 2 17:00:56 2012 From: hnorpois at googlemail.com (Hermann Norpois) Date: Wed, 2 May 2012 23:00:56 +0200 Subject: [Bioperl-l] get geneID for gene names In-Reply-To: References: <9D9805AD-8B53-4E51-81B9-00CB65F891AE@illinois.edu> Message-ID: Thank you very much. But there still is a problem. This is my output: 525211,210532,167498,142652 I get some ids (the first one is the UniSTS ID, the following ... I do not know) but there is no gene ID. If you compare to the following link: http://www.ncbi.nlm.nih.gov/genome/sts/sts.cgi?uid=525211 The gene ID should be 54161 . This is my (your) script: #!/bin/perl -w use Bio::DB::EUtilities; my $name = "Copg"; my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', -db => 'unists', -term => "$name AND Mus musculus [ORGN]", -email => 'hnorpois at mpipsykl.mpg.de', ); print join(',',$factory->get_ids)."\n"; 2012/5/2 Fields, Christopher J > Also, a small but very significant bug is in the below. Can you spot it? > > The '-term' value is in single quotes, these need to be double-quotes to > interpolate $name. Otherwise, it is literally looking for '$name'. > > chris > > On May 2, 2012, at 12:55 PM, Christopher Fields wrote: > > > Hermann, > > > > The below works for me (note I'm using esearch, not efetch). To > actually get the records you will use efetch and the IDs obtained below. 
> > > > chris > > > > ------------------------------ > > my $name = "Copg"; > > my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', > > -db => 'unists', > > -term => '$name AND mouse [ORGN]', > > -email => '', > > ); > > > > print join(',',$factory->get_ids)."\n"; > > > > > > On May 2, 2012, at 12:42 PM, Hermann Norpois wrote: > > > >> Hello, > >> > >> I wish to get gene IDs for gene names (e.g. bdnf, copg). I thought it > was a > >> good idea to use Bio::DB::EUtilities (see below) and addressed UNISTS as > >> database because there it was quite easy to find the gene ID. So far I > was > >> unable to retrieve the gene ID from UNISTS. Could anybody give me a hint > >> how to proceed? The cookbook ... Yes, I was trying. > >> > >> #!/bin/perl -w > >> > >> use Bio::DB::EUtilities; > >> > >> my $name = "Copg"; > >> my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch', > >> -db => 'unists', > >> -term => '$name AND mouse > [ORGN]', > >> -email => ' > hnorpois at mpipsykl.mpg.de' > >> ) > >> > >> > >> Thank you > >> Hermann Norpois > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > From Russell.Smithies at agresearch.co.nz Wed May 2 19:12:14 2012 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Thu, 3 May 2012 11:12:14 +1200 Subject: [Bioperl-l] get geneID for gene names In-Reply-To: References: <9D9805AD-8B53-4E51-81B9-00CB65F891AE@illinois.edu> Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CCECE6738@exchsth.agresearch.co.nz> If you're looking for gene information, why are you searching UniSTS? Unless I've overlooked something, wouldn't it be more useful to search the "gene" database and tighten up your query a bit? 
#!/bin/perl use strict; use warnings; use Bio::DB::EUtilities; my $factory = Bio::DB::EUtilities->new( -eutil => 'esearch', -db => 'gene', -term => '(copg[Gene Name]) AND mouse[Organism]', -email => 'hnorpois at mpipsykl.mpg.de', -usehistory => 'y' ); my $hist = $factory->next_History || die "No history data returned"; $factory->set_parameters( -eutil => 'efetch', -history => $hist ); print $factory->get_Response->content; --Russell -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Hermann Norpois Sent: Thursday, 3 May 2012 9:01 a.m. To: Fields, Christopher J Cc: Subject: Re: [Bioperl-l] get geneID for gene names Thank you very much. But there still is a problem. This is my output: 525211,210532,167498,142652 I get some ids (the first one is the UniSTS ID, the following ... I do not know) but there is no gene ID. If you compare to the following link: http://www.ncbi.nlm.nih.gov/genome/sts/sts.cgi?uid=525211 The gene ID should be 54161 . This is my (your) script: #!/bin/perl -w use Bio::DB::EUtilities; my $name = "Copg"; my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', -db => 'unists', -term => "$name AND Mus musculus [ORGN]", -email => 'hnorpois at mpipsykl.mpg.de', ); print join(',',$factory->get_ids)."\n"; 2012/5/2 Fields, Christopher J > Also, a small but very significant bug is in the below. Can you spot it? > > The '-term' value is in single quotes, these need to be double-quotes > to interpolate $name. Otherwise, it is literally looking for '$name'. > > chris > > On May 2, 2012, at 12:55 PM, Christopher Fields wrote: > > > Hermann, > > > > The below works for me (note I'm using esearch, not efetch). To > actually get the records you will use efetch and the IDs obtained below. 
> > > > chris > > > > ------------------------------ > > my $name = "Copg"; > > my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', > > -db => 'unists', > > -term => '$name AND mouse [ORGN]', > > -email => '', > > ); > > > > print join(',',$factory->get_ids)."\n"; > > > > > > On May 2, 2012, at 12:42 PM, Hermann Norpois wrote: > > > >> Hello, > >> > >> I wish to get gene IDs for gene names (e.g. bdnf, copg). I thought > >> it > was a > >> good idea to use Bio::DB::EUtilities (see below) and addressed > >> UNISTS as database because there it was quite easy to find the gene > >> ID. So far I > was > >> unable to retrieve the gene ID from UNISTS. Could anybody give me a > >> hint how to proceed? The cookbook ... Yes, I was trying. > >> > >> #!/bin/perl -w > >> > >> use Bio::DB::EUtilities; > >> > >> my $name = "Copg"; > >> my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch', > >> -db => 'unists', > >> -term => '$name AND mouse > [ORGN]', > >> -email => ' > hnorpois at mpipsykl.mpg.de' > >> ) > >> > >> > >> Thank you > >> Hermann Norpois > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. 
======================================================================= From prasadms693 at gmail.com Fri May 4 01:13:23 2012 From: prasadms693 at gmail.com (Prasad ms) Date: Fri, 4 May 2012 10:43:23 +0530 Subject: [Bioperl-l] About bioperl global alignment Message-ID: Hello sir, I am Prasad, student of MS in bioinformatics. I am doing my final year project, and sequence alignment is the part of my project. I am having nearly 50k sequences and i want to do a pairwise global alignment (NW alignment). I read the bioperl tutorial. But in that there is no mention about this. Could you please guide how can i do this type of alignment using bioperl. I assure that all the usage is purely for academic. Looking forward to hear from you. Thank you, Regards, Prasad MS From fs5 at sanger.ac.uk Fri May 4 02:31:20 2012 From: fs5 at sanger.ac.uk (Frank Schwach) Date: Fri, 04 May 2012 07:31:20 +0100 Subject: [Bioperl-l] About bioperl global alignment In-Reply-To: References: Message-ID: <4FA377B8.8000103@sanger.ac.uk> Hi Prasad, have a look at this: http://www.bioperl.org/wiki/HOWTO:AlignIO_and_SimpleAlign#Aligning_multiple_sequences_with_Clustalw.pm_and_TCoffee.pm The HOWTO pages are a brilliant source of information and starting points for your own scripts. The main point here is: Bioperl doesn't do any alignment. It provides the tools to automate making alignments (or other things) with third-party software such as ClustalW or TCoffee as described in the text. You need to install those programs locally and then use Bioperl to go fetch sequences form your FASTA (or whatever format) sequence files and run them through the aligner, then use more Bioperl methods to extract data from the alignments and generate your final results. 
That makes it possible, for example, to write a script that extracts every possible pair of sequences from a FASTA file and run them through ClustalW, then analyse the results and record the percent identity or whatever you are interested in and generate a spreadsheet with your final results, ready to be sent to Nature :-) !!!! Feel free to ask if you need more help. Good luck! Frank On 04/05/12 06:13, Prasad ms wrote: > Hello sir, > I am Prasad, student of MS in bioinformatics. I am doing my final year > project, and sequence alignment is the part of my project. I am having > nearly 50k sequences and i want to do a pairwise global alignment (NW > alignment). I read the bioperl tutorial. But in that there is no mention > about this. Could you please guide how can i do this type of alignment > using bioperl. > I assure that all the usage is purely for academic. > > Looking forward to hear from you. > > Thank you, > > Regards, > Prasad MS > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From hnorpois at googlemail.com Fri May 4 08:09:52 2012 From: hnorpois at googlemail.com (Hermann Norpois) Date: Fri, 4 May 2012 14:09:52 +0200 Subject: [Bioperl-l] get geneID for gene names In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CCECE6738@exchsth.agresearch.co.nz> References: <9D9805AD-8B53-4E51-81B9-00CB65F891AE@illinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CCECE6738@exchsth.agresearch.co.nz> Message-ID: Thank you. I am very happy with -db `gene'. Originally I thought -db unists was less ambigious. I combined the suggestions. 
So my script is:

#!/bin/perl

use Bio::DB::EUtilities;

open (OUT, "> geneID_list");
open (OUT2, "> genename_ID_list");

while (<>) {
    $name = $_;
    my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
                                           -db    => 'gene',
                                           -term  => "$name [Gene Name] AND Mus musculus [Organism]",
                                           -email => 'hnorpois at mpipsykl.mpg.de',
                                           );
    my @ids = $factory->get_ids;
    # print "$name\t@ids[0]\n";
    my $geneids = join(',',@ids); # For the case there is more than one ID.
    print "Fetching GENEID\t$geneids for GENE NAME\t$name\n";
    print OUT "$geneids\n";
    # my %name_id = ($name =>$geneids);
    print OUT2 "$geneids\t$name";
}

But there still is something I do not understand. It is not important but ...
$geneids seems to include "\n". Because this is what I get on the screen:

Fetching GENEID 54161 for GENE NAME copg

Fetching GENEID 12064 for GENE NAME bdnf

Fetching GENEID 71661 for GENE NAME 0610005C13RIK

Fetching GENEID 382908 for GENE NAME LOC382908

Fetching GENEID 54633 for GENE NAME PQBP1

Fetching GENEID 258908 for GENE NAME MOR154-1

Thanks
Hermann Norpois

2012/5/3 Smithies, Russell

> If you're looking for gene information, why are you searching UniSTS?
> Unless I've overlooked something, wouldn't it be more useful to search the
> "gene" database and tighten up your query a bit?
>
> #!/bin/perl
> use strict;
> use warnings;
>
> use Bio::DB::EUtilities;
>
> my $factory = Bio::DB::EUtilities->new(
>     -eutil => 'esearch',
>     -db => 'gene',
>     -term => '(copg[Gene Name]) AND mouse[Organism]',
>     -email => 'hnorpois at mpipsykl.mpg.de',
>     -usehistory => 'y'
> );
>
> my $hist = $factory->next_History || die "No history data returned";
>
> $factory->set_parameters(
>     -eutil => 'efetch',
>     -history => $hist
> );
>
> print $factory->get_Response->content;
>
>
> --Russell
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:
> bioperl-l-bounces at lists.open-bio.org] On Behalf Of Hermann Norpois
> Sent: Thursday, 3 May 2012 9:01 a.m.
> To: Fields, Christopher J > Cc: > Subject: Re: [Bioperl-l] get geneID for gene names > > Thank you very much. But there still is a problem. > > This is my output: > 525211,210532,167498,142652 > > I get some ids (the first one is the UniSTS ID, the following ... I do not > know) but there is no gene ID. If you compare to the following link: > http://www.ncbi.nlm.nih.gov/genome/sts/sts.cgi?uid=525211 The gene ID > should be 54161< > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=54161 > > > . > > This is my (your) script: > > #!/bin/perl -w > > use Bio::DB::EUtilities; > > my $name = "Copg"; > my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', > -db => 'unists', > -term => "$name AND Mus musculus > [ORGN]", > -email => 'hnorpois at mpipsykl.mpg.de', > ); > > print join(',',$factory->get_ids)."\n"; > > > > 2012/5/2 Fields, Christopher J > > > Also, a small but very significant bug is in the below. Can you spot it? > > > > The '-term' value is in single quotes, these need to be double-quotes > > to interpolate $name. Otherwise, it is literally looking for '$name'. > > > > chris > > > > On May 2, 2012, at 12:55 PM, Christopher Fields wrote: > > > > > Hermann, > > > > > > The below works for me (note I'm using esearch, not efetch). To > > actually get the records you will use efetch and the IDs obtained below. > > > > > > chris > > > > > > ------------------------------ > > > my $name = "Copg"; > > > my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch', > > > -db => 'unists', > > > -term => '$name AND mouse > [ORGN]', > > > -email => '', > > > ); > > > > > > print join(',',$factory->get_ids)."\n"; > > > > > > > > > On May 2, 2012, at 12:42 PM, Hermann Norpois wrote: > > > > > >> Hello, > > >> > > >> I wish to get gene IDs for gene names (e.g. bdnf, copg). 
I thought > > >> it > > was a > > >> good idea to use Bio::DB::EUtilities (see below) and addressed > > >> UNISTS as database because there it was quite easy to find the gene > > >> ID. So far I > > was > > >> unable to retrieve the gene ID from UNISTS. Could anybody give me a > > >> hint how to proceed? The cookbook ... Yes, I was trying. > > >> > > >> #!/bin/perl -w > > >> > > >> use Bio::DB::EUtilities; > > >> > > >> my $name = "Copg"; > > >> my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch', > > >> -db => 'unists', > > >> -term => '$name AND mouse > > [ORGN]', > > >> -email => ' > > hnorpois at mpipsykl.mpg.de' > > >> ) > > >> > > >> > > >> Thank you > > >> Hermann Norpois > > >> _______________________________________________ > > >> Bioperl-l mailing list > > >> Bioperl-l at lists.open-bio.org > > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. 
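
A note on the stray line break asked about in the message above: `while (<>)` returns each input line with its trailing newline still attached, so it is most likely $name (not $geneids) that carries the "\n" into both the screen output and the -term string. What follows is only a minimal sketch of the usual fix, assuming one gene name per input line. The helpers clean_name and build_term are hypothetical names made up for this illustration; they are not part of BioPerl, and no NCBI query is sent here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): tidy one line of input.
sub clean_name {
    my ($line) = @_;
    chomp $line;                 # drop the trailing "\n" left by <>
    $line =~ s/^\s+|\s+$//g;     # also trim stray spaces and a Windows "\r"
    return $line;
}

# Hypothetical helper: build the same esearch term as the script above.
# Concatenation is used so Perl cannot mistake "[...]" for an array subscript.
sub build_term {
    my ($name) = @_;
    return $name . ' [Gene Name] AND Mus musculus [Organism]';
}

# Demo on literal strings instead of STDIN, so it runs without any input file.
for my $line ("copg\n", "bdnf\r\n", "\n") {
    my $name = clean_name($line);
    next unless length $name;    # skip blank lines
    # In the real script this string would be passed as -term.
    print "GENE NAME\t$name\tTERM\t", build_term($name), "\n";
}
```

In the original loop this would amount to calling chomp on $name right after `$name = $_;` and adding an explicit "\n" to the `print OUT2` line, since that line currently relies on the newline hidden inside $name.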
> =======================================================================
>
From biorges at gmail.com Wed May 2 13:49:12 2012
From: biorges at gmail.com (guillermo romero)
Date: Wed, 2 May 2012 12:49:12 -0500
Subject: [Bioperl-l] Abot prot_param code :)
Message-ID:

To whom it may concern,

I am reading your prot param source code (
http://cpansearch.perl.org/src/CJFIELDS/BioPerl-1.6.901/Bio/Tools/Protparam.pm)
and I am wondering how to use it: How do I execute it? Which parameters are
needed to use it?

Sincere thanks

Cheers

Guillermo Romero
University of Mexico

From maquino at knome.com Wed May 2 21:39:54 2012
From: maquino at knome.com (Mark Aquino)
Date: Wed, 2 May 2012 21:39:54 -0400
Subject: [Bioperl-l] Count read depths at specific loci w/Bio::DB::Sam
Message-ID:

Hi all,

I'm a little stumped as to how to successfully count the depths of all reads
at a specific locus in a sam/bam file. I know I can do this with GATK
DepthOfCoverage, but I wanted to do some more customized things with my
script, yet I haven't figured out how to get the right base. I was a bit
surprised there wasn't a method (or that it's not well documented) to get an
individual read's base at a specific position, while getting the $refbase is
quite easy. (I'm betting such a method exists and is just not documented
well.) At any rate, gaps in the alignment are the cause of my problems, so if
anyone knows a simpler way to call the bases correctly, or a clever algorithm
to deal with this issue, it would be much appreciated.

Here's what I have for code. It works except in cases where there are
multiple gaps in the reference sequence, e.g. the alignment below should be
T-T here, not C-C, but is shifted due to the second gap.

#!/progs/bin/perl
use strict;
use warnings;
use Bio::DB::Sam;
use Bio::DB::Bam::AlignWrapper;
use Pod::Usage;
use Getopt::Long;
use Bio::DB::Bam::Pileup;
use Term::ANSIColor;

# $BAM, $FASTA, $chr and $pos are not set in this excerpt; presumably they
# come from command-line options.
my $sam = Bio::DB::Sam->new(-bam =>$BAM, -fasta=> $FASTA);
getBases($chr, $pos, $pos);

sub getBases {
    my $print = 1;
    my ($chr, $start_query, $end_query) = @_;
    my @alignments = $sam->get_features_by_location(-seq_id => $chr,
                                                    -start  => $start_query,
                                                    -end    => $end_query);
    my $refbase;
    my ($a_count, $t_count, $g_count, $c_count, $n_count, $del_count, $ins_count) = (0, 0, 0, 0, 0, 0, 0);
    for my $a (@alignments) {
        my $start = $a->start;
        my $end = $a->end;
        my $query_start = $a->query->start;
        my $query_end = $a->query->end;
        my $ref_dna = $a->dna; # reference sequence bases
        my ($ref, $matches, $query) = $a->padded_alignment;
        my $offset = 0;
        if ($ref =~ /^([-]+)[ATCG]+/){
            $offset = length($1);
        }
        #print "$offset\n";
        $refbase = $sam->segment($chr,$start_query,$start_query)->dna;
        printAlignment($ref, $matches, $query, $start_query, $start, $offset);
        my $base = substr($query, $start_query-$start+$offset, 1);
        if (!$base){
            next;
        }
        $a_count++ if ($base eq "A");
        $t_count++ if ($base eq "T");
        $c_count++ if ($base eq "C");
        $g_count++ if ($base eq "G");
        $n_count++ if ($base eq "N");
        $del_count++ if ($base eq "-");
        my @scores = $a->qscore; # per-base quality scores
        my $match_qual= $a->qual; # quality of the match
    }
    my $total_depth = $a_count + $t_count + $c_count + $g_count + $n_count + $del_count;
    if ($print == 1){
        # print "$start_query\tref base: $ref_base\n";
        print "$chr:$start_query($refbase)\t";
        print "A:$a_count\t";
        print "T:$t_count\t";
        print "C:$c_count\t";
        print "G:$g_count\t";
        print "N:$n_count\t";
        print "D:$del_count\t";
        print "Total:$total_depth\n";
    }
    return ($a_count, $t_count, $c_count, $g_count, $n_count, $del_count, $ins_count);
}

sub printAlignment{
    my ($ref, $matches, $query, $start_query, $start, $offset) = @_;
    print substr($ref, 0, $start_query-$start+$offset);
    print (color("red"), substr($ref, $start_query-$start+$offset, 1), color("reset"));
    print substr($ref, $start_query-$start+$offset+1),"\n";
    print substr($matches, 0, $start_query-$start+$offset);
    print (color("red"), substr($matches, $start_query-$start+$offset, 1), color("reset"));
    print substr($matches, $start_query-$start+$offset+1),"\n";
    print substr($query, 0, $start_query-$start+$offset);
    print (color("red"), substr($query, $start_query-$start+$offset, 1), color("reset"));
    print substr($query, $start_query-$start+$offset+1),"\n";
}

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Untitled.tiff
Type: image/tiff
Size: 6222 bytes
Desc: not available

From scott at scottcain.net Fri May 4 11:28:25 2012
From: scott at scottcain.net (Scott Cain)
Date: Fri, 4 May 2012 11:28:25 -0400
Subject: [Bioperl-l] GMOD 2013 meeting location survey
Message-ID:

Hello,

As we are trying to plan for next year's GMOD meeting, we would like to
decide between two venues as soon as possible. To help us decide, we've put
together a simple survey. We are asking for your help in deciding between:

* San Diego, California in January before or after the Plant and Animal
  Genomes meeting
* Cambridge, England in April before or after the International Society of
  Biocurators meeting.

Each option has its upsides: the Plant and Animal Genomes is a large meeting
attended by several members of the GMOD community, so it would likely have a
fairly high attendance. On the other hand, having a meeting in Cambridge
would make it easier for European members of the GMOD community to attend.

Please share your thoughts with us and take this survey.

http://www.surveymonkey.com/s/CPC25P5

Thanks,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D.    scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)
216-392-3087 Ontario Institute for Cancer Research From david.garcia at insp.mx Fri May 4 10:13:32 2012 From: david.garcia at insp.mx (David Efrain Garcia Lopez) Date: Fri, 4 May 2012 09:13:32 -0500 Subject: [Bioperl-l] doubt withc protparam.pm Message-ID: <755E3412FDAB3248859418918BD76463FAB2DB2A8F@CCR01.redinsp.insp.mx> Hello, I need your help to use protparam.pm, I want to know how use, because I'm very interested in the library, Could you help me please? give me an example or something, Thank you very much, Cheers David TSU. David Garcia Lopez Bioinformatica Instuto Nacional de Salud Publica Cuernavaca, Morelos, Mexico. 3293000(2732) From fs5 at sanger.ac.uk Fri May 4 12:19:39 2012 From: fs5 at sanger.ac.uk (Frank Schwach) Date: Fri, 04 May 2012 17:19:39 +0100 Subject: [Bioperl-l] Fwd: Bioperl for global alignment In-Reply-To: References: Message-ID: <4FA4019B.1020406@sanger.ac.uk> Prasad, did you not get my reply? Here it is again just to be on the safe side: Hi Prasad, have a look at this: http://www.bioperl.org/wiki/HOWTO:AlignIO_and_SimpleAlign#Aligning_multiple_sequences_with_Clustalw.pm_and_TCoffee.pm The HOWTO pages are a brilliant source of information and starting points for your own scripts. The main point here is: Bioperl doesn't do any alignment. It provides the tools to automate making alignments (or other things) with third-party software such as ClustalW or TCoffee as described in the text. You need to install those programs locally and then use Bioperl to go fetch sequences form your FASTA (or whatever format) sequence files and run them through the aligner, then use more Bioperl methods to extract data from the alignments and generate your final results. 
That makes it possible, for example, to write a script that extracts every possible pair of sequences from a FASTA file and run them through ClustalW, then analyse the results and record the percent identity or whatever you are interested in and generate a spreadsheet with your final results, ready to be sent to Nature !!!! Feel free to ask if you need more help. Good luck! Frank On 04/05/12 06:13, Prasad ms wrote: Hello sir, I am Prasad, student of MS in bioinformatics. I am doing my final year project, and sequence alignment is the part of my project. I am having nearly 50k sequences and i want to do a pairwise global alignment (NW alignment). I read the bioperl tutorial. But in that there is no mention about this. Could you please guide how can i do this type of alignment using bioperl. I assure that all the usage is purely for academic. Looking forward to hear from you. Thank you, Regards, Prasad MS _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l On 30/04/12 06:40, prasad ms wrote: > Hello sir, > I am Prasad, student of MS in bioinformatics. I am doing my final year > project, and sequence alignment is the part of my project. I am having > nearly 50k sequences and i want to do a pairwise global alignment (NW > alignment). I read the bioperl tutorial. But in that there is no mention > about this. Could you please guide how can i do this type of alignment > using bioperl. > I assure that all the usage is purely for academic. > > Looking forward to hear from you. 
> > Thank you, > > Regards, > Prasad MS > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From j_martin at lbl.gov Fri May 4 12:54:46 2012 From: j_martin at lbl.gov (Joel Martin) Date: Fri, 4 May 2012 09:54:46 -0700 Subject: [Bioperl-l] Abot prot_param code :) In-Reply-To: References: Message-ID: first, use bioperl-live. the url for protparam has changed since 1.6.901 was released. second, the example code is a bit off, replace the start with this, it's just adding the use lines and changing "my $pp = Protparam..." to "my $pp = Bio::Tools::Protparam..." use Bio::DB::GenBank; use Bio::Tools::Protparam; my $gb = new Bio::DB::GenBank(-retrievaltype => 'tempfile' , -format => 'Fasta'); my @ids=qw(O14521 O43709 O43826); my $seqio = $gb->get_Stream_by_acc(\@ids ); while( my $seq = $seqio->next_seq ) { my $pp = Bio::Tools::Protparam->new(seq=>$seq->seq); # it's correct from that line on down On Wed, May 2, 2012 at 10:49 AM, guillermo romero wrote: > To whom it may concern, > > I am reading your prot param source code ( > http://cpansearch.perl.org/src/CJFIELDS/BioPerl-1.6.901/Bio/Tools/Protparam.pm) > and i am wondering how to use it: How do I execute it? Which are the > parameters needed to use it? 
>
> Sincere thanks
>
> Cheers
>
>
> Guillermo Romero
> University of Mexico
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From j_martin at lbl.gov Fri May 4 13:32:23 2012
From: j_martin at lbl.gov (Joel Martin)
Date: Fri, 4 May 2012 10:32:23 -0700
Subject: [Bioperl-l] Abot prot_param code :)
In-Reply-To:
References:
Message-ID:

you need to install bioperl-live, the protparam website has changed
since the cpan version of bioperl was released so you need the most up
to date version of bioperl. the instructions for installing are at
http://www.bioperl.org/wiki/Using_Git

Joel

On Fri, May 4, 2012 at 10:06 AM, guillermo romero wrote:
> Thanks for your reply !!!
>
> I have just run your suggestions but an error appears:
>
> EXCEPTION: Bio::Root::Exception -------------
> MSG: http://www.expasy.org/cgi-bin/protparam error: 301 Moved Permanently
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:472
> STACK: Bio::Tools::Protparam::new /usr/share/perl5/Bio/Tools/Protparam.pm:128
> STACK: prog_protparam.pl:14
>
> Would you recommend me to do something else?
>
> Thanks again :)
>
>
> On 4 May 2012 11:54, Joel Martin wrote:
>>
>> first, use bioperl-live. the url for protparam has changed since
>> 1.6.901 was released.
>>
>> second, the example code is a bit off, replace the start with this,
>> it's just adding the
>> use lines and changing "my $pp = Protparam..." to "my $pp =
>> Bio::Tools::Protparam..."
>>
>> use Bio::DB::GenBank;
>> use Bio::Tools::Protparam;
>>
>> my $gb = new Bio::DB::GenBank(-retrievaltype => 'tempfile' ,
>>                               -format => 'Fasta');
>> my @ids=qw(O14521 O43709 O43826);
>> my $seqio = $gb->get_Stream_by_acc(\@ids );
>> while( my $seq = $seqio->next_seq ) {
>>     my $pp = Bio::Tools::Protparam->new(seq=>$seq->seq);
>> # it's correct from that line on down
>>
>> On Wed, May 2, 2012 at 10:49 AM, guillermo romero
>> wrote:
>> > To whom it may concern,
>> >
>> > I am reading your prot param source code (
>> >
>> > http://cpansearch.perl.org/src/CJFIELDS/BioPerl-1.6.901/Bio/Tools/Protparam.pm)
>> > and i am wondering how to use it: How do I execute it? Which are the
>> > parameters needed to use it?
>> >
>> > Sincere thanks
>> >
>> > Cheers
>> >
>> >
>> > Guillermo Romero
>> > University of Mexico
>> > _______________________________________________
>> > Bioperl-l mailing list
>> > Bioperl-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

From jimhu at tamu.edu Fri May 4 14:08:02 2012
From: jimhu at tamu.edu (Jim Hu)
Date: Fri, 4 May 2012 13:08:02 -0500
Subject: [Bioperl-l] Teaching with BioPerl this summer
Message-ID: <83B972F9-0612-467B-880D-D83852358CDB@tamu.edu>

We (Rodolfo Aramayo and I) are going to teach an intensive summer course for undergrads at Texas A&M built around the idea of creating a learning community of students who will be able to help faculty with simple research and teaching projects involving computational biology. I think I've convinced Rodolfo that this should be built around Perl because of BioPerl and the fact that we know more helpful people in the BioPerl community than for other BioYourFavoriteOtherLanguage groups.

Consider this a warning that this means I am likely to send a lot of questions to the list!

---
The first dumb question:
For my own use, I tend to use BioPerl-live. But I thought it might be wiser to just use BioPerl 1.6.901 installed via cpan, so I looked at

http://www.bioperl.org/wiki/Installing_Bioperl_for_Unix#INSTALLING_BIOPERL_THE_EASY_WAY_USING_CPAN

and I'm wondering: why so complicated?
Why not just cpan>force install Bundle::BioPerl --- Anyway, in case anyone is interested my tentative plan (which has to become a real plan in a couple of weeks) is: - start with Hello World, but do it two ways: the usual STDOUT to shell way, plus a cgi-bin using Template Toolkit Actually, before we write hello.pl, I'm going to introduce perldoc. This will introduce the idea of instantiating objects and sending messages to them. - next have them use Perl to calculate a simple math function (factorials). Again, make a shell and cgi-bin version. For the cgi, use templates based on the RGraph javascript library to plot the data in an HTML5 canvas. I like the idea of using HTML5 instead of GD making png files based on not having to link to images in a tmp directory. This will introduce loops, arrays, and join - use CGI.pm for both shell and web input, ignoring shift etc. Have them modify their graphing program to accept different ranges. Note that I will NOT use CGI.pm to output HTML. I think it doesn't separate logic from presentation enough. - introduce BioPerl - Use BioPerl and the graphing they learned earlier to count things about genomes they grab via SeqIO. Mostly I'm thinking of having them make histograms (or pie charts). Possible examples: histogram showing distribution of CDS sizes histogram showing distribution of # of introns histogram showing distribution of lengths of 5' or 3' UTRs. histogram showing CDS's using different start and stop codons and identities of aa 2. I will probably use some of the other stuff from the HOWTOs, and I have the pdf of the CSHL course notes. I also am thinking about having Perl write code to have paper.js draw things, but first I need to figure out how to do that myself. We will make all this stuff publicly available via a website soon. Feedback, suggestions, and ridicule are all welcome! 
I'd especially be interested if anyone has experience in using Subversion to have students turn in assignments by committing their branches. Jim ===================================== Jim Hu Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054 From cjfields at illinois.edu Fri May 4 15:17:17 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 4 May 2012 19:17:17 +0000 Subject: [Bioperl-l] Teaching with BioPerl this summer In-Reply-To: <83B972F9-0612-467B-880D-D83852358CDB@tamu.edu> References: <83B972F9-0612-467B-880D-D83852358CDB@tamu.edu> Message-ID: On May 4, 2012, at 1:08 PM, Jim Hu wrote: > We (Rodolfo Aramayo and I) are going to teach an intensive summer course for undergrads at Texas A&M built around the idea of creating a learning community of students who will be able to help faculty with simple research and teaching projects involving computational biology. I think I've convinced Rodolfo that this should be built around Perl because of BioPerl and the fact that we know more helpful people in the BioPerl community than for other BioYourFavoriteOtherLanguage groups. > > Consider this a warning that this means I am likely to send a lot of questions to the list! No problem, we'll do our best to help. > --- > The first dumb question: > For my own use, I tend to use BioPerl-live. But I thought it might be wiser to just use BioPerl 1.6.901 installed via cpan, so I looked at > > http://www.bioperl.org/wiki/Installing_Bioperl_for_Unix#INSTALLING_BIOPERL_THE_EASY_WAY_USING_CPAN > > and I'm wondering: why so complicated? Why not just > > cpan>force install Bundle::BioPerl I wouldn't suggest force-installing some modules (XS-based ones for instance). That aside, I don't think Bundle::Bioperl is up-to-date, so it's possible even if you install it there will be problems (e.g. missing deps added after the last Bundle::BioPerl update). 
I can ask Chris Dagdigian about this, he's listed as the current maintainer. You could try something like './Build installdeps' from bioperl-live. Theoretically, it should work, just haven't tested it. Finding discrepancies like this is a good thing to note, though. The basic BioPerl documentation needs to be updated, and to include new tools like 'cpanm' and so on. > Anyway, in case anyone is interested my tentative plan (which has to become a real plan in a couple of weeks) is: > > - start with Hello World, but do it two ways: the usual STDOUT to shell way, plus a cgi-bin using Template Toolkit > Actually, before we write hello.pl, I'm going to introduce perldoc. > This will introduce the idea of instantiating objects and sending messages to them. > - next have them use Perl to calculate a simple math function (factorials). Again, make a shell and cgi-bin version. For the cgi, use templates based on the RGraph javascript library to plot the data in an HTML5 canvas. I like the idea of using HTML5 instead of GD making png files based on not having to link to images in a tmp directory. > This will introduce loops, arrays, and join > - use CGI.pm for both shell and web input, ignoring shift etc. Have them modify their graphing program to accept different ranges. Note that I will NOT use CGI.pm to output HTML. I think it doesn't separate logic from presentation enough. > - introduce BioPerl > - Use BioPerl and the graphing they learned earlier to count things about genomes they grab via SeqIO. Mostly I'm thinking of having them make histograms (or pie charts). Possible examples: > histogram showing distribution of CDS sizes > histogram showing distribution of # of introns > histogram showing distribution of lengths of 5' or 3' UTRs. > histogram showing CDS's using different start and stop codons and identities of aa 2. Ok, seems straightforward. 
Regarding using CGI.pm that's fine as an introduction, but a good number of users seem to be switching to something like Plack/PSGI (though this may take more effort in terms of training).

Also, OOP is a pretty big part of BioPerl, and seems to be what trips most new users up. How much detail do you anticipate adding for that?

> I will probably use some of the other stuff from the HOWTOs, and I have the pdf of the CSHL course notes. I also am thinking about having Perl write code to have paper.js draw things, but first I need to figure out how to do that myself.
>
> We will make all this stuff publicly available via a website soon. Feedback, suggestions, and ridicule are all welcome! I'd especially be interested if anyone has experience in using Subversion to have students turn in assignments by committing their branches.

Having this available publicly would be very nice! Can't give much feedback on using svn branches for homework unfortunately, but I can help out from my end in making updates as needed (both code and documentation). Let me know what you find.

> Jim
> =====================================
> Jim Hu
> Professor
> Dept. of Biochemistry and Biophysics
> 2128 TAMU
> Texas A&M Univ.
> College Station, TX 77843-2128
> 979-862-4054
>

chris

From hnorpois at googlemail.com Fri May 4 15:29:37 2012
From: hnorpois at googlemail.com (Hermann Norpois)
Date: Fri, 4 May 2012 21:29:37 +0200
Subject: [Bioperl-l] genomic coordinates always on the plus strand
Message-ID:

Hello,

in the tutorial http://www.bioperl.org/wiki/HOWTO:Getting_Genomic_Sequences there is a script that retrieves genomic coordinates (see below). I tested it with 14 geneIDs and always got coordinates on the "plus strand", meaning $from was always a lower number than $to. Principally this is nice but I was surprised. This means that all my genes are (by chance) on the plus strand or that there are 2 "coordinates" (one for the "plus" one for the "minus" strand).
Then it could be possible (theoretically and not very likely) that there are two genes for one $from/$to pair (one on the plus and one on the minus strand with the same coordinates with different IDs). I did not find anything about this issue in the documentation or in the archive. Could please anybody comment on this?

use strict;
use Bio::DB::EntrezGene;

my $id = shift or die "Id?\n"; # use a Gene id

my $db = new Bio::DB::EntrezGene;

my $seq = $db->get_Seq_by_id($id);

my $ac = $seq->annotation;

for my $ann ($ac->get_Annotations('dblink')) {
    if ($ann->database eq "Evidence Viewer") {
        # get the sequence identifier, the start, and the stop
        my ($contig,$from,$to) = $ann->url =~
            /contig=([^&]+).+from=(\d+)&to=(\d+)/;
        print "$contig\t$from\t$to\n";
    }
}

Thank you
Hermann Norpois

From jimhu at tamu.edu Fri May 4 16:02:40 2012
From: jimhu at tamu.edu (Jim Hu)
Date: Fri, 4 May 2012 15:02:40 -0500
Subject: [Bioperl-l] Teaching with BioPerl this summer
In-Reply-To:
References: <83B972F9-0612-467B-880D-D83852358CDB@tamu.edu>
Message-ID:

On May 4, 2012, at 2:17 PM, Fields, Christopher J wrote:

> On May 4, 2012, at 1:08 PM, Jim Hu wrote:
>
>> We (Rodolfo Aramayo and I) are going to teach an intensive summer course for undergrads at Texas A&M built around the idea of creating a learning community of students who will be able to help faculty with simple research and teaching projects involving computational biology. I think I've convinced Rodolfo that this should be built around Perl because of BioPerl and the fact that we know more helpful people in the BioPerl community than for other BioYourFavoriteOtherLanguage groups.
>>
>> Consider this a warning that this means I am likely to send a lot of questions to the list!
>
> No problem, we'll do our best to help.

I was counting on that!
> >> Anyway, in case anyone is interested my tentative plan (which has to become a real plan in a couple of weeks) is: >> >> - start with Hello World, but do it two ways: the usual STDOUT to shell way, plus a cgi-bin using Template Toolkit >> Actually, before we write hello.pl, I'm going to introduce perldoc. >> This will introduce the idea of instantiating objects and sending messages to them. >> - next have them use Perl to calculate a simple math function (factorials). Again, make a shell and cgi-bin version. For the cgi, use templates based on the RGraph javascript library to plot the data in an HTML5 canvas. I like the idea of using HTML5 instead of GD making png files based on not having to link to images in a tmp directory. >> This will introduce loops, arrays, and join >> - use CGI.pm for both shell and web input, ignoring shift etc. Have them modify their graphing program to accept different ranges. Note that I will NOT use CGI.pm to output HTML. I think it doesn't separate logic from presentation enough. >> - introduce BioPerl >> - Use BioPerl and the graphing they learned earlier to count things about genomes they grab via SeqIO. Mostly I'm thinking of having them make histograms (or pie charts). Possible examples: >> histogram showing distribution of CDS sizes >> histogram showing distribution of # of introns >> histogram showing distribution of lengths of 5' or 3' UTRs. >> histogram showing CDS's using different start and stop codons and identities of aa 2. > > Ok, seems straightforward. Regarding using CGI.pm that's fine as an introduction, but a good number of users seem to be switching to something like Plack/PSGI (though this may take more effort in terms of training). > > Also, OOP is a pretty big part of BioPerl, and seems to be what trips most new users up. How much detail do you anticipate adding for that? I am actually hoping to teach from an OOP perspective from the start. That's part of the benefit of using Template::Toolkit in object style. 
It gets to the idea of "programming to interface, not implementation". Similarly, we will have them use the object style programming for CGI and BioPerl. I'm not sure about getting to having them write their own object classes, but we could try that. > >> I will probably use some of the other stuff from the HOWTOs, and I have the pdf of the CSHL course notes. I also am thinking about having Perl write code to have paper.js draw things, but first I need to figure out how to do that myself. >> >> We will make all this stuff publicly available via a website soon. Feedback, suggestions, and ridicule are all welcome! I'd especially be interested if anyone has experience in using Subversion to have students turn in assignments by committing their branches. > > Having this available publicly would be very nice! Can't give much feedback on using svn branches for homework unfortunately, but I can help out from my end in making updates as needed (both code and documentation). Let me know what you find. > >> Jim >> ===================================== >> Jim Hu >> Professor >> Dept. of Biochemistry and Biophysics >> 2128 TAMU >> Texas A&M Univ. >> College Station, TX 77843-2128 >> 979-862-4054 >> > > > chris ===================================== Jim Hu Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054 From jimhu at tamu.edu Sat May 5 16:35:51 2012 From: jimhu at tamu.edu (Jim Hu) Date: Sat, 5 May 2012 15:35:51 -0500 Subject: [Bioperl-l] Teaching with BioPerl this summer In-Reply-To: References: <83B972F9-0612-467B-880D-D83852358CDB@tamu.edu> Message-ID: <39CBAD8E-E081-443A-A1A1-CE9F0B934934@tamu.edu> I am not a GD expert by any stretch, but my understanding is that a CGI using GD can return the image with an appropriate http header, such as: print "Content-type: image/png\n\n"; followed by outputting the image from the GD object. 
However, to embed the image in a web page with other content, I always did it by outputting to a tmp file and used an img tag to access the file. Is there another way to do this? I suppose that the img tag could link to another cgi that actually generates the image...

I guess having a second cgi in their library that uses GD::Graph would work the same way. But the RGraph library is really easy to use, and it will delegate the processing of the image to the client, which may be useful at some point in the future if we can actually get the students to help with online teaching tools that will be clicked on simultaneously by classes with hundreds of students.

Jim

On May 5, 2012, at 10:40 AM, Mike Williams wrote:

>
>
> On Fri, May 4, 2012 at 2:08 PM, Jim Hu wrote:
>
> Anyway, in case anyone is interested my tentative plan (which has to become a real plan in a couple of weeks) is:
>
> - next have them use Perl to calculate a simple math function (factorials). Again, make a shell and cgi-bin version. For the cgi, use templates based on the RGraph javascript library to plot the data in an HTML5 canvas. I like the idea of using HTML5 instead of GD making png files based on not having to link to images in a tmp directory.
>
> You can use GD to send an image directly to the browser without creating a file.
>
> Mike
>

=====================================
Jim Hu
Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054

From jimhu at tamu.edu Sat May 5 16:44:10 2012
From: jimhu at tamu.edu (Jim Hu)
Date: Sat, 5 May 2012 15:44:10 -0500
Subject: [Bioperl-l] genomic coordinates always on the plus strand
In-Reply-To:
References:
Message-ID: <31A039BE-8EC9-4482-9C43-622CAF033639@tamu.edu>

In BioPerl end (to) is always >= start (from), and the strand is indicated by strand. IIRC, there is a proposal for how to handle this for features that cross the origin in circular genomes, but it hasn't been implemented yet.
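
To make that convention concrete, here is a minimal sketch in plain Perl. The helper name normalize_coords is hypothetical and not part of BioPerl, and treating "from greater than to" as the minus strand is only an assumption about how some input data encode orientation. The point is that a location is stored as a (start, end, strand) triple with start <= end, so a from/to pair in which from is always the smaller number says nothing by itself about the strand.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): convert a raw from/to pair
# into a (start, end, strand) triple with start <= end.
# Assumption: input with from > to denotes a minus-strand feature.
sub normalize_coords {
    my ($from, $to) = @_;
    return $from <= $to
        ? ($from, $to,    1)    # already ordered: report plus strand
        : ($to,   $from, -1);   # swapped: report minus strand
}

# Two made-up coordinate pairs, one in each orientation.
for my $pair ([100, 250], [900, 400]) {
    my ($start, $end, $strand) = normalize_coords(@$pair);
    printf "start=%d end=%d strand=%+d\n", $start, $end, $strand;
}
```

If the input always has from < to, as in the coordinates parsed from the URL in the question, the strand has to come from a separate field; it cannot be recovered from the pair.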
See: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Range.html Jim Hu On May 4, 2012, at 2:29 PM, Hermann Norpois wrote: > Hello, > > in the tutorial > http://www.bioperl.org/wiki/HOWTO:Getting_Genomic_Sequencesthere is a > script that retrieves genomic coordinates (see below). I tested > it with 14 geneIDs and got always coordinates on "plus strand" meaning > $from was always a lower number than $to. Principally this is nice but I > was surprised. This means that all by genes are (by chance) on the plus > strand or that there are 2 "coordinates" (one for the "plus" one for the > "minus" strand). Then it could be possible (theoretically and not very > likely) that there are two genes for one $from/$to pair (one on the plus > and one on the minus strand with the same coordinates with different IDs). > I did not find anything about this issue in the documentation or in the > archive. Could please anybody comment on this? > > use strict;use Bio::DB::EntrezGene; > my $id = shift or die > "Id?\n"; # use a Gene id > my $db = new Bio::DB::EntrezGene; > my $seq = $db->get_Seq_by_id($id); > my $ac = $seq->annotation; > for my $ann ($ac->get_Annotations('dblink')) { > if ($ann->database eq "Evidence Viewer") { > # get the sequence identifier, the start, and the stop > my ($contig,$from,$to) = $ann->url =~ > /contig=([^&]+).+from=(\d+)&to=(\d+)/; > print "$contig\t$from\t$to\n"; > }} > > > Thank you > Hermann Norpois > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ===================================== Jim Hu Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. 
College Station, TX 77843-2128
979-862-4054

From jimhu at tamu.edu Sun May 6 14:40:17 2012
From: jimhu at tamu.edu (Jim Hu)
Date: Sun, 6 May 2012 13:40:17 -0500
Subject: [Bioperl-l] illustrating get_Seqfeatures vs get_all_SeqFeatures
Message-ID: <2DC14616-F54E-45D0-93B8-E0D89C42B61F@tamu.edu>

Is there a good example of a small genome record, such as a viral genome, where the difference between the flattened and unflattened versions can be examined? The Genbank records of the bacteriophages i like to use as examples are mostly flat to begin with.

Thanks,

Jim
=====================================
Jim Hu
Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054

From Russell.Smithies at agresearch.co.nz Sun May 6 16:20:49 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 7 May 2012 08:20:49 +1200
Subject: [Bioperl-l] get geneID for gene names
In-Reply-To:
References: <9D9805AD-8B53-4E51-81B9-00CB65F891AE@illinois.edu> <18DF7D20DFEC044098A1062202F5FFF34CCECE6738@exchsth.agresearch.co.nz>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CCECE6C86@exchsth.agresearch.co.nz>

Looks like it's the $name that has the trailing new-line and I suspect one cause might be your file of gene names is in Windows format.
General practise is to put a "chomp" in while doing reads to remove these.
I'd also recommend "use strict;" and "use warnings;" in your headers as it simplifies development and prevents simple mistakes creeping in.

Eg.

#!/bin/perl
use warnings;
use strict;
use Bio::DB::EUtilities;

open (OUT, "> geneID_list");
open (OUT2, "> genename_ID_list");

while (<>){
    chomp;
    my $name = $_;

If you have a lot of queries to make (i.e. >10,000) it might be easier to download the geneinfo list and grep the data out of that.
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/

--Russell

From: Hermann Norpois [mailto:hnorpois at googlemail.com]
Sent: Saturday, 5 May 2012 12:10 a.m.
To: Smithies, Russell
Cc: bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] get geneID for gene names

Thank you. I am very happy with -db `gene'. Originally I thought -db unists was less ambiguous.
I combined the suggestions. So my script is:

#!/bin/perl

use Bio::DB::EUtilities;

open (OUT, "> geneID_list");
open (OUT2, "> genename_ID_list");

while (<>)
{
    $name = $_;
    my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
                                           -db => 'gene',
                                           -term => "$name [Gene Name] AND Mus musculus [Organism]",
                                           -email => 'hnorpois at mpipsykl.mpg.de',
                                           );

    my @ids = $factory->get_ids;
    # print "$name\t@ids[0]\n";
    my $geneids = join(',',@ids); # For the case there is more than one ID.
    print "Fetching GENEID\t$geneids for GENE NAME\t$name\n";
    print OUT "$geneids\n";
    # my %name_id = ($name =>$geneids);
    print OUT2 "$geneids\t$name";
}

But there still is something I do not understand. It is not important but ... $geneids seems to include "\n". Because this is what I get on the screen:

Fetching GENEID 54161 for GENE NAME copg
Fetching GENEID 12064 for GENE NAME bdnf
Fetching GENEID 71661 for GENE NAME 0610005C13RIK
Fetching GENEID 382908 for GENE NAME LOC382908
Fetching GENEID 54633 for GENE NAME PQBP1
Fetching GENEID 258908 for GENE NAME MOR154-1

Thanks
Hermann Norpois

2012/5/3 Smithies, Russell >
If you're looking for gene information, why are you searching UniSTS?
Unless I've overlooked something, wouldn't it be more useful to search the "gene" database and tighten up your query a bit?

#!/bin/perl
use strict;
use warnings;
use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(
    -eutil      => 'esearch',
    -db         => 'gene',
    -term       => '(copg[Gene Name]) AND mouse[Organism]',
    -email      => 'hnorpois at mpipsykl.mpg.de',
    -usehistory => 'y'
);

my $hist = $factory->next_History || die "No history data returned";

$factory->set_parameters(
    -eutil   => 'efetch',
    -history => $hist
);

print $factory->get_Response->content;

--Russell

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Hermann Norpois
Sent: Thursday, 3 May 2012 9:01 a.m.
To: Fields, Christopher J
Cc: >
Subject: Re: [Bioperl-l] get geneID for gene names

Thank you very much. But there still is a problem. This is my output:
525211,210532,167498,142652
I get some ids (the first one is the UniSTS ID, the following ... I do not know) but there is no gene ID. If you compare to the following link:

http://www.ncbi.nlm.nih.gov/genome/sts/sts.cgi?uid=525211

The gene ID should be 54161 .

This is my (your) script:

#!/bin/perl -w

use Bio::DB::EUtilities;

my $name = "Copg";
my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
                                       -db => 'unists',
                                       -term => "$name AND Mus musculus [ORGN]",
                                       -email => 'hnorpois at mpipsykl.mpg.de',
                                       );

print join(',',$factory->get_ids)."\n";

2012/5/2 Fields, Christopher J >

> Also, a small but very significant bug is in the below. Can you spot it?
>
> The '-term' value is in single quotes, these need to be double-quotes
> to interpolate $name. Otherwise, it is literally looking for '$name'.
>
> chris
>
> On May 2, 2012, at 12:55 PM, Christopher Fields wrote:
>
> > Hermann,
> >
> > The below works for me (note I'm using esearch, not efetch). To actually
> > get the records you will use efetch and the IDs obtained below.
> >
> > chris
> >
> > ------------------------------
> > my $name = "Copg";
> > my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
> >                                        -db => 'unists',
> >                                        -term => '$name AND mouse [ORGN]',
> >                                        -email => '',
> >                                        );
> >
> > print join(',',$factory->get_ids)."\n";
> >
> >
> > On May 2, 2012, at 12:42 PM, Hermann Norpois wrote:
> >
> >> Hello,
> >>
> >> I wish to get gene IDs for gene names (e.g. bdnf, copg). I thought it was a
> >> good idea to use Bio::DB::EUtilities (see below) and addressed UNISTS as
> >> database because there it was quite easy to find the gene ID. So far I was
> >> unable to retrieve the gene ID from UNISTS. Could anybody give me a hint
> >> how to proceed? The cookbook ... Yes, I was trying.
> >>
> >> #!/bin/perl -w
> >>
> >> use Bio::DB::EUtilities;
> >>
> >> my $name = "Copg";
> >> my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch',
> >>                                        -db => 'unists',
> >>                                        -term => '$name AND mouse [ORGN]',
> >>                                        -email => 'hnorpois at mpipsykl.mpg.de'
> >>                                        )
> >>
> >>
> >> Thank you
> >> Hermann Norpois
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> >
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
=======================================================================
Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately.
=======================================================================

From Russell.Smithies at agresearch.co.nz Sun May 6 16:42:58 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 7 May 2012 08:42:58 +1200
Subject: [Bioperl-l] codon usage
In-Reply-To: <20120421025950.8579.qmail@f4mail-235-122.rediffmail.com>
References: <20120421025950.8579.qmail@f4mail-235-122.rediffmail.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CCECE6CAC@exchsth.agresearch.co.nz>

I'd be tempted to not use Perl but just use grep if all you need is a count of codons.
I suspect your code is going to be quite slow on large sequences with those nested loops.

--Russell

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of subarna thakur
Sent: Saturday, 21 April 2012 3:00 p.m.
To: bioperl-l at bioperl.org
Subject: [Bioperl-l] codon usage

I am writing a script for determining number of genes containing a particular codon. The codons are mentioned in a separate file. The output is coming all right for the first codon mentioned in the file but for the other codons , the script is not working. Please suggest the error in the script.
The script is as follows ----

#!/usr/bin/perl -w
use Bio::SeqIO;
$file2="table.txt";
$codon=0;
open OUT, ">out-test.txt" or die $!;
$seqio_obj = Bio::SeqIO->new( -file => "gopi2.txt" , '-format' => 'Fasta');
open( my $fh2, $file2 ) or die "$!";
while( my $line = <$fh2> ){
    $acc=$line;
    chomp $acc;
    while ($seq1= $seqio_obj->next_seq){
        my @output = $seq1->id;
        my $string = $seq1->seq;
        $v=0;
        $l= length($string);
        $t=$l/3;
        $k=0;
        for ($i=1; $i <= $t; $i++){
            @array2 = substr($string, $k, 3);
            $k=$k+3;
            foreach $value (@array2) {
                if ($value eq "$acc") {
                    print OUT " The sequence id is @output\n";
                    print OUT "$acc codon found in position $i\n\n";
                    $v=$v+1;
                }
            }
        }
        if ($v==0) { $h=0; }
        else { $h=1; }
        $codon=$codon+$h;
    }
    print OUT "Total number of sequences with $acc codon";
    print OUT "\t";
    print OUT $codon;
}
exit;

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
=======================================================================
Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately.
=======================================================================

From Russell.Smithies at agresearch.co.nz Sun May 6 16:52:55 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 7 May 2012 08:52:55 +1200
Subject: [Bioperl-l] codon usage
In-Reply-To: <20120421025950.8579.qmail@f4mail-235-122.rediffmail.com>
References: <20120421025950.8579.qmail@f4mail-235-122.rediffmail.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CCECE6CB7@exchsth.agresearch.co.nz>

Or use Bio::Tools::SeqStats (this is straight from the perldocs http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Tools/SeqStats.pm )

use Bio::PrimarySeq;
use Bio::Tools::SeqStats;

$seqobj = Bio::PrimarySeq->new(-seq => 'ACTGTGGCGTCAACTG',
                               -alphabet => 'dna',
                               -id => 'test');
$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);

$hash_ref = $seq_stats-> count_codons(); # for nucleic acid sequence
foreach $base (sort keys %$hash_ref) {
    # $hash_ref->{$base} is the count for that codon
    print "Number of codons of type ", $base, "= ",
          $hash_ref->{$base},"\n";
}

--Russell

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of subarna thakur
Sent: Saturday, 21 April 2012 3:00 p.m.
To: bioperl-l at bioperl.org
Subject: [Bioperl-l] codon usage

I am writing a script for determining number of genes containing a particular codon. The codons are mentioned in a separate file. The output is coming all right for the first codon mentioned in the file but for the other codons , the script is not working. Please suggest the error in the script.
The script is as follows ----

#!/usr/bin/perl -w
use Bio::SeqIO;
$file2="table.txt";
$codon=0;
open OUT, ">out-test.txt" or die $!;
$seqio_obj = Bio::SeqIO->new( -file => "gopi2.txt" , '-format' => 'Fasta');
open( my $fh2, $file2 ) or die "$!";
while( my $line = <$fh2> ){
    $acc=$line;
    chomp $acc;
    while ($seq1= $seqio_obj->next_seq){
        my @output = $seq1->id;
        my $string = $seq1->seq;
        $v=0;
        $l= length($string);
        $t=$l/3;
        $k=0;
        for ($i=1; $i <= $t; $i++){
            @array2 = substr($string, $k, 3);
            $k=$k+3;
            foreach $value (@array2) {
                if ($value eq "$acc") {
                    print OUT " The sequence id is @output\n";
                    print OUT "$acc codon found in position $i\n\n";
                    $v=$v+1;
                }
            }
        }
        if ($v==0) { $h=0; }
        else { $h=1; }
        $codon=$codon+$h;
    }
    print OUT "Total number of sequences with $acc codon";
    print OUT "\t";
    print OUT $codon;
}
exit;

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
=======================================================================
Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately.
=======================================================================

From Russell.Smithies at agresearch.co.nz Sun May 6 17:00:34 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 7 May 2012 09:00:34 +1200
Subject: [Bioperl-l] Teaching with BioPerl this summer
In-Reply-To: <39CBAD8E-E081-443A-A1A1-CE9F0B934934@tamu.edu>
References: <83B972F9-0612-467B-880D-D83852358CDB@tamu.edu> <39CBAD8E-E081-443A-A1A1-CE9F0B934934@tamu.edu>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CCECE6CC7@exchsth.agresearch.co.nz>

Several browsers support inline data for images, and I've used it before as a simple method of transferring data between servers, i.e. the alignments and images were created on one server, the html with embedded images was created, and the result was then ssh'ed to an external web server.
It works very well; take a look at http://en.wikipedia.org/wiki/Data_URI_scheme
eg. Red dot

Russell Smithies
Infrastructure Technician
T 03 489 9085
M 027 4734 600
E russell.smithies at agresearch.co.nz
Invermay Agricultural Centre
Puddle Alley, Private Bag 50034, Mosgiel 9053, New Zealand
T +64 3 489 3809  F +64 3 489 3739  www.agresearch.co.nz

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Jim Hu
Sent: Sunday, 6 May 2012 8:36 a.m.
To: Mike Williams
Cc: bioperl-l at portal.open-bio.org
Subject: Re: [Bioperl-l] Teaching with BioPerl this summer

I am not a GD expert by any stretch, but my understanding is that a CGI using GD can return the image with an appropriate http header, such as:

print "Content-type: image/png\n\n";

followed by outputting the image from the GD object. However, to embed the image in a web page with other content, I always did it by outputting to a tmp file and using an img tag to access the file. Is there another way to do this? I suppose that the img tag could link to another cgi that actually generates the image...
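The data URI approach mentioned in this thread can be sketched roughly as follows. This is only an illustration, not code from either poster: the sub name inline_png_tag is hypothetical, and it assumes the caller already holds raw PNG bytes (for example from GD's $image->png) and passes alt text that is safe to place in HTML.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Build an <img> tag that carries the image bytes inline as a data URI,
# so no temporary file or second CGI is needed.
# inline_png_tag is a hypothetical helper name used for illustration.
sub inline_png_tag {
    my ($png_bytes, $alt) = @_;
    # The empty second argument suppresses the line breaks that
    # encode_base64 inserts by default.
    my $b64 = encode_base64($png_bytes, '');
    return qq{<img alt="$alt" src="data:image/png;base64,$b64">};
}
```

The returned string can be printed as part of an ordinary text/html response. Large images make the page itself large, so this probably suits small plots best.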
I guess having a second cgi in their library that uses GD::Graph would work the same way. But the RGraph library is really easy to use, and it will delegate the processing of the image to the client, which may be useful at some point in the future if we can actually get the students to help with online teaching tools that will be clicked on simultaneously by classes with hundreds of students. Jim On May 5, 2012, at 10:40 AM, Mike Williams wrote: > > > On Fri, May 4, 2012 at 2:08 PM, Jim Hu wrote: > > Anyway, in case anyone is interested my tentative plan (which has to become a real plan in a couple of weeks) is: > > - next have them use Perl to calculate a simple math function (factorials). Again, make a shell and cgi-bin version. For the cgi, use templates based on the RGraph javascript library to plot the data in an HTML5 canvas. I like the idea of using HTML5 instead of GD making png files based on not having to link to images in a tmp directory. > > You can use GD to send an image directly to the browser without creating a file. > > Mike > ===================================== Jim Hu Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054 _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. 
=======================================================================

From ss2489 at cornell.edu Sun May 6 17:15:12 2012
From: ss2489 at cornell.edu (Surya Saha)
Date: Sun, 6 May 2012 17:15:12 -0400
Subject: [Bioperl-l] genomic coordinates always on the plus strand
In-Reply-To: <31A039BE-8EC9-4482-9C43-622CAF033639@tamu.edu>
References: <31A039BE-8EC9-4482-9C43-622CAF033639@tamu.edu>
Message-ID:

Hi Hermann,

To back up what Jim said: this convention is not specific to BioPerl. It also applies to all GFF files, the de facto file format for annotations. See http://gmod.org/wiki/GFF. Coordinates are always numbered according to the positive strand. If you have two genes that differ only in strand, then the GFF records will differ only in the value of the strand field.

Hope that helps.

-Surya

On Sat, May 5, 2012 at 4:44 PM, Jim Hu wrote:
> In BioPerl end (to) is always > start (from) and the strand is indicated by
> strand. IIRC, there is a proposal for how to handle this for features that
> cross the origin in circular genomes, but it hasn't been implemented yet.
> See:
>
> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Range.html
>
> Jim Hu
>
> On May 4, 2012, at 2:29 PM, Hermann Norpois wrote:
>
> > Hello,
> >
> > in the tutorial
> > http://www.bioperl.org/wiki/HOWTO:Getting_Genomic_Sequences there is a
> > script that retrieves genomic coordinates (see below). I tested
> > it with 14 geneIDs and got always coordinates on "plus strand" meaning
> > $from was always a lower number than $to. Principally this is nice but I
> > was surprised. This means that all my genes are (by chance) on the plus
> > strand or that there are 2 "coordinates" (one for the "plus" one for the
> > "minus" strand). Then it could be possible (theoretically and not very
> > likely) that there are two genes for one $from/$to pair (one on the plus
> > and one on the minus strand with the same coordinates with different
> IDs).
> > I did not find anything about this issue in the documentation or in the > > archive. Could please anybody comment on this? > > > > use strict;use Bio::DB::EntrezGene; > > my $id = shift or die > > "Id?\n"; # use a Gene id > > my $db = new Bio::DB::EntrezGene; > > my $seq = $db->get_Seq_by_id($id); > > my $ac = $seq->annotation; > > for my $ann ($ac->get_Annotations('dblink')) { > > if ($ann->database eq "Evidence Viewer") { > > # get the sequence identifier, the start, and the stop > > my ($contig,$from,$to) = $ann->url =~ > > /contig=([^&]+).+from=(\d+)&to=(\d+)/; > > print > "$contig\t$from\t$to\n"; > > }} > > > > > > Thank you > > Hermann Norpois > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ===================================== > Jim Hu > Professor > Dept. of Biochemistry and Biophysics > 2128 TAMU > Texas A&M Univ. > College Station, TX 77843-2128 > 979-862-4054 > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From jimhu at tamu.edu Mon May 7 11:28:01 2012 From: jimhu at tamu.edu (Jim Hu) Date: Mon, 7 May 2012 10:28:01 -0500 Subject: [Bioperl-l] codon usage In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF34CCECE6CB7@exchsth.agresearch.co.nz> References: <20120421025950.8579.qmail@f4mail-235-122.rediffmail.com> <18DF7D20DFEC044098A1062202F5FFF34CCECE6CB7@exchsth.agresearch.co.nz> Message-ID: <07E48B59-7A51-4D75-8D5A-6DD2D529623C@tamu.edu> I was looking at Bio::Tools::SeqStats. It reminds me of the very first bioinformatics program I worked on back when I was in grad school, sequencing was done by Maxam-Gilbert chemistry on gels with radioactive DNA (we were doing short reads, but we didn't call it that and the reads per run was something like 8). 
We were writing a program in Apple II basic to find restriction sites. Everyone in the group was doing this by putting the target sequence in a string variable and looking for the site as a substring match by whatever it was BASIC used. Loop over all the sites you are looking for. This strikes me as the equivalent of how Bio::Tools::SeqStats works, only with regexes. My roommate at the time, who was a math PhD student doing an MS in CompSci, pointed out to me that this would be more efficient using a discrete finite automaton algorithm, where each site we were looking for would be a state automaton. This has the advantage of being able to process the sequence as a stream. Back when we were working with computers with RAM measured in Kbytes, this was a big help. I'm not sure if it would be worth it today. The slow interpreted implementation of the state machines would likely lose to the fast internal implementation of the regex routines for sequence lengths we are looking at these days. But it might be interesting to compare. Jim On May 6, 2012, at 3:52 PM, Smithies, Russell wrote: > Or use Bio::Tools::SeqStats > (this is straight from the perldocs http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Tools/SeqStats.pm ) > > $seqobj = Bio::PrimarySeq->new(-seq => 'ACTGTGGCGTCAACTG',-alphabet => 'dna',-id => 'test'); > $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj); > > $hash_ref = $seq_stats-> count_codons(); # for nucleic acid sequence > foreach $base (sort keys %$hash_ref) { > print "Number of codons of type ", $base, "= ", %$hash_ref->{$base},"\n"; > } > > > --Russell > > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of subarna thakur > Sent: Saturday, 21 April 2012 3:00 p.m. > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] codon usage > > I am writing a script for determining number of genes containing a particular codon. 
The codons are mentioned in a separate file. The output is coming all right for the first codon mentioned in the file but for the other codons , the script is not working. Please suggest the error in the script. The script is as follows ---- > > #!/usr/bin/perl -w > > use Bio::SeqIO; > > $file2="table.txt"; > > $codon=0; > > open OUT, ">out-test.txt" or die $!; > > $seqio_obj = Bio::SeqIO->new( -file => "gopi2.txt" , '-format' => 'Fasta'); > > open( my $fh2, $file2 ) or die "$!"; > > while( my $line = <$fh2> ){ > > $acc=$line; > > chomp $acc; > > while ($seq1= $seqio_obj->next_seq){ > > my @output = $seq1->id; > > my $string = $seq1->seq; > > $v=0; > > $l= length($string); > > $t=$l/3; > > $k=0; > > for ($i=1; $i <= $t; $i++){ > > @array2 = substr($string, $k, 3); > > $k=$k+3; > > foreach $value (@array2) > > { > > if ($value eq "$acc") > > { > > print OUT " The sequence id is @output\n"; > > print OUT "$acc codon found in position $i\n\n"; > > $v=$v+1; > > } > > } > > } > > if ($v==0) > > { > > $h=0; > > } > > else > > { > > $h=1; > > } > > $codon=$codon+$h; > > } > > print OUT "Total number of sequences with $acc codon"; > > print OUT "\t"; > > print OUT $codon; > > } > > exit; > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. 
> ======================================================================= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ===================================== Jim Hu Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054 From cjfields at illinois.edu Mon May 7 12:28:45 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 7 May 2012 16:28:45 +0000 Subject: [Bioperl-l] codon usage In-Reply-To: <07E48B59-7A51-4D75-8D5A-6DD2D529623C@tamu.edu> References: <20120421025950.8579.qmail@f4mail-235-122.rediffmail.com> <18DF7D20DFEC044098A1062202F5FFF34CCECE6CB7@exchsth.agresearch.co.nz> <07E48B59-7A51-4D75-8D5A-6DD2D529623C@tamu.edu> Message-ID: <948BF022-A037-4A11-8FE7-8E217646866B@illinois.edu> On May 7, 2012, at 10:28 AM, Jim Hu wrote: > I was looking at Bio::Tools::SeqStats. It reminds me of the very first bioinformatics program I worked on back when I was in grad school, sequencing was done by Maxam-Gilbert chemistry on gels with radioactive DNA (we were doing short reads, but we didn't call it that and the reads per run was something like 8). We were writing a program in Apple II basic to find restriction sites. Everyone in the group was doing this by putting the target sequence in a string variable and looking for the site as a substring match by whatever it was BASIC used. Loop over all the sites you are looking for. This strikes me as the equivalent of how Bio::Tools::SeqStats works, only with regexes. Yes, I'm sure that was the simplest way to implement it to get things working, I'm guessing. > My roommate at the time, who was a math PhD student doing an MS in CompSci, pointed out to me that this would be more efficient using a discrete finite automaton algorithm, where each site we were looking for would be a state automaton. 
This has the advantage of being able to process the sequence as a stream. Back when we were working with computers with RAM measured in Kbytes, this was a big help. I'm not sure if it would be worth it today. The slow interpreted implementation of the state machines would likely lose to the fast internal implementation of the regex routines for sequence lengths we are looking at these days. > > But it might be interesting to compare. > > Jim Sure, we're always up for testing. Would you like to run this on a fork on github? Or we can probably set you up with commit access, as long as all this was confined to a branch. chris > On May 6, 2012, at 3:52 PM, Smithies, Russell wrote: > >> Or use Bio::Tools::SeqStats >> (this is straight from the perldocs http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Tools/SeqStats.pm ) >> >> $seqobj = Bio::PrimarySeq->new(-seq => 'ACTGTGGCGTCAACTG',-alphabet => 'dna',-id => 'test'); >> $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj); >> >> $hash_ref = $seq_stats-> count_codons(); # for nucleic acid sequence >> foreach $base (sort keys %$hash_ref) { >> print "Number of codons of type ", $base, "= ", %$hash_ref->{$base},"\n"; >> } >> >> >> --Russell >> >> >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of subarna thakur >> Sent: Saturday, 21 April 2012 3:00 p.m. >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] codon usage >> >> I am writing a script for determining number of genes containing a particular codon. The codons are mentioned in a separate file. The output is coming all right for the first codon mentioned in the file but for the other codons , the script is not working. Please suggest the error in the script. 
>> ======================================================================= >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ===================================== > Jim Hu > Professor > Dept. of Biochemistry and Biophysics > 2128 TAMU > Texas A&M Univ. > College Station, TX 77843-2128 > 979-862-4054 > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jimhu at tamu.edu Mon May 7 13:13:14 2012 From: jimhu at tamu.edu (Jim Hu) Date: Mon, 7 May 2012 12:13:14 -0500 Subject: [Bioperl-l] codon usage In-Reply-To: <948BF022-A037-4A11-8FE7-8E217646866B@illinois.edu> References: <20120421025950.8579.qmail@f4mail-235-122.rediffmail.com> <18DF7D20DFEC044098A1062202F5FFF34CCECE6CB7@exchsth.agresearch.co.nz> <07E48B59-7A51-4D75-8D5A-6DD2D529623C@tamu.edu> <948BF022-A037-4A11-8FE7-8E217646866B@illinois.edu> Message-ID: <8A27E44E-524C-4B7D-A919-1AA6B5091B29@tamu.edu> I wouldn't mind having a branch. I could also use it to share the files for our class. Not sure how soon I'd be able to get around to doing the alternative version of Bio::Tools::SeqStats... if I can find time to do some work on a BioPerl fork, I'd like to get the is_circular stuff working. Jim On May 7, 2012, at 11:28 AM, Fields, Christopher J wrote: > On May 7, 2012, at 10:28 AM, Jim Hu wrote: > >> I was looking at Bio::Tools::SeqStats. It reminds me of the very first bioinformatics program I worked on back when I was in grad school, sequencing was done by Maxam-Gilbert chemistry on gels with radioactive DNA (we were doing short reads, but we didn't call it that and the reads per run was something like 8). We were writing a program in Apple II basic to find restriction sites. 
Everyone in the group was doing this by putting the target sequence in a string variable and looking for the site as a substring match by whatever it was BASIC used. Loop over all the sites you are looking for. This strikes me as the equivalent of how Bio::Tools::SeqStats works, only with regexes. > > Yes, I'm sure that was the simplest way to implement it to get things working, I'm guessing. > >> My roommate at the time, who was a math PhD student doing an MS in CompSci, pointed out to me that this would be more efficient using a discrete finite automaton algorithm, where each site we were looking for would be a state automaton. This has the advantage of being able to process the sequence as a stream. Back when we were working with computers with RAM measured in Kbytes, this was a big help. I'm not sure if it would be worth it today. The slow interpreted implementation of the state machines would likely lose to the fast internal implementation of the regex routines for sequence lengths we are looking at these days. >> >> But it might be interesting to compare. >> >> Jim > > Sure, we're always up for testing. Would you like to run this on a fork on github? Or we can probably set you up with commit access, as long as all this was confined to a branch. 
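The streaming idea raised in this thread can be illustrated for in-frame codon counting. The following is a minimal sketch under stated assumptions, not how Bio::Tools::SeqStats is implemented and not a full finite automaton: the sub name make_codon_counter is hypothetical, and the input is assumed to be plain sequence text without newlines. Only the one or two leftover bases are kept between chunks, so the whole sequence never has to sit in memory.

```perl
use strict;
use warnings;

# Returns a feeder closure and a hash ref of counts. The feeder accepts
# sequence chunks of any length and tallies in-frame codons, carrying
# at most two unused bases over to the next call.
# make_codon_counter is a hypothetical name used for illustration.
sub make_codon_counter {
    my %count;
    my $carry = '';
    my $feed = sub {
        my ($chunk) = @_;
        my $buf    = $carry . uc $chunk;
        my $usable = length($buf) - length($buf) % 3;   # whole codons only
        my $frame  = substr($buf, 0, $usable);
        $count{$_}++ for unpack '(A3)*', $frame;        # split into triplets
        $carry = substr($buf, $usable);                  # 0, 1 or 2 bases
        return;
    };
    return ($feed, \%count);
}
```

A caller would invoke the feeder once per chunk read from a file or socket and inspect the hash at the end. Whether this beats a regex over the full string at today's sequence lengths would need benchmarking, as suggested above.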
> > chris > >> On May 6, 2012, at 3:52 PM, Smithies, Russell wrote: >> >>> Or use Bio::Tools::SeqStats >>> (this is straight from the perldocs http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Tools/SeqStats.pm ) >>> >>> $seqobj = Bio::PrimarySeq->new(-seq => 'ACTGTGGCGTCAACTG',-alphabet => 'dna',-id => 'test'); >>> $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj); >>> >>> $hash_ref = $seq_stats-> count_codons(); # for nucleic acid sequence >>> foreach $base (sort keys %$hash_ref) { >>> print "Number of codons of type ", $base, "= ", %$hash_ref->{$base},"\n"; >>> } >>> >>> >>> --Russell >>> >>> >>> -----Original Message----- >>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of subarna thakur >>> Sent: Saturday, 21 April 2012 3:00 p.m. >>> To: bioperl-l at bioperl.org >>> Subject: [Bioperl-l] codon usage >>> >>> I am writing a script for determining number of genes containing a particular codon. The codons are mentioned in a separate file. The output is coming all right for the first codon mentioned in the file but for the other codons , the script is not working. Please suggest the error in the script. 
>>> ======================================================================= >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> ===================================== >> Jim Hu >> Professor >> Dept. of Biochemistry and Biophysics >> 2128 TAMU >> Texas A&M Univ. >> College Station, TX 77843-2128 >> 979-862-4054 >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > ===================================== Jim Hu Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054 From jason.stajich at gmail.com Mon May 7 13:38:15 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Mon, 7 May 2012 10:38:15 -0700 Subject: [Bioperl-l] piping values into an existing GENBANK file In-Reply-To: <70DA93B804A15C4387B05DEF33BC255701A1CE149E36@MSIGI.stud.ad.uni-graz.at> References: <70DA93B804A15C4387B05DEF33BC255701A1CE149E36@MSIGI.stud.ad.uni-graz.at> Message-ID: <521B68D9-03B2-4A67-A2C8-6F89A9B74D97@gmail.com> your question is unclear, maybe you can show what you want the output to look like. are you trying to conditionally add a COGlist of info to only certain CDSes? Then you need to have a hash or a dataset that defines the CDS you want to add values to and then you need to interrogate each of the CDS features to get their name for example ($locus) = $feat->get_tag_values('locus_tag') and use that info to determine which features you will update. If you want to write it back out as genbank you would just initialize another SeqIO object that writes genbank and pass the sequence object to it - since the feature object is updated it will just be written out with the new info as attached to the sequence. 
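For the simpler case in the original question, where the i-th value belongs to the i-th CDS, the pairing can be sketched as below. This is an illustration under assumptions rather than a tested BioPerl recipe: the sub name tag_cds_in_order is hypothetical, and it relies only on the primary_tag and add_tag_value methods that BioPerl feature objects provide.

```perl
use strict;
use warnings;

# Attach the i-th value to the i-th CDS feature under the given tag.
# $features is an array ref of objects offering primary_tag() and
# add_tag_value(); non-CDS features are skipped without consuming a value.
# Returns the number of CDS features updated.
# tag_cds_in_order is a hypothetical name used for illustration.
sub tag_cds_in_order {
    my ($features, $values, $tag) = @_;
    my @queue   = @$values;   # copy, so the caller's list is left untouched
    my $updated = 0;
    for my $feat (@$features) {
        next unless $feat->primary_tag eq 'CDS';
        last unless @queue;   # stop once the values run out
        $feat->add_tag_value($tag, shift @queue);
        $updated++;
    }
    return $updated;
}
```

With BioPerl the features would come from [ $seq_object->get_SeqFeatures ], and the updated sequence object could then be passed to write_seq on a second Bio::SeqIO object opened with -format => 'genbank' and an output file such as '>out.gbk' (a placeholder name). Matching by locus_tag through a hash, as described above, is safer when the order of the list and the file might differ.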
jason On Apr 21, 2012, at 2:22 AM, Alavi, Mohammadali (0313xxx) wrote: > Hello All, > I have a GENBANK file already, to which I need to add some feauture. To be precise, I want to add the data (over the COG function) to the CDSs present in the GENBANK file. The data (COG functions) I need to add is included in an array in a manner that the first value is the value needed to be added to my first CDS in the GENBANK file, the second value needs to be added to the second CDS in the GENBANK file and so on. I tried to add the data in a tag/value style to the CDSs (as described in HOW TO:Feautures-Annotation provided by Biopel), which actually basically works. The Problem is though, I do not know how I could tell Perl/Bioperl to only take one single value at a time and add it in a tag/value style to a CDS and then take the next (and only the next) value and add it to the NEXT CDS and so on. Here is the code I used. As you see, using the for $item(@array) is not appropriate, since it adds all the values of my array to all CDSs! > So is there a way of piping in values one after another into CDSs one after another in a file using Bioperl?! or maybe how about another way of doing it in regular Perl? I would appreciate any help on that very much! > > > Bioperl I'm using: 1.6.1 > The Active Perl I'm using : 5.12.4 (on Windows Vista) > > > #!/bin/perl > use Bio::SeqIO; > use Bio::SeqFeature::Generic; > use warnings; > > @COGlist = qw(motility General metabolism nunknown); # think of this as the #array I would like to add the values of to my file, the real one has ofcourse #as many values as the number of CDSs in the GENBANK file > > > > $seqio_object = Bio::SeqIO -> new(-file => "file.gbk", -format => "genbank"); > $seq_object = $seqio_object -> next_seq; > for $feat_object ($seq_object -> get_SeqFeatures){ > for $item(@COGlist){ # this would add all elements of the array to all of CDSs and is therefore wrong! 
> $feat_object -> add_tag_value("note", $item); > } > > for $tags ($feat_object -> get_all_tags){ > print "tag:".$tags . "\n"; > for $values ($feat_object -> get_tag_values($tags)){ > print "value: " . $values . "\n"; # as one might imagine this does not give the output I have been looking for :-)) > } > } > } > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From hnorpois at googlemail.com Mon May 7 14:03:38 2012 From: hnorpois at googlemail.com (Hermann Norpois) Date: Mon, 7 May 2012 20:03:38 +0200 Subject: [Bioperl-l] (no subject) Message-ID: Hello, I wrote a script to retrieve promoter sequences from genbank (see below) that doesnt work in the get_stream modus. Input is an array (or list) of geneids. In my output file (test4.fasta) only the last header is followed by a sequence (see below). I guess I need a second stream but I do not know how to construct. Thank you very much for helping. 
Hermann *script: *#!/bin/perl -w use strict; use warnings; use Bio::DB::EntrezGene; use Bio::SeqIO; use Bio::DB::GenBank; my $seqio_obj = Bio::SeqIO->new(-file => ">test4.fasta", -format => 'fasta' ); # outputfile my $db = new Bio::DB::EntrezGene; my $seqio = $db->get_Stream_by_id([54161, 12064, 71661]); # Gene ids while ( my $seq = $seqio->next_seq ) { my $ac = $seq->annotation; for my $ann ($ac->get_Annotations('dblink')) { if ($ann->database eq "Evidence Viewer") { # get the sequence identifier, the start, and the stop my ($contig,$from,$to) = $ann->url =~ /contig=([^&]+).+from=(\d+)&to=(\d+)/; my $chr_start = $from-700; my $chr_stop = $from; my $strand = 1; my $gb = Bio::DB::GenBank->new(-format => 'fasta', -seq_start => $chr_start, -seq_stop => $chr_stop, -strand => $strand ); print "$contig\t$from\t$to\n\t$chr_start\t$chr_stop\n"; # For control: Printing coordinates my $obj = $gb->get_Seq_by_id($contig); $seqio_obj->write_seq($obj); } } } *test4.fasta* >gi|28485536:10326633-10326632 Mus musculus chromosome 2 genomic contig, strain C57BL/6J >gi|28526280:23304000-23303999 Mus musculus chromosome 6 genomic contig, strain C57BL/6J >gi|372099010:6466323-6467023 Mus musculus strain C57BL/6J chromosome 7 genomic contig, GRCm38 C57BL/6J MMCHR7_CTG5_2 CCCCTCCCCGCAGCTTGATTCCTATAAAAACCTGCCATTTTGGATGAATGTGCTGTTCGC CCTTGGCTCCTTTCTTGGTCCACTTGCCCTCTCTCTTCTCTCTCTCTCCTTTCACTTTCT... From maquino at knome.com Wed May 2 21:39:54 2012 From: maquino at knome.com (Mark Aquino) Date: Wed, 2 May 2012 21:39:54 -0400 Subject: [Bioperl-l] bioperl count sam reads Message-ID: Hi all, I'm a little stumped as to how to successfully count the depths of all reads at a specific locus in a sam/bam file. I know I can do this with GATK DepthOfCoverage but I wanted to do some more customized things with my script yet I haven't figured out how to get the right base. 
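A possible approach to the gap problem raised in this message, offered as a sketch rather than as a Bio::DB::Sam feature: walk the padded reference string and count only non-gap characters, so that gap columns do not shift the lookup. The sub name column_for_ref_pos is hypothetical, and the sketch assumes that the padded reference string starts at the alignment's start coordinate and uses '-' for gap columns.

```perl
use strict;
use warnings;

# Map a reference coordinate to a column of a padded alignment.
# $padded_ref : reference row of the padded alignment ('-' marks gap columns)
# $aln_start  : reference coordinate of the first non-gap reference base
# $target_pos : reference coordinate to look up
# Returns the 0-based column, or undef if the position is not covered.
# column_for_ref_pos is a hypothetical name used for illustration.
sub column_for_ref_pos {
    my ($padded_ref, $aln_start, $target_pos) = @_;
    my $ref_pos = $aln_start - 1;   # coordinate of the last reference base seen
    for my $col (0 .. length($padded_ref) - 1) {
        # A gap column holds no reference base, so the coordinate does not advance.
        next if substr($padded_ref, $col, 1) eq '-';
        $ref_pos++;
        return $col if $ref_pos == $target_pos;
    }
    return;
}
```

The returned column can then be used with substr on the padded query string to read the base (or '-' for a deletion) at that reference position. Because leading gap columns are skipped by the same loop, no separate offset calculation is needed.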
I was a bit surprised there wasn't (or that it's not well documented) a method to get the individuals read's base at a specific position while getting the $refbase is quite easy. (I'm betting such a method exists and is just not documented well) At any rate, gaps in the alignment are the cause for my problems, so if anyone knows a simpler way to do call the bases correctly, or a clever algorithm to deal with this issue, it would be much appreciated. Here's what I have for code and it works except in cases where there are multiple gaps in the reference sequence, e.g. the alignment below should be T-T here not C-C but is shifted due to the second gap. #!/progs/bin/perl use strict; use warnings; use Bio::DB::Sam; use Bio::DB::Bam::AlignWrapper; use Pod::Usage; use Getopt::Long; use Bio::DB::Bam::Pileup; use Term::ANSIColor; my $sam = Bio::DB::Sam->new(-bam =>$BAM, -fasta=> $FASTA); getBases($chr, $pos, $pos); sub getBases { my $print = 1; my ($chr, $start_query, $end_query) = @_; my @alignments = $sam->get_features_by_location(-seq_id => $chr, -start => $start_query, -end => $end_query); my $refbase; my ($a_count, $t_count, $g_count, $c_count, $n_count, $del_count, $ins_count) = (0, 0, 0, 0, 0, 0, 0); for my $a (@alignments) { my $start = $a->start; my $end = $a->end; my $query_start = $a->query->start; my $query_end = $a->query->end; my $ref_dna = $a->dna; # reference sequence bases my ($ref, $matches, $query) = $a->padded_alignment; my $offset = 0; if ($ref =~ /^([-]+)[ATCG]+/){ $offset = length($1); } #print "$offset\n"; $refbase = $sam->segment($chr,$start_query,$start_query)->dna; printAlignment($ref, $matches, $query, $start_query, $start, $offset); my $base = substr($query, $start_query-$start+$offset, 1); if (!$base){ next; } $a_count++ if ($base eq "A"); $t_count++ if ($base eq "T"); $c_count++ if ($base eq "C"); $g_count++ if ($base eq "G"); $n_count++ if ($base eq "N"); $del_count++ if ($base eq "-"); my @scores = $a->qscore; # per-base quality scores my 
$match_qual= $a->qual; # quality of the match } my $total_depth = $a_count + $t_count + $c_count + $g_count + $n_count + $del_count; if ($print == 1){ # print "$start_query\tref base: $ref_base\n"; print "$chr:$start_query($refbase)\t"; print "A:$a_count\t"; print "T:$t_count\t"; print "C:$c_count\t"; print "G:$g_count\t"; print "N:$n_count\t"; print "D:$del_count\t"; print "Total:$total_depth\n"; } return ($a_count, $t_count, $c_count, $g_count, $n_count, $del_count, $ins_count); } sub printAlignment{ my ($ref, $matches, $query, $start_query, $start, $offset) = @_; print substr($ref, 0, $start_query-$start+$offset); print (color("red"), substr($ref, $start_query-$start+$offset, 1), color("reset")); print substr($ref, $start_query-$start+$offset+1),"\n"; print substr($matches, 0, $start_query-$start+$offset); print (color("red"), substr($matches, $start_query-$start+$offset, 1), color("reset")); print substr($matches, $start_query-$start+$offset+1),"\n"; print substr($query, 0, $start_query-$start+$offset); print (color("red"), substr($query, $start_query-$start+$offset, 1), color("reset")); print substr($query, $start_query-$start+$offset+1),"\n"; } -------------- next part -------------- A non-text attachment was scrubbed... Name: Untitled.tiff Type: image/tiff Size: 6222 bytes Desc: not available URL: From scott at scottcain.net Mon May 7 15:01:21 2012 From: scott at scottcain.net (Scott Cain) Date: Mon, 7 May 2012 15:01:21 -0400 Subject: [Bioperl-l] bioperl count sam reads In-Reply-To: References: Message-ID: Hi Mark, I can't really help you with your current problem, but I would like to point out a few things: Bio::DB::Sam isn't actually part of BioPerl (though this mailing list is still a reasonable place to ask questions about it), and it was originally developed to support drawing graphical representations of BAM files in BioGraphics and thus GBrowse. 
As a result, it isn't surprising that some functionality may be missing from the perl wrapper, since it wasn't needed to create the graphical representation. If you would like to add functionality, we can accept patches :-) Scott On Wed, May 2, 2012 at 9:39 PM, Mark Aquino wrote: > Hi all, > > I'm a little stumped as to how to successfully count the depths of all > reads at a specific locus in a sam/bam file. I know I can do this with > GATK DepthOfCoverage but I wanted to do some more customized things with my > script yet I haven't figured out how to get the right base. I was a bit > surprised there wasn't (or that it's not well documented) a method to get > the individuals read's base at a specific position while getting the > $refbase is quite easy. (I'm betting such a method exists and is just not > documented well) > > At any rate, gaps in the alignment are the cause for my problems, so if > anyone knows a simpler way to do call the bases correctly, or a clever > algorithm to deal with this issue, it would be much appreciated. Here's > what I have for code and it works except in cases where there are multiple > gaps in the reference sequence, e.g. the alignment below should be T-T here > not C-C but is shifted due to the second gap. 
> > > #!/progs/bin/perl > use strict; > use warnings; > use Bio::DB::Sam; > use Bio::DB::Bam::AlignWrapper; > use Pod::Usage; > use Getopt::Long; > use Bio::DB::Bam::Pileup; > use Term::ANSIColor; > > my $sam = Bio::DB::Sam->new(-bam =>$BAM, > -fasta=> $FASTA); > getBases($chr, $pos, $pos); > > > sub getBases { > my $print = 1; > my ($chr, $start_query, $end_query) = @_; > my @alignments = $sam->get_features_by_location(-seq_id => $chr, > -start => $start_query, > -end => $end_query); > my $refbase; > my ($a_count, $t_count, $g_count, $c_count, $n_count, $del_count, > $ins_count) = (0, 0, 0, 0, 0, 0, 0); > for my $a (@alignments) { > > my $start = $a->start; > my $end = $a->end; > > my $query_start = $a->query->start; > my $query_end = $a->query->end; > my $ref_dna = $a->dna; # reference sequence bases > my ($ref, $matches, $query) = $a->padded_alignment; > my $offset = 0; > if ($ref =~ /^([-]+)[ATCG]+/){ > $offset = length($1); > } > #print "$offset\n"; > $refbase = $sam->segment($chr,$start_query,$start_query)->dna; > > printAlignment($ref, $matches, $query, $start_query, $start, > $offset); > my $base = substr($query, $start_query-$start+$offset, 1); > if (!$base){ > next; > } > $a_count++ if ($base eq "A"); > $t_count++ if ($base eq "T"); > $c_count++ if ($base eq "C"); > $g_count++ if ($base eq "G"); > $n_count++ if ($base eq "N"); > $del_count++ if ($base eq "-"); > my @scores = $a->qscore; # per-base quality scores > my $match_qual= $a->qual; # quality of the match > } > my $total_depth = $a_count + $t_count + $c_count + $g_count + $n_count > + $del_count; > if ($print == 1){ > # print "$start_query\tref base: $ref_base\n"; > print "$chr:$start_query($refbase)\t"; > print "A:$a_count\t"; > print "T:$t_count\t"; > print "C:$c_count\t"; > print "G:$g_count\t"; > print "N:$n_count\t"; > print "D:$del_count\t"; > print "Total:$total_depth\n"; > } > return ($a_count, $t_count, $c_count, $g_count, $n_count, $del_count, > $ins_count); > } > sub printAlignment{ > my 
($ref, $matches, $query, $start_query, $start, $offset) = @_; > print substr($ref, 0, $start_query-$start+$offset); > print (color("red"), substr($ref, $start_query-$start+$offset, 1), > color("reset")); > print substr($ref, $start_query-$start+$offset+1),"\n"; > print substr($matches, 0, $start_query-$start+$offset); > print (color("red"), substr($matches, $start_query-$start+$offset, 1), > color("reset")); > print substr($matches, $start_query-$start+$offset+1),"\n"; > print substr($query, 0, $start_query-$start+$offset); > print (color("red"), substr($query, $start_query-$start+$offset, 1), > color("reset")); > print substr($query, $start_query-$start+$offset+1),"\n"; > } > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- ------------------------------------------------------------------------ Scott Cain, Ph. D. scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/tiff Size: 6222 bytes Desc: not available URL: From j_martin at lbl.gov Mon May 7 15:15:46 2012 From: j_martin at lbl.gov (Joel Martin) Date: Mon, 7 May 2012 12:15:46 -0700 Subject: [Bioperl-l] codon usage In-Reply-To: <07E48B59-7A51-4D75-8D5A-6DD2D529623C@tamu.edu> References: <20120421025950.8579.qmail@f4mail-235-122.rediffmail.com> <18DF7D20DFEC044098A1062202F5FFF34CCECE6CB7@exchsth.agresearch.co.nz> <07E48B59-7A51-4D75-8D5A-6DD2D529623C@tamu.edu> Message-ID: I think you're under-appreciating the perl regex engine, it operates as a state machine not by matching all occurrences every time. Here is a nice excerpt from pro perl parsing http://www.devshed.com/c/a/Perl/Parsing-and-Regular-Expression-Basics/2/ Joel On Mon, May 7, 2012 at 8:28 AM, Jim Hu wrote: > I was looking at Bio::Tools::SeqStats. 
It reminds me of the very first bioinformatics program I worked on back when I was in grad school, sequencing was done by Maxam-Gilbert chemistry on gels with radioactive DNA (we were doing short reads, but we didn't call it that and the reads per run was something like 8). We were writing a program in Apple II basic to find restriction sites. Everyone in the group was doing this by putting the target sequence in a string variable and looking for the site as a substring match by whatever it was BASIC used. Loop over all the sites you are looking for. This strikes me as the equivalent of how Bio::Tools::SeqStats works, only with regexes. > > My roommate at the time, who was a math PhD student doing an MS in CompSci, pointed out to me that this would be more efficient using a discrete finite automaton algorithm, where each site we were looking for would be a state automaton. This has the advantage of being able to process the sequence as a stream. Back when we were working with computers with RAM measured in Kbytes, this was a big help. I'm not sure if it would be worth it today. The slow interpreted implementation of the state machines would likely lose to the fast internal implementation of the regex routines for sequence lengths we are looking at these days. > > But it might be interesting to compare. > > Jim > > On May 6, 2012, at 3:52 PM, Smithies, Russell wrote: > >> Or use Bio::Tools::SeqStats >> (this is straight from the perldocs http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Tools/SeqStats.pm ) >> >> $seqobj = Bio::PrimarySeq->new(-seq => 'ACTGTGGCGTCAACTG',-alphabet => 'dna',-id => 'test'); >> $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj); >> >> $hash_ref = $seq_stats-> count_codons(); # for nucleic acid sequence >> foreach $base (sort keys %$hash_ref) { >> print "Number of codons of type ", $base, "= ", %$hash_ref->{$base},"\n"; >> 
} >> >> >> --Russell >> >> >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of subarna thakur >> Sent: Saturday, 21 April 2012 3:00 p.m. >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] codon usage >> >> I am writing a script for determining number of genes containing a particular codon. The codons are mentioned in a separate file. The output is coming all right for the first codon mentioned in the file but for the other codons , the script is not working. Please suggest the error in the script. The script is as follows ---- >> >> #!/usr/bin/perl -w >> >> use Bio::SeqIO; >> >> $file2="table.txt"; >> >> $codon=0; >> >> open OUT, ">out-test.txt" or die $!; >> >> $seqio_obj = Bio::SeqIO->new( -file => "gopi2.txt" , '-format' => 'Fasta'); >> >> open( my $fh2, $file2 ) or die "$!"; >> >> while( my $line = <$fh2> ){ >> >> $acc=$line; >> >> chomp $acc; >> >> while ($seq1= $seqio_obj->next_seq){ >> >> my @output = $seq1->id; >> >> my $string = $seq1->seq; >> >> $v=0; >> >> $l= length($string); >> >> $t=$l/3; >> >> $k=0; >> >> for ($i=1; $i <= $t; $i++){ >> >> @array2 = substr($string, $k, 3); >> >> $k=$k+3; >> >> foreach $value (@array2) >> >> { >> >> if ($value eq "$acc") >> >> { >> >> print OUT " The sequence id is @output\n"; >> >> print OUT "$acc codon found in position $i\n\n"; >> >> $v=$v+1; >> >> } >> >> } >> >> } >> >> if ($v==0) >> >> { >> >> $h=0; >> >> } >> >> else >> >> { >> >> $h=1; >> >> } >> >> $codon=$codon+$h; >> >> } >> >> print OUT "Total number of sequences with $acc codon"; >> >> print OUT "\t"; >> >> print OUT $codon; >> >> } >> >> exit; >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> ======================================================================= >> Attention: The information contained in this message and/or attachments >> from 
AgResearch Limited is intended only for the persons or entities >> to which it is addressed and may contain confidential and/or privileged >> material. Any review, retransmission, dissemination or other use of, or >> taking of any action in reliance upon, this information by persons or >> entities other than the intended recipients is prohibited by AgResearch >> Limited. If you have received this message in error, please notify the >> sender immediately. >> ======================================================================= >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ===================================== > Jim Hu > Professor > Dept. of Biochemistry and Biophysics > 2128 TAMU > Texas A&M Univ. > College Station, TX 77843-2128 > 979-862-4054 > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From scott at scottcain.net Mon May 7 15:21:57 2012 From: scott at scottcain.net (Scott Cain) Date: Mon, 7 May 2012 15:21:57 -0400 Subject: [Bioperl-l] [Gmod-gbrowse] Gbrowse file uploads, bigwig and chromosome sizes files In-Reply-To: <1F4B23DC-2CD1-4D61-A6F4-D823B4C7C7D1@tamu.edu> References: <1F4B23DC-2CD1-4D61-A6F4-D823B4C7C7D1@tamu.edu> Message-ID: Hi Jim, In my test yeast database (created using the GFF from SGD), my locationlist table only has chromosomes in it. It's not clear to me why you would have genes in there too. Do you by any chance have the GFF that you created this database with? Scott On Mon, Apr 30, 2012 at 1:38 PM, Jim Hu wrote: > I'm not sure how many of our issues are gbrowse-specific vs. more general bioperl issues, so I'm cross-posting to both lists. > > We think we've traced our problems uploading wiggle files to our gbrowse to the failure to create the chromosome.size file. 
> > Short version: > - what is supposed to be in the locationlist? Chromosomes only or just genes? > - why does the chromosome sizes try to get everything in the locationlist, whether or not it's a chromosome? > > Long version: > > Our E. coli MG1655 database was loaded several years ago with > > bp_seqfeature_load.pl -d gb_MG1655_jh -f -c NC_000913.gb.gff NC_000913.gb.fasta -u -p > > The mysql database has 4,146 entries in the locationlist where the first one is for the chromosome and the others are named for genes. When we ask Gbrowse to generate the chromosome sizes file, instead of doing what I expect (look up the reference feature names), it tries to get the size of every feature in the locationlist. I can't actually find the fasta file I used. > > When this happens, the eval in Bio::Graphics::Browser2::DataLoader dies because it does not seem to be passing allow_aliases to this subroutine in Bio::DB::SeqFeature::Store::DBI::mysql > > > sub _name_sql { > my $self = shift; > my ($name,$allow_aliases,$join) = @_; > my $name_table = $self->_name_table; > > my $from = "$name_table as n"; > my ($match,$string) = $self->_match_sql($name); > > my $where = "n.id=$join AND n.name $match"; > $where 
.= " AND n.display_name>0" unless $allow_aliases; > return ($from,$where,'',$string); > } > > Here's the backtrace: > > CHROMOSOME SIZES at /usr/local/share/perl/5.10.1/Bio/DB/SeqFeature/Store/DBI/mysql.pm line 942, referer: > > Bio::DB::SeqFeature::Store::DBI::mysql::_name_sql('Bio::DB::SeqFeature::Store::DBI::mysql=HASH(0xb2bfed0)', 'b0001', undef, 'f.id') called at /usr/local/share/perl/5.10.1/Bio/DB/SeqFeature/Store/DBI/mysql.pm > > Bio::DB::SeqFeature::Store::DBI::mysql::_features('Bio::DB::SeqFeature::Store::DBI::mysql=HASH(0xb2bfed0)', '-name', 'b0001', '-class', undef, '-aliases', undef, > > Bio::DB::SeqFeature::Store::get_features_by_name('Bio::DB::SeqFeature::Store::DBI::mysql=HASH(0xb2bfed0)', '-name', 'b0001') called at /usr/local/share/perl/5.10.1/Bio/DB/SeqFeature/Store.pm line > > Bio::DB::SeqFeature::Store::segment('Bio::DB::SeqFeature::Store::DBI::mysql=HASH(0xb2bfed0)', 'b0001') called at /usr/local/lib/perl/5.10.1/Bio/Graphics/Browser2/DataLoader.pm line 171, > > eval {...} called at /usr/local/lib/perl/5.10.1/Bio/Graphics/Browser2/DataLoader.pm line 169, > > Bio::Graphics::Browser2::DataLoader::generate_chrom_sizes('Bio::Graphics::Browser2::DataLoader=HASH(0xb2bfbd0)', '/var/tmp/gbrowse2/chrom_sizes/MG1655.sizes') called at /usr/local/lib/perl/5.10.1/Bio/Graphics/Browser2/DataLoader.pm line 143, > > Bio::Graphics::Browser2::DataLoader::chrom_sizes('Bio::Graphics::Browser2::DataLoader=HASH(0xb2bfbd0)') called at /usr/local/lib/perl/5.10.1/Bio/Graphics/Browser2/Action.pm line 1117, referer: > > Bio::Graphics::Browser2::Action::ACTION_chrom_sizes('Bio::Graphics::Browser2::Action=REF(0xa993ea0)', 'CGI=HASH(0xaf57450)') called at /usr/local/lib/perl/5.10.1/Bio/Graphics/Browser2/Render.pm line 427, > > Bio::Graphics::Browser2::Render::asynchronous_event('Bio::Graphics::Browser2::Render::HTML=HASH(0xaf590d8)') called at /usr/local/lib/perl/5.10.1/Bio/Graphics/Browser2/Render.pm line 356, referer: > > 
Bio::Graphics::Browser2::Render::run_asynchronous_event('Bio::Graphics::Browser2::Render::HTML=HASH(0xaf590d8)') called at /usr/local/lib/perl/5.10.1/Bio/Graphics/Browser2/Render.pm line 274, referer: > > Bio::Graphics::Browser2::Render::run('Bio::Graphics::Browser2::Render::HTML=HASH(0xaf590d8)') called at /usr/lib/cgi-bin/gb2/gbrowse line 50, referer: > > > > ===================================== > Jim Hu > Professor > Dept. of Biochemistry and Biophysics > 2128 TAMU > Texas A&M Univ. > College Station, TX 77843-2128 > 979-862-4054 > > > > _______________________________________________ > Gmod-gbrowse mailing list > Gmod-gbrowse at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse -- ------------------------------------------------------------------------ Scott Cain, Ph. D. scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research From hnorpois at googlemail.com Mon May 7 16:45:24 2012 From: hnorpois at googlemail.com (Hermann Norpois) Date: Mon, 7 May 2012 22:45:24 +0200 Subject: [Bioperl-l] next_seq problem Message-ID: I apologise that I forgot to mention a subject. 2012/5/7 Hermann Norpois > Hello, > > I wrote a script to retrieve promoter sequences from genbank (see below) > that doesnt work in the get_stream modus. Input is an array (or list) of > geneids. In my output file (test4.fasta) only the last header is followed > by a sequence (see below). I guess I need a second stream but I do not know > how to construct. 
> > Thank you very much for helping. > > Hermann > > *script: > *#!/bin/perl -w > use strict; > use warnings; > use Bio::DB::EntrezGene; > use Bio::SeqIO; > use Bio::DB::GenBank; > > > my $seqio_obj = Bio::SeqIO->new(-file => ">test4.fasta", -format => 'fasta' > ); # outputfile > > my $db = new Bio::DB::EntrezGene; > > my $seqio = $db->get_Stream_by_id([54161, 12064, 71661]); # Gene ids > > while ( my $seq = $seqio->next_seq ) { > > my $ac = $seq->annotation; > > for my $ann ($ac->get_Annotations('dblink')) { > if ($ann->database eq "Evidence Viewer") { > # get the sequence identifier, the start, and the stop > my ($contig,$from,$to) = $ann->url =~ > /contig=([^&]+).+from=(\d+)&to=(\d+)/; > my $chr_start = $from-700; > my $chr_stop = $from; > my $strand = 1; > > my $gb = Bio::DB::GenBank->new(-format => 'fasta', > -seq_start => $chr_start, > -seq_stop => $chr_stop, > -strand => $strand > ); > print "$contig\t$from\t$to\n\t$chr_start\t$chr_stop\n"; # > For control: Printing coordinates > my $obj = $gb->get_Seq_by_id($contig); > > > > $seqio_obj->write_seq($obj); > > } > } > } > > *test4.fasta* > > >gi|28485536:10326633-10326632 Mus musculus chromosome 2 genomic contig, > strain C57BL/6J > > >gi|28526280:23304000-23303999 Mus musculus chromosome 6 genomic contig, > strain C57BL/6J > > >gi|372099010:6466323-6467023 Mus musculus strain C57BL/6J chromosome 7 > genomic contig, GRCm38 C57BL/6J MMCHR7_CTG5_2 > CCCCTCCCCGCAGCTTGATTCCTATAAAAACCTGCCATTTTGGATGAATGTGCTGTTCGC > CCTTGGCTCCTTTCTTGGTCCACTTGCCCTCTCTCTTCTCTCTCTCTCCTTTCACTTTCT... 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From jimhu at tamu.edu Mon May 7 19:08:36 2012 From: jimhu at tamu.edu (Jim Hu) Date: Mon, 7 May 2012 18:08:36 -0500 Subject: [Bioperl-l] codon usage In-Reply-To: References: <20120421025950.8579.qmail@f4mail-235-122.rediffmail.com> <18DF7D20DFEC044098A1062202F5FFF34CCECE6CB7@exchsth.agresearch.co.nz> <07E48B59-7A51-4D75-8D5A-6DD2D529623C@tamu.edu> Message-ID: Actually, I was looking at the wrong method for the topic of codon usage. count_codons doesn't use the regex; it just peels off triplets and uses a hash to track the counts. I was looking at count_monomers, which uses this: ... # For each letter, count the number of times it appears in # the sequence LETTER: foreach $element (@$alphabet) { # skip terminator symbol which may confuse regex next LETTER if $element eq '*'; $count{$element} = ( $seqstring =~ s/$element/$element/g); } if ($_is_instance) { $self->{'_monomer_count'} = \%count; # Save in case called again later } return \%count; In this case the theoretical problem is that the regex engine scans $seqstring repeatedly, so that even if the engine is using a state machine, it is repeating the scan of the sequence for each symbol in the alphabet. This is still likely to be way faster than an interpreted state machine, since the elements are just single characters and there are not many of them. As I said in the first email, a state machine written for the interpreter running a more efficient algorithm probably would still lose to a built-in engine running a less efficient algorithm. And being able to stream instead of holding the whole sequence in memory is not a concern now, as it was when we had 4K of RAM. 
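To make the comparison concrete, here is a minimal sketch of the two counting strategies in plain Perl. It is not the Bio::Tools::SeqStats source: the sub names count_codons_sketch and count_monomers_sketch are made up for illustration, the codon count assumes reading frame 0 and ignores a trailing partial codon, and the monomer count walks the string once instead of running one substitution per alphabet symbol.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count non-overlapping triplets in frame 0.
# Mirrors the "peel off triplets and track them in a hash" idea.
sub count_codons_sketch {
    my ($seq) = @_;
    my %count;
    $seq = uc $seq;
    # Step three characters at a time; a trailing 1- or 2-base
    # remainder is ignored.
    for (my $i = 0; $i + 3 <= length $seq; $i += 3) {
        $count{ substr($seq, $i, 3) }++;
    }
    return \%count;
}

# Hypothetical helper: count single letters in ONE pass over the
# string, rather than one regex scan per alphabet symbol.
sub count_monomers_sketch {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, uc $seq;
    return \%count;
}

my $test_seq = 'ACTGTGGCGTCAACTG';    # same toy sequence as the perldoc example

my $codons = count_codons_sketch($test_seq);
print "codon $_ = $codons->{$_}\n" for sort keys %$codons;

my $bases = count_monomers_sketch($test_seq);
print "base $_ = $bases->{$_}\n" for sort keys %$bases;
```

Whether the single pass actually beats the per-symbol substitutions would have to be measured (for example with the core Benchmark module); for a four-letter alphabet the built-in regex scans may well win.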
JH On May 7, 2012, at 2:15 PM, Joel Martin wrote: > I think you're under-appreciating the perl regex engine, it operates > as a state machine not by matching all occurrences every time. Here > is a nice excerpt from pro perl parsing > http://www.devshed.com/c/a/Perl/Parsing-and-Regular-Expression-Basics/2/ > > Joel > > On Mon, May 7, 2012 at 8:28 AM, Jim Hu wrote: >> I was looking at Bio::Tools::SeqStats. It reminds me of the very first bioinformatics program I worked on back when I was in grad school, sequencing was done by Maxam-Gilbert chemistry on gels with radioactive DNA (we were doing short reads, but we didn't call it that and the reads per run was something like 8). We were writing a program in Apple II basic to find restriction sites. Everyone in the group was doing this by putting the target sequence in a string variable and looking for the site as a substring match by whatever it was BASIC used. Loop over all the sites you are looking for. This strikes me as the equivalent of how Bio::Tools::SeqStats works, only with regexes. >> >> My roommate at the time, who was a math PhD student doing an MS in CompSci, pointed out to me that this would be more efficient using a discrete finite automaton algorithm, where each site we were looking for would be a state automaton. This has the advantage of being able to process the sequence as a stream. Back when we were working with computers with RAM measured in Kbytes, this was a big help. I'm not sure if it would be worth it today. The slow interpreted implementation of the state machines would likely lose to the fast internal implementation of the regex routines for sequence lengths we are looking at these days. >> >> But it might be interesting to compare. 
>> >> Jim >> >> On May 6, 2012, at 3:52 PM, Smithies, Russell wrote: >> >>> Or use Bio::Tools::SeqStats >>> (this is straight from the perldocs http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Tools/SeqStats.pm ) >>> >>> $seqobj = Bio::PrimarySeq->new(-seq => 'ACTGTGGCGTCAACTG',-alphabet => 'dna',-id => 'test'); >>> $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj); >>> >>> $hash_ref = $seq_stats-> count_codons(); # for nucleic acid sequence >>> foreach $base (sort keys %$hash_ref) { >>> print "Number of codons of type ", $base, "= ", %$hash_ref->{$base},"\n"; >>> } >>> >>> >>> --Russell >>> >>> >>> -----Original Message----- >>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of subarna thakur >>> Sent: Saturday, 21 April 2012 3:00 p.m. >>> To: bioperl-l at bioperl.org >>> Subject: [Bioperl-l] codon usage >>> >>> I am writing a script for determining number of genes containing a particular codon. The codons are mentioned in a separate file. The output is coming all right for the first codon mentioned in the file but for the other codons , the script is not working. Please suggest the error in the script. 
The script is as follows ---- >>> >>> #!/usr/bin/perl -w >>> >>> use Bio::SeqIO; >>> >>> $file2="table.txt"; >>> >>> $codon=0; >>> >>> open OUT, ">out-test.txt" or die $!; >>> >>> $seqio_obj = Bio::SeqIO->new( -file => "gopi2.txt" , '-format' => 'Fasta'); >>> >>> open( my $fh2, $file2 ) or die "$!"; >>> >>> while( my $line = <$fh2> ){ >>> >>> $acc=$line; >>> >>> chomp $acc; >>> >>> while ($seq1= $seqio_obj->next_seq){ >>> >>> my @output = $seq1->id; >>> >>> my $string = $seq1->seq; >>> >>> $v=0; >>> >>> $l= length($string); >>> >>> $t=$l/3; >>> >>> $k=0; >>> >>> for ($i=1; $i <= $t; $i++){ >>> >>> @array2 = substr($string, $k, 3); >>> >>> $k=$k+3; >>> >>> foreach $value (@array2) >>> >>> { >>> >>> if ($value eq "$acc") >>> >>> { >>> >>> print OUT " The sequence id is @output\n"; >>> >>> print OUT "$acc codon found in position $i\n\n"; >>> >>> $v=$v+1; >>> >>> } >>> >>> } >>> >>> } >>> >>> if ($v==0) >>> >>> { >>> >>> $h=0; >>> >>> } >>> >>> else >>> >>> { >>> >>> $h=1; >>> >>> } >>> >>> $codon=$codon+$h; >>> >>> } >>> >>> print OUT "Total number of sequences with $acc codon"; >>> >>> print OUT "\t"; >>> >>> print OUT $codon; >>> >>> } >>> >>> exit; >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> ======================================================================= >>> Attention: The information contained in this message and/or attachments >>> from AgResearch Limited is intended only for the persons or entities >>> to which it is addressed and may contain confidential and/or privileged >>> material. Any review, retransmission, dissemination or other use of, or >>> taking of any action in reliance upon, this information by persons or >>> entities other than the intended recipients is prohibited by AgResearch >>> Limited. If you have received this message in error, please notify the >>> sender immediately. 
>>> ======================================================================= >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> ===================================== >> Jim Hu >> Professor >> Dept. of Biochemistry and Biophysics >> 2128 TAMU >> Texas A&M Univ. >> College Station, TX 77843-2128 >> 979-862-4054 >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l ===================================== Jim Hu Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054 From hnorpois at googlemail.com Tue May 8 12:16:28 2012 From: hnorpois at googlemail.com (Hermann Norpois) Date: Tue, 8 May 2012 18:16:28 +0200 Subject: [Bioperl-l] some contigs do not work for sequence retrievel Message-ID: Hello, to get a sequence 5 prime upstream of TTS I wrote a script that works for some gene IDs but not for all. I always get a contig and coordinates. I have no idea why I do not get a sequence (I only get FASTA headers). The sequence ID should not matter as long as a contig is detected. Does anybody have an idea? 
Thanks Hermann Norpois #!/bin/perl -w use strict; use Bio::DB::EntrezGene; use Bio::SeqIO; use Bio::DB::GenBank; my $id = "12064"; #Works with geneid 18619 (Penk1) but not with 54161 (copg) or 12064 (bdnf) my $seqio_obj = Bio::SeqIO->new(-file => ">bdnf.fasta", -format => 'fasta' ); my $db = new Bio::DB::EntrezGene; my $seq = $db->get_Seq_by_id($id); my $ac = $seq->annotation; for my $ann ($ac->get_Annotations('dblink')) { if ($ann->database eq "Evidence Viewer") { # get the sequence identifier, the start, and the stop my ($contig,$from,$to) = $ann->url =~ /contig=([^&]+).+from=(\d+)&to=(\d+)/; my $chr_start = $from-700; my $chr_stop = $from; # my $strand = 1; print "CONTIG:\t$contig\tFROM\t$from\tTO\t$to\n\tFETCHING SEQUENCE FROM\t$chr_start\tTO\t$chr_stop\n"; # Control that something was detected. my $gb = Bio::DB::GenBank->new(-format => 'fasta', -seq_start => $chr_start, -seq_stop => $chr_stop, # -strand => $strand # -complexity => 1 ); # $gb->request_format('fasta'); my $obj = $gb->get_Seq_by_id($contig); $seqio_obj->write_seq($obj); } } From jason.stajich at gmail.com Tue May 8 13:02:14 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Tue, 8 May 2012 10:02:14 -0700 Subject: [Bioperl-l] illustrating get_Seqfeatures vs get_all_SeqFeatures In-Reply-To: <2DC14616-F54E-45D0-93B8-E0D89C42B61F@tamu.edu> References: <2DC14616-F54E-45D0-93B8-E0D89C42B61F@tamu.edu> Message-ID: <7CF80FF2-40F2-4FAC-9D00-4A25CCD8BD2C@gmail.com> Any eukaryote with introns, here is a scaffold from the fungus Fusarium graminearum. ftp://ftp.ncbi.nih.gov/genomes/Fungi/Gibberella_zeae_PH-1_uid243/NT_086522.gbk On May 6, 2012, at 11:40 AM, Jim Hu wrote: > Is there a good example of a small genome record, such as a viral genome, where the difference between the flattened and unflattened versions can be examined? The Genbank records of the bacteriophages i like to use as examples are mostly flat to begin with. 
> > Thanks, > > Jim > ===================================== > Jim Hu > Professor > Dept. of Biochemistry and Biophysics > 2128 TAMU > Texas A&M Univ. > College Station, TX 77843-2128 > 979-862-4054 > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From hrh at fmi.ch Tue May 8 12:47:16 2012 From: hrh at fmi.ch (Hans-Rudolf Hotz) Date: Tue, 8 May 2012 18:47:16 +0200 Subject: [Bioperl-l] some contigs do not work for sequence retrievel In-Reply-To: References: Message-ID: <4FA94E14.8070402@fmi.ch> Hi Hermann I can't give you the full answer, as I am not familiar enough with the inner works of the "Bio::DB::GenBank" module. However, as first idea, you might wanna check the NCBI annotation: for the "Evidence Viewer" (why are you using this link?): " 54161" links to: http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=10090&contig=NT_039353.1&gene=Copg&lid=54161&from=57085128&to=57110783 NT_039353.1 is no longer the current sequence version, see: http://www.ncbi.nlm.nih.gov/nuccore/NT_039353.1 "18619" links to: http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=10090&contig=NT_187032.1&gene=Penk&lid=18619&from=1083535&to=1088444 NT_187032.1 IS the current sequence version, see: http://www.ncbi.nlm.nih.gov/nuccore/NT_187032.1 maybe someone can jump in and explain, why in this particular case fetching of an old sequence version is not possible. It usually just works for me. Regards, Hans On 05/08/2012 06:16 PM, Hermann Norpois wrote: > Hello, > > for getting a sequence 5 prime upstream of TTS I wrote a script that works > for some geneids but not for all. I always get a contig and coordinates. I > do not have an idea why I do not get a sequence ( I only get fasta > headers). Actually the sequence ID should be out of importance if I see > that a contig is detected. Has anybody an idea? 
> > Thanks > Hermann Norpois > > > #!/bin/perl -w > use strict; > use Bio::DB::EntrezGene; > use Bio::SeqIO; > use Bio::DB::GenBank; > > my $id = "12064"; #Works with geneid 18619 (Penk1) but not with 54161 > (copg) or 12064 (bdnf) > > my $seqio_obj = Bio::SeqIO->new(-file => ">bdnf.fasta", -format => 'fasta' > ); > > my $db = new Bio::DB::EntrezGene; > > my $seq = $db->get_Seq_by_id($id); > > my $ac = $seq->annotation; > > for my $ann ($ac->get_Annotations('dblink')) { > if ($ann->database eq "Evidence Viewer") { > # get the sequence identifier, the start, and the stop > my ($contig,$from,$to) = $ann->url =~ > /contig=([^&]+).+from=(\d+)&to=(\d+)/; > my $chr_start = $from-700; > my $chr_stop = $from; > # my $strand = 1; > print "CONTIG:\t$contig\tFROM\t$from\tTO\t$to\n\tFETCHING > SEQUENCE FROM\t$chr_start\tTO\t$chr_stop\n"; # Control that something was > detected. > my $gb = Bio::DB::GenBank->new(-format => 'fasta', > -seq_start => $chr_start, > -seq_stop => $chr_stop, > # -strand => $strand > # -complexity => 1 > ); > # $gb->request_format('fasta'); > my $obj = $gb->get_Seq_by_id($contig); > > $seqio_obj->write_seq($obj); > > } > } > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Dallas.Thomas at AGR.GC.CA Tue May 8 16:04:00 2012 From: Dallas.Thomas at AGR.GC.CA (Thomas, Dallas) Date: Tue, 8 May 2012 14:04:00 -0600 Subject: [Bioperl-l] hmmer3 to hmmer2 Message-ID: Hello, I was wondering if you could use the updated Bio::SearchIO::hmmer to take as input the out file of hmmer3 and output its equivalent in hmmer2 format. Thanks Dallas From hnorpois at googlemail.com Tue May 8 16:30:47 2012 From: hnorpois at googlemail.com (Hermann Norpois) Date: Tue, 8 May 2012 22:30:47 +0200 Subject: [Bioperl-l] bioperl 1.6. and Perl API Message-ID: Hello, I installed bioperl 1.6.901-1 on ubuntu. Is it compatible with Perl API? 
Ensembl seems to prefer an older version: http://www.ensembl.org/info/docs/api/api_installation.html. I downloaded the four API packages and put them in /usr/share/perl5 (location of the already installes bio-perl-modules). If I start my testscript I get: Can't locate Bio/EnsEMBL/Registry.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.12.4 /usr/local/share/perl/5.12.4 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.12 /usr/share/perl/5.12 /usr/local/lib/site_perl .) But @INV contains: /etc/perl /usr/local/lib/perl/5.12.4 /usr/local/share/perl/5.12.4 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.12 /usr/share/perl/5.12 /usr/local/lib/site_perl So principally the module should be found. Thanks Hermann Norpois From cjfields at illinois.edu Tue May 8 16:37:49 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 8 May 2012 20:37:49 +0000 Subject: [Bioperl-l] hmmer3 to hmmer2 In-Reply-To: References: Message-ID: <335533C6-CA09-45AC-B0E4-9D4FB4D4A5E2@illinois.edu> Thomas, I highly doubt it. Most of the SearchIO parsers are very input-centric (in fact, there isn't a write_result method for SearchIO that I know of). You could possiby write one up, using the Bio::SearchIO::Writer modules as a model. Any reason why you need this? Seems a bit unusual. chris On May 8, 2012, at 3:04 PM, Thomas, Dallas wrote: > Hello, > > > > I was wondering if you could use the updated Bio::SearchIO::hmmer to > take as input the out file of hmmer3 and output its equivalent in hmmer2 > format. > > > > Thanks > > Dallas > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Tue May 8 16:45:47 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 8 May 2012 20:45:47 +0000 Subject: [Bioperl-l] bioperl 1.6. 
and Perl API In-Reply-To: References: Message-ID: The below error is from the Ensembl code, not from bioperl (they're two separate things). You need to install the Ensembl Perl API. Re:bioperl versions, there is an explanation out there somewhere lurking in the Ensembl mail lists, I can't recall the details, but from what I understand more recent bioperl versions work unless you intend using the code for setting up Ensembl locally and processing your own data. Otherwise if you are just accessing the ensembl database it should just work. chris On May 8, 2012, at 3:30 PM, Hermann Norpois wrote: > Hello, > > I installed bioperl 1.6.901-1 on ubuntu. Is it compatible with Perl API? > Ensembl seems to prefer an older version: > http://www.ensembl.org/info/docs/api/api_installation.html. I downloaded > the four API packages and put them in /usr/share/perl5 (location of the > already installes bio-perl-modules). > If I start my testscript I get: > > Can't locate Bio/EnsEMBL/Registry.pm in @INC (@INC contains: /etc/perl > /usr/local/lib/perl/5.12.4 /usr/local/share/perl/5.12.4 /usr/lib/perl5 > /usr/share/perl5 /usr/lib/perl/5.12 /usr/share/perl/5.12 > /usr/local/lib/site_perl .) > > > But @INV contains: > /etc/perl > /usr/local/lib/perl/5.12.4 > /usr/local/share/perl/5.12.4 > /usr/lib/perl5 > /usr/share/perl5 > /usr/lib/perl/5.12 > /usr/share/perl/5.12 > /usr/local/lib/site_perl > > So principally the module should be found. 
> > Thanks > Hermann Norpois > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From tristan.lefebure at gmail.com Wed May 9 11:23:18 2012 From: tristan.lefebure at gmail.com (Tristan Lefebure) Date: Wed, 09 May 2012 17:23:18 +0200 Subject: [Bioperl-l] Codon boostraping Message-ID: <2084056.yurtXCQ8PC@picodon> Hi there, Just submitted the following patch to do codon bootstrapping: https://redmine.open-bio.org/issues/3350 I'll appreciate your comments on this tiny proposed addition to Bio::Align::Utilities Thanks! -- Tristan Lefebure From p.j.a.cock at googlemail.com Wed May 9 13:44:37 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 9 May 2012 18:44:37 +0100 Subject: [Bioperl-l] BioPerl BuildBot Message-ID: Hi all, I've retitled this and sent it to the BioPerl list, continuing from this thread on the BioRuby list: http://lists.open-bio.org/pipermail/bioruby/2012-May/002247.html On Wed, May 9, 2012 at 6:35 PM, Pjotr Prins wrote: > On Wed, May 09, 2012 at 05:29:49PM +0000, Fields, Christopher J wrote: >> *sigh* >> >> Anyone know of a way I can clone myself a few times, so one of my clones can get bioperl set up on buildbot? :P > > Peter knows someone in Scotland who can help! Now I got to see a man > about a sheep... > > Pj. You mean Dolly The Sheep? ;) Tiago or I can assist on the BuilBot server side for BioPerl - in fact Tiago had already made a start (CC'd). We'll need help from a BioPerl developer with a spare machine or two to use as a buildslave (and I can probably borrow some of my employer's which are already nightly tests) to help with how we setup the BuildSlaves - essentially how to get BioPerl and relevant dependencies installed, and then what needs to be done from a fresh git checkout to build and run the tests. 
Tiago has got this currently: perl Build.PL --accepts ./Build test Once that is working on a single buildslave we can talk about different targets which is where BuildBot is really helpful (e.g. versions of Perl, different OS, etc) Peter From cjfields at illinois.edu Wed May 9 13:51:55 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 9 May 2012 17:51:55 +0000 Subject: [Bioperl-l] BioPerl BuildBot In-Reply-To: References: Message-ID: On May 9, 2012, at 12:44 PM, Peter Cock wrote: > Hi all, > > I've retitled this and sent it to the BioPerl list, continuing from > this thread on > the BioRuby list: > > http://lists.open-bio.org/pipermail/bioruby/2012-May/002247.html > > On Wed, May 9, 2012 at 6:35 PM, Pjotr Prins wrote: >> On Wed, May 09, 2012 at 05:29:49PM +0000, Fields, Christopher J wrote: >>> *sigh* >>> >>> Anyone know of a way I can clone myself a few times, so one of my clones can get bioperl set up on buildbot? :P >> >> Peter knows someone in Scotland who can help! Now I got to see a man >> about a sheep... >> >> Pj. > > You mean Dolly The Sheep? ;) > > Tiago or I can assist on the BuilBot server side for BioPerl - in fact Tiago > had already made a start (CC'd). > > We'll need help from a BioPerl developer with a spare machine or two > to use as a buildslave (and I can probably borrow some of my employer's > which are already nightly tests) to help with how we setup the BuildSlaves > - essentially how to get BioPerl and relevant dependencies installed, > and then what needs to be done from a fresh git checkout to build > and run the tests. Tiago has got this currently: > > perl Build.PL --accepts > ./Build test > > Once that is working on a single buildslave we can talk about different > targets which is where BuildBot is really helpful (e.g. versions of Perl, > different OS, etc) > > Peter Thanks Peter. Yes, if anyone has spare cycles it would be very nice to get this up and running, my hands are a bit full the next few months. 
The key thing we'll need to track are: 1) Adding the various bioperl distributions 2) How to add a new bioperl-related distribution chris From jason.stajich at gmail.com Wed May 9 15:52:43 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 9 May 2012 12:52:43 -0700 Subject: [Bioperl-l] Codon boostraping In-Reply-To: <2084056.yurtXCQ8PC@picodon> References: <2084056.yurtXCQ8PC@picodon> Message-ID: <016136AC-2472-4FA4-8C3C-E193B0D3264E@gmail.com> Tristan - it looks pretty good -- have you had any luck forking code on github? this way your patch can be rolled in with your name as the developer if you submit the changes as a pull request. http://help.github.com/fork-a-repo/ We can certainly commit things without this, but just trying to give you opportunity to be more closely involved if you would like. Jason On May 9, 2012, at 8:23 AM, Tristan Lefebure wrote: > Hi there, > > Just submitted the following patch to do codon bootstrapping: > > https://redmine.open-bio.org/issues/3350 > > I'll appreciate your comments on this tiny proposed addition to > Bio::Align::Utilities > > Thanks! > > -- > Tristan Lefebure > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From cjfields at illinois.edu Wed May 9 17:49:46 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 9 May 2012 21:49:46 +0000 Subject: [Bioperl-l] BioPerl BuildBot In-Reply-To: <20120509213932.GF31329@thebird.nl> References: <20120509213932.GF31329@thebird.nl> Message-ID: On May 9, 2012, at 4:39 PM, Pjotr Prins wrote: > On Wed, May 09, 2012 at 05:51:55PM +0000, Fields, Christopher J wrote: >> Thanks Peter. Yes, if anyone has spare cycles it would be very nice to get this up and running, my hands are a bit full the next few months. 
The key thing we'll need to track are: >> >> 1) Adding the various bioperl distributions >> 2) How to add a new bioperl-related distribution > > I always think that BioPerl has the largest community. No pair of > hands who can help Chris and Tiago set this up? > > Pj. Large, but very busy :) Let's give it a little time. chris From tiagoantao at gmail.com Wed May 9 14:54:41 2012 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 9 May 2012 18:54:41 +0000 Subject: [Bioperl-l] BioPerl BuildBot In-Reply-To: References: Message-ID: Hello from Dolly the Sheep (not in Scotland - a bit further South), I happen to have some time in my hands (not common), so if any of you BioPerl code monkeys wants to go ahead with this, I will gladly support. I can help configure the first runs and help the person(s) with the initial learning curve. Tiago On Wed, May 9, 2012 at 5:51 PM, Fields, Christopher J wrote: > On May 9, 2012, at 12:44 PM, Peter Cock wrote: > >> Hi all, >> >> I've retitled this and sent it to the BioPerl list, continuing from >> this thread on >> the BioRuby list: >> >> http://lists.open-bio.org/pipermail/bioruby/2012-May/002247.html >> >> On Wed, May 9, 2012 at 6:35 PM, Pjotr Prins wrote: >>> On Wed, May 09, 2012 at 05:29:49PM +0000, Fields, Christopher J wrote: >>>> *sigh* >>>> >>>> Anyone know of a way I can clone myself a few times, so one of my clones can get bioperl set up on buildbot? :P >>> >>> Peter knows someone in Scotland who can help! Now I got to see a man >>> about a sheep... >>> >>> Pj. >> >> You mean Dolly The Sheep? ;) >> >> Tiago or I can assist on the BuilBot server side for BioPerl - in fact Tiago >> had already made a start (CC'd). 
>> >> We'll need help from a BioPerl developer with a spare machine or two >> to use as a buildslave (and I can probably borrow some of my employer's >> which are already nightly tests) to help with how we setup the BuildSlaves >> - essentially how to get BioPerl and relevant dependencies installed, >> and then what needs to be done from a fresh git checkout to build >> and run the tests. Tiago has got this currently: >> >> perl Build.PL --accepts >> ./Build test >> >> Once that is working on a single buildslave we can talk about different >> targets which is where BuildBot is really helpful (e.g. versions of Perl, >> different OS, etc) >> >> Peter > > Thanks Peter. Yes, if anyone has spare cycles it would be very nice to get this up and running, my hands are a bit full the next few months. The key thing we'll need to track are: > > 1) Adding the various bioperl distributions > 2) How to add a new bioperl-related distribution > > chris > -- "Liberty for wolves is death to the lambs" - Isaiah Berlin From pjotr.public21 at thebird.nl Wed May 9 17:39:32 2012 From: pjotr.public21 at thebird.nl (Pjotr Prins) Date: Wed, 9 May 2012 23:39:32 +0200 Subject: [Bioperl-l] BioPerl BuildBot In-Reply-To: References: Message-ID: <20120509213932.GF31329@thebird.nl> On Wed, May 09, 2012 at 05:51:55PM +0000, Fields, Christopher J wrote: > Thanks Peter. Yes, if anyone has spare cycles it would be very nice to get this up and running, my hands are a bit full the next few months. The key thing we'll need to track are: > > 1) Adding the various bioperl distributions > 2) How to add a new bioperl-related distribution I always think that BioPerl has the largest community. No pair of hands who can help Chris and Tiago set this up? Pj.
From cjfields at illinois.edu Thu May 10 16:56:27 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 10 May 2012 20:56:27 +0000 Subject: [Bioperl-l] Bio::SeqIO::tab deletes gap characters when reading sequences, which is inconvenient In-Reply-To: <4F8E03FE.7000506@gmail.com> References: <4F8E03FE.7000506@gmail.com> Message-ID: Tim, This one got stuck in my drafts folder :P Easy enough to do. I've added this in to the master branch, commit eece9dd. chris On Apr 17, 2012, at 6:59 PM, Tim White wrote: > Hi, > > Bio::SeqIO::tab (what you get when specifying -format => 'tab' to Bio::SeqIO->new()) is perfect for converting sequences into a one-per-line format, so that standard line-oriented UNIX tools (grep, comm etc.) work as expected. Except... I just discovered that it deletes gap ("-") characters when reading sequences, so it can't be used to round-trip any files that contain these. This is a source of grief as I frequently work with FASTA files that contain aligned sequences, and thus gap characters. > > This is all because the next_seq() function in Bio::SeqIO::tab.pm contains the line: > > $seq =~ s/\W//g; > > which removes all non-alphanumeric characters from the sequence data. IMHO it would be *much* better if this was changed to: > > $seq =~ s/\s//g; > > which simply removes all whitespace characters (particularly including the \r that often appears at the ends of lines on text files that have visited Windows), enabling gap characters (and, for example, periods and asterisks) to be preserved.
Alternatively, you could simply get rid of this line of code and allow whitespace characters through. > > I'm not sure whether this counts as a "bug", as a cursory search didn't turn up any docs explaining precisely what characters are and aren't preserved by classes implementing Bio::SeqIO, but it's certainly inconsistent (at least Bio::SeqIO::fasta, and Bio::SeqIO::table, with columns and delimiters set up appropriately, allow round-tripping of files containing gap characters) as well as extremely inconvenient for me personally, and I suspect for others. Assuming no harm would be done by making the above change, what's the best thing to do to get this changed? I've simply edited my own local copy of tab.pm to make the above change, but obviously if others agree I'd like to get the change done upstream. > > Thanks, > Tim > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jason.stajich at gmail.com Fri May 11 15:07:25 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Fri, 11 May 2012 12:07:25 -0700 Subject: [Bioperl-l] small project Message-ID: <33072336-6D0E-4091-8F74-766CA23D3487@gmail.com> HMMER folks have contributed a module to BioJava to simplify submission of a protein sequence to the HMMER RESTful API http://xfam.wordpress.com/2012/05/09/pdb-pfam-mapping/ http://hmmer.janelia.org/help/api Perhaps a BioPerl similar module that wraps up the existing code to submit Bio::Seq objects similar to other Bio::DB:: or probably better to do Bio::Tools interfaces for interaction with remote webservices. 
The code to add this to bioperl is basically already written - the question would be if you wanted to populate a Bio::Search object with the XML results (Writing a parser for the XML) http://hmmer.janelia.org/help/api#sending Jason Jason Stajich jason.stajich at gmail.com jason at bioperl.org From carandraug+dev at gmail.com Sun May 13 09:58:57 2012 From: carandraug+dev at gmail.com (=?ISO-8859-1?Q?Carn=EB_Draug?=) Date: Sun, 13 May 2012 14:58:57 +0100 Subject: [Bioperl-l] Writing bp_grep Message-ID: Hi everyone I'm starting to write a grep tool for sequences (bp_grep). The idea is to have something just like grep but for DNA and protein sequences with most of the options that make sense in this context (print the filename or sequence name only, position, without match search, count, etc). I was wondering if anyone has any piece of code that could fit in it or started something similar but just never finished. Thanks, Carnë From jovel_juan at hotmail.com Sun May 13 10:20:30 2012 From: jovel_juan at hotmail.com (Juan Jovel) Date: Sun, 13 May 2012 14:20:30 +0000 Subject: [Bioperl-l] Writing bp_grep In-Reply-To: References: Message-ID: Hi Carnë, I do not have any incomplete code for the project you have in mind, but I encourage you to go ahead. It will be useful. Cheers, Juan > From: carandraug+dev at gmail.com > Date: Sun, 13 May 2012 14:58:57 +0100 > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Writing bp_grep > > Hi everyone > > I'm starting to write a grep tool for sequences (bp_grep). The idea is > to have something just like grep but for DNA and protein sequences > with most of the options that make sense in this context (print the > filename or sequence name only, position, without match search, count, > etc). I was wondering if anyone has any piece of code that could fit > in it or started something similar but just never finished. > > Thanks, > Carnë
> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From p.j.a.cock at googlemail.com Sun May 13 13:26:36 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 13 May 2012 18:26:36 +0100 Subject: [Bioperl-l] Writing bp_grep In-Reply-To: References: Message-ID: On Sunday, May 13, 2012, Carnë Draug wrote: > Hi everyone > > I'm starting to write a grep tool for sequences (bp_grep). The idea is > to have something just like grep but for DNA and protein sequences > with most of the options that make sense in this context (print the > filename or sequence name only, position, without match search, count, > etc). I was wondering if anyone has any piece of code that could fit > in it or started something similar but just never finished. > > Thanks, > Carnë > This sounds like EMBOSS preg, http://emboss.open-bio.org/wiki/Appdoc:Preg I thought they had a nucleotide equivalent but I can't see it right now.... Peter From hlapp at drycafe.net Sun May 13 13:36:39 2012 From: hlapp at drycafe.net (Hilmar Lapp) Date: Sun, 13 May 2012 13:36:39 -0400 Subject: [Bioperl-l] Writing bp_grep In-Reply-To: References: Message-ID: <129589EC-4E7C-4692-BC07-013734B22BFA@drycafe.net> And of course there's tacg (though I have never used it - not sure about its status): Mangalam HJ (2002) tacg - a grep for DNA. BMC Bioinformatics 3:8 doi:10.1186/1471-2105-3-8 -hilmar On May 13, 2012, at 1:26 PM, Peter Cock wrote: > On Sunday, May 13, 2012, Carnë Draug wrote: > >> Hi everyone >> >> I'm starting to write a grep tool for sequences (bp_grep). The idea is >> to have something just like grep but for DNA and protein sequences >> with most of the options that make sense in this context (print the >> filename or sequence name only, position, without match search, count, >> etc).
I was wondering if anyone has any piece of code that could fit >> in it or started something similar but just never finished. >> >> Thanks, >> Carnë >> > > This sounds like EMBOSS preg, > http://emboss.open-bio.org/wiki/Appdoc:Preg > > I thought they had a nucleotide equivalent but I can't see > it right now.... > > Peter > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From cjfields at illinois.edu Sun May 13 13:54:09 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sun, 13 May 2012 17:54:09 +0000 Subject: [Bioperl-l] Writing bp_grep In-Reply-To: <129589EC-4E7C-4692-BC07-013734B22BFA@drycafe.net> References: <129589EC-4E7C-4692-BC07-013734B22BFA@drycafe.net> Message-ID: Bio::Tools::SeqPattern could be used (and improved upon if needed) for pattern generation. chris On May 13, 2012, at 12:36 PM, Hilmar Lapp wrote: > And of course there's tacg (though I have never used it - not sure about its status): > > Mangalam HJ (2002) tacg - a grep for DNA. BMC Bioinformatics 3:8 > doi:10.1186/1471-2105-3-8 > > -hilmar > > On May 13, 2012, at 1:26 PM, Peter Cock wrote: > >> On Sunday, May 13, 2012, Carnë Draug wrote: >> >>> Hi everyone >>> >>> I'm starting to write a grep tool for sequences (bp_grep). The idea is >>> to have something just like grep but for DNA and protein sequences >>> with most of the options that make sense in this context (print the >>> filename or sequence name only, position, without match search, count, >>> etc). I was wondering if anyone has any piece of code that could fit >>> in it or started something similar but just never finished. >>> >>> Thanks, >>> Carnë
>>> >> >> This sounds like EMBOSS preg, >> http://emboss.open-bio.org/wiki/Appdoc:Preg >> >> I thought they had a nucleotide equivalent but I can't see >> it right now.... >> >> Peter >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Sun May 13 15:09:12 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sun, 13 May 2012 19:09:12 +0000 Subject: [Bioperl-l] REQUEST: Re: BioPerl BuildBot In-Reply-To: References: <20120509213932.GF31329@thebird.nl> Message-ID: On May 9, 2012, at 4:49 PM, Fields, Christopher J wrote: > On May 9, 2012, at 4:39 PM, Pjotr Prins wrote: > >> On Wed, May 09, 2012 at 05:51:55PM +0000, Fields, Christopher J wrote: >>> Thanks Peter. Yes, if anyone has spare cycles it would be very nice to get this up and running, my hands are a bit full the next few months. The key thing we'll need to track are: >>> >>> 1) Adding the various bioperl distributions >>> 2) How to add a new bioperl-related distribution >> >> I always think that BioPerl has the largest community. No pair of >> hands who can help Chris and Tiago set this up? >> >> Pj. > > Large, but very busy :) Let's give it a little time. > > chris Re-pinging in case this is missed. Is anyone interested in helping to get this set up? It is unlikely I can do this anytime soon. 
chris From l.m.timmermans at students.uu.nl Mon May 14 08:38:42 2012 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 14 May 2012 14:38:42 +0200 Subject: [Bioperl-l] small project In-Reply-To: <33072336-6D0E-4091-8F74-766CA23D3487@gmail.com> References: <33072336-6D0E-4091-8F74-766CA23D3487@gmail.com> Message-ID: On Fri, May 11, 2012 at 9:07 PM, Jason Stajich wrote: > HMMER folks have contributed a module to BioJava to simplify submission of a protein sequence to the HMMER RESTful API > http://xfam.wordpress.com/2012/05/09/pdb-pfam-mapping/ > http://hmmer.janelia.org/help/api > > > Perhaps a BioPerl similar module that wraps up the existing code to submit Bio::Seq objects similar to other Bio::DB:: or probably better to do Bio::Tools interfaces for interaction with remote webservices. The code to add this to bioperl is basically already written - the question would be if you wanted to populate a Bio::Search object with the XML results (Writing a parser for the XML) > > http://hmmer.janelia.org/help/api#sending > > Jason I rather like the Spore concept for this kind of issue. It's a framework for RESTful webservices that only requires you to write a description of a webservice (often in json or yaml form) and it will then generate the code to interact with it from that. A major advantage is that it's implemented for a range of programming languages (currently Perl, Python, Ruby, Lua, Clojure, Javascript). Thinking about this, a wider Bio-Spore project may be a good idea. Leon From Scott.Markel at accelrys.com Mon May 14 12:42:11 2012 From: Scott.Markel at accelrys.com (Scott Markel) Date: Mon, 14 May 2012 09:42:11 -0700 Subject: [Bioperl-l] Writing bp_grep In-Reply-To: References: Message-ID: <5ACBA19439E77B43A06F4CAB897EC977046696DE54@EXCH1-COLO.accelrys.net> The nucleotide equivalent is dreg. 
http://emboss.open-bio.org/wiki/Appdoc:Dreg Scott -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Peter Cock Sent: Sunday, 13 May 13 2012 10:27 AM To: Carnë Draug Cc: bioperl mailing list Subject: Re: [Bioperl-l] Writing bp_grep On Sunday, May 13, 2012, Carnë Draug wrote: > Hi everyone > > I'm starting to write a grep tool for sequences (bp_grep). The idea is > to have something just like grep but for DNA and protein sequences > with most of the options that make sense in this context (print the > filename or sequence name only, position, without match search, count, > etc). I was wondering if anyone has any piece of code that could fit > in it or started something similar but just never finished. > > Thanks, > Carnë > This sounds like EMBOSS preg, http://emboss.open-bio.org/wiki/Appdoc:Preg I thought they had a nucleotide equivalent but I can't see it right now.... Peter _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From carandraug+dev at gmail.com Tue May 15 07:57:20 2012 From: carandraug+dev at gmail.com (=?ISO-8859-1?Q?Carn=EB_Draug?=) Date: Tue, 15 May 2012 12:57:20 +0100 Subject: [Bioperl-l] Writing bp_grep In-Reply-To: <5ACBA19439E77B43A06F4CAB897EC977046696DE54@EXCH1-COLO.accelrys.net> References: <5ACBA19439E77B43A06F4CAB897EC977046696DE54@EXCH1-COLO.accelrys.net> Message-ID: Well, with EMBOSS's dreg and preg I don't think I'll bother to implement it again then. It's not exactly what I had in mind to write but it's close enough that sounds like a waste of time to write it. Thanks for the links, Carnë
From jimhu at tamu.edu Fri May 18 01:07:14 2012 From: jimhu at tamu.edu (Jim Hu) Date: Fri, 18 May 2012 00:07:14 -0500 Subject: [Bioperl-l] Bio::Seq->subseq documentation Message-ID: In the page for Bio::Seq, http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html#POD5 I think the usage should match the documentation for Bio::PrimarySeq http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/PrimarySeq.html#POD4 indicating that the arguments can be integers OR location objects. Is that correct? Jim ===================================== Jim Hu Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054 From j_martin at lbl.gov Fri May 18 10:49:59 2012 From: j_martin at lbl.gov (Joel Martin) Date: Fri, 18 May 2012 07:49:59 -0700 Subject: [Bioperl-l] Bio::Seq->subseq documentation In-Reply-To: References: Message-ID: the documentation for Bio::Seq looks out of sync to me, the code for Bio::Seq is just sub subseq { return shift->primary_seq()->subseq(@_); } so, it takes what Bio::PrimarySeq takes. Joel On Thu, May 17, 2012 at 10:07 PM, Jim Hu wrote: > In the page for Bio::Seq, > > http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html#POD5 > > I think the usage should match the documentation for Bio::PrimarySeq > > http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/PrimarySeq.html#POD4 > > indicating that the arguments can be integers OR location objects. Is that correct? > > Jim > ===================================== > Jim Hu > Professor > Dept. of Biochemistry and Biophysics > 2128 TAMU > Texas A&M Univ.
> College Station, TX 77843-2128 > 979-862-4054 > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jimhu at tamu.edu Fri May 18 16:56:50 2012 From: jimhu at tamu.edu (Jim Hu) Date: Fri, 18 May 2012 15:56:50 -0500 Subject: [Bioperl-l] Bio::Seq->subseq documentation In-Reply-To: References: Message-ID: <3AE41ED1-9CB6-432C-ABFC-B01E6B30C1A8@tamu.edu> Is the doc generated from bioperl-live or from the last stable version? If the former, I suppose I could change it in git (or ask Nathan in my lab to do it). On May 18, 2012, at 9:49 AM, Joel Martin wrote: > the documentation for Bio::Seq looks out of sync to me, the code for > Bio::Seq is just > sub subseq { > return shift->primary_seq()->subseq(@_); > } > > so, it takes what Bio::PrimarySeq takes. > > Joel > > On Thu, May 17, 2012 at 10:07 PM, Jim Hu wrote: >> In the page for Bio::Seq, >> >> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html#POD5 >> >> I think the usage should match the documentation for Bio::PrimarySeq >> >> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/PrimarySeq.html#POD4 >> >> indicating that the arguments can be integers OR location objects. Is that correct? >> >> Jim >> ===================================== >> Jim Hu >> Professor >> Dept. of Biochemistry and Biophysics >> 2128 TAMU >> Texas A&M Univ. >> College Station, TX 77843-2128 >> 979-862-4054 >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l ===================================== Jim Hu Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. 
College Station, TX 77843-2128 979-862-4054 From yang.liu0508 at gmail.com Sat May 19 10:34:04 2012 From: yang.liu0508 at gmail.com (yang liu) Date: Sat, 19 May 2012 10:34:04 -0400 Subject: [Bioperl-l] modify sequence names Message-ID: Dear colleagues, Would anyone please help me to modify sequence names with bioperl? I am editing them manually now, is there a easier way? I have a bunch of sequences in the format: >lcl|NC_017840.1_cdsid_YP_006280919.1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=YP_006280919.1] [location=1..1584] ATGACAAATCCGGTCCGATGGCTGTTCTCCACTAACCACAAGGATATAGGTACTCTATATTTCATCTTCG GTGCCATTGCTGGAGTGATGGGCACATGCTTCTCAGTACTGATTCGTATGGAATTAGCACGACCCGGCGA TCAAATTCTTGGTGGGAATCATCAACTTTATAATGTTTTAATAACGGCTCACGCTTTTTTAATGATCTTT >lcl|NC_017840.1_cdsid_YP_006280920.1 [gene=ccmFn] [protein=cytochrome c biogenesis FN] [protein_id=YP_006280920.1] [location=2225..3940] ATGTCAATAAATGCATTTTCTCATTATTCGTTCTTTCCGGGTCTTTTCGTTGCATTCACTTACAACAAGA AAGAACCACCAGCGTTTGGTGCAGCCCCTGCATTTTGGTGCATTCTTCTTTCTTTCCTTGGTCTTTCGTT CCGTCATATTCCTAATAACTTATCCAATTACAGCGTATTAACCGCTAATGCACCTTTCTTTTATCAAATC I hope to keep only the gene name, which means the word behind "gene=", like: >cox1 ATGACAAATCCGGTCCGATGGCTGTTCTCCACTAACCACAAGGATATAGGTACTCTATATTTCATCTTCG GTGCCATTGCTGGAGTGATGGGCACATGCTTCTCAGTACTGATTCGTATGGAATTAGCACGACCCGGCGA TCAAATTCTTGGTGGGAATCATCAACTTTATAATGTTTTAATAACGGCTCACGCTTTTTTAATGATCTTT >ccmFn ATGTCAATAAATGCATTTTCTCATTATTCGTTCTTTCCGGGTCTTTTCGTTGCATTCACTTACAACAAGA AAGAACCACCAGCGTTTGGTGCAGCCCCTGCATTTTGGTGCATTCTTCTTTCTTTCCTTGGTCTTTCGTT CCGTCATATTCCTAATAACTTATCCAATTACAGCGTATTAACCGCTAATGCACCTTTCTTTTATCAAATC Any help would be appreciated. Thanks, Yang. 
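A minimal sketch of the renaming described above, in plain Perl without BioPerl. It assumes every header to be renamed carries a [gene=...] tag; headers without one are passed through unchanged. The function name rename_fasta_headers and the shortened sample records are illustrative only and do not come from any library.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): rewrite FASTA header lines so
# that only the value of the [gene=...] tag remains. Headers without such a
# tag and all sequence lines are returned unchanged.
sub rename_fasta_headers {
    my ($fasta) = @_;
    my @out;
    for my $line ( split /\n/, $fasta ) {
        if ( $line =~ /^>/ && $line =~ /\[gene=([^\]]+)\]/ ) {
            push @out, ">$1";    # keep only the gene name
        }
        else {
            push @out, $line;    # sequence line or untagged header
        }
    }
    return join( "\n", @out ) . "\n";
}

# Illustrative input, shortened from the records in the question.
my $example = <<'END';
>lcl|NC_017840.1_cdsid_YP_006280919.1 [gene=cox1] [protein=cytochrome c oxidase subunit 1] [protein_id=YP_006280919.1] [location=1..1584]
ATGACAAATCCGGTCCGATGG
>lcl|NC_017840.1_cdsid_YP_006280920.1 [gene=ccmFn] [protein=cytochrome c biogenesis FN] [protein_id=YP_006280920.1] [location=2225..3940]
ATGTCAATAAATGCATTTTCT
END

print rename_fasta_headers($example);
```

Gene names are not necessarily unique within a file, so the shortened headers may collide; if that matters, the accession could be kept alongside the gene name.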
From asjo at koldfront.dk Sat May 19 11:13:03 2012 From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=) Date: Sat, 19 May 2012 17:13:03 +0200 Subject: [Bioperl-l] modify sequence names In-Reply-To: (yang liu's message of "Sat, 19 May 2012 10:34:04 -0400") References: Message-ID: <87ehqgyzk0.fsf@topper.koldfront.dk> On Sat, 19 May 2012 10:34:04 -0400, yang wrote: > Would anyone please help me to modify sequence names with bioperl? I am > editing them manually now, is there a easier way? You don't need BioPerl specifically to do simple text manipulation. >> lcl|NC_017840.1_cdsid_YP_006280919.1 [gene=cox1] [protein=cytochrome >> coxidase subunit 1] [protein_id=YP_006280919.1] [location=1..1584] [... to ...] >> cox1 Maybe you can use something like: $ sed 's/^>.*\[gene=\([^]]*\)\].*$/\1/g' >lcl|NC_017840.1_cdsid_YP_006280919.1 [gene=cox1] [protein=cytochrome coxidase subunit 1] [protein_id=YP_006280919.1] [location=1..1584] ATGACAAATCCGGTCCGATGGCTGTTCTCCACTAACCACAAGGATATAGGTACTCTATATTTCATCTTCG GTGCCATTGCTGGAGTGATGGGCACATGCTTCTCAGTACTGATTCGTATGGAATTAGCACGACCCGGCGA TCAAATTCTTGGTGGGAATCATCAACTTTATAATGTTTTAATAACGGCTCACGCTTTTTTAATGATCTTT cox1 ATGACAAATCCGGTCCGATGGCTGTTCTCCACTAACCACAAGGATATAGGTACTCTATATTTCATCTTCG GTGCCATTGCTGGAGTGATGGGCACATGCTTCTCAGTACTGATTCGTATGGAATTAGCACGACCCGGCGA TCAAATTCTTGGTGGGAATCATCAACTTTATAATGTTTTAATAACGGCTCACGCTTTTTTAATGATCTTT $ If you need to use Perl rather than sed, you can use: $ perl -pe 's/^>.*\[gene=([^]]+).*$/>$1/' instead. The easiest way is probably to learn a little programming and/or regular expressions. Learning Perl by Randal L. Schwartz, brian d foy, and Tom Phoenix could be a starting point, so could many online tutorials. 
Best regards,

  Adam

--
 "Hur långt man än har kommit                        Adam Sjøgren
  är det alltid längre kvar"                     asjo at koldfront.dk

From asjo at koldfront.dk Sat May 19 11:53:23 2012
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Sat, 19 May 2012 17:53:23 +0200
Subject: [Bioperl-l] modify sequence names
In-Reply-To: <87ehqgyzk0.fsf@topper.koldfront.dk> ("Adam Sjøgren"'s message of "Sat, 19 May 2012 17:13:03 +0200")
References: <87ehqgyzk0.fsf@topper.koldfront.dk>
Message-ID: <87aa14yxos.fsf@topper.koldfront.dk>

On Sat, 19 May 2012 17:13:03 +0200, Adam wrote:

> $ sed 's/^>.*\[gene=\([^]]*\)\].*$/\1/g'

And there I forgot the '>'; sorry, it should read:

  $ sed 's/^>.*\[gene=\([^]]*\)\].*$/>\1/g'

Best regards,

  Adam

--
 "Hur långt man än har kommit                        Adam Sjøgren
  är det alltid längre kvar"                     asjo at koldfront.dk

From Russell.Smithies at agresearch.co.nz Sun May 20 16:57:24 2012
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 21 May 2012 08:57:24 +1200
Subject: [Bioperl-l] modify sequence names
In-Reply-To: <87ehqgyzk0.fsf@topper.koldfront.dk>
References: <87ehqgyzk0.fsf@topper.koldfront.dk>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF34CCEE5BAB6@exchsth.agresearch.co.nz>

Or a Perl inline replace - saves on temp files.

perl -npi -e 's/^>.*\[gene=([^]]+).*$/>$1/'

--Russell

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Adam Sjøgren
Sent: Sunday, 20 May 2012 3:13 a.m.
To: bioperl-l at bioperl.org
Subject: Re: [Bioperl-l] modify sequence names

On Sat, 19 May 2012 10:34:04 -0400, yang wrote:

> Would anyone please help me to modify sequence names with bioperl? I
> am editing them manually now, is there a easier way?

You don't need BioPerl specifically to do simple text manipulation.

>> lcl|NC_017840.1_cdsid_YP_006280919.1 [gene=cox1] [protein=cytochrome
>> coxidase subunit 1] [protein_id=YP_006280919.1] [location=1..1584]

[...
to ...] >> cox1 Maybe you can use something like: $ sed 's/^>.*\[gene=\([^]]*\)\].*$/\1/g' >lcl|NC_017840.1_cdsid_YP_006280919.1 [gene=cox1] [protein=cytochrome coxidase subunit 1] [protein_id=YP_006280919.1] [location=1..1584] ATGACAAATCCGGTCCGATGGCTGTTCTCCACTAACCACAAGGATATAGGTACTCTATATTTCATCTTCG GTGCCATTGCTGGAGTGATGGGCACATGCTTCTCAGTACTGATTCGTATGGAATTAGCACGACCCGGCGA TCAAATTCTTGGTGGGAATCATCAACTTTATAATGTTTTAATAACGGCTCACGCTTTTTTAATGATCTTT cox1 ATGACAAATCCGGTCCGATGGCTGTTCTCCACTAACCACAAGGATATAGGTACTCTATATTTCATCTTCG GTGCCATTGCTGGAGTGATGGGCACATGCTTCTCAGTACTGATTCGTATGGAATTAGCACGACCCGGCGA TCAAATTCTTGGTGGGAATCATCAACTTTATAATGTTTTAATAACGGCTCACGCTTTTTTAATGATCTTT $ If you need to use Perl rather than sed, you can use: $ perl -pe 's/^>.*\[gene=([^]]+).*$/>$1/' instead. The easiest way is probably to learn a little programming and/or regular expressions. Learning Perl by Randal L. Schwartz, brian d foy, and Tom Phoenix could be a starting point, so could many online tutorials. Best regards, Adam -- "Hur l?ngt man ?n har kommit Adam Sj?gren ?r det alltid l?ngre kvar" asjo at koldfront.dk _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. 
======================================================================= From florent.angly at gmail.com Sun May 20 19:41:39 2012 From: florent.angly at gmail.com (Florent Angly) Date: Mon, 21 May 2012 09:41:39 +1000 Subject: [Bioperl-l] modify sequence names In-Reply-To: References: Message-ID: <4FB98133.9010906@gmail.com> Hi Yang, If you'd rather learn Bioperl and use it to solve your problem, start here: http://www.bioperl.org/wiki/HOWTO:Beginners Florent On 20/05/12 00:34, yang liu wrote: > Dear colleagues, > > Would anyone please help me to modify sequence names with bioperl? I am > editing them manually now, is there a easier way? > I have a bunch of sequences in the format: > >> lcl|NC_017840.1_cdsid_YP_006280919.1 [gene=cox1] [protein=cytochrome c > oxidase subunit 1] [protein_id=YP_006280919.1] [location=1..1584] > ATGACAAATCCGGTCCGATGGCTGTTCTCCACTAACCACAAGGATATAGGTACTCTATATTTCATCTTCG > GTGCCATTGCTGGAGTGATGGGCACATGCTTCTCAGTACTGATTCGTATGGAATTAGCACGACCCGGCGA > TCAAATTCTTGGTGGGAATCATCAACTTTATAATGTTTTAATAACGGCTCACGCTTTTTTAATGATCTTT > >> lcl|NC_017840.1_cdsid_YP_006280920.1 [gene=ccmFn] [protein=cytochrome c > biogenesis FN] [protein_id=YP_006280920.1] [location=2225..3940] > ATGTCAATAAATGCATTTTCTCATTATTCGTTCTTTCCGGGTCTTTTCGTTGCATTCACTTACAACAAGA > AAGAACCACCAGCGTTTGGTGCAGCCCCTGCATTTTGGTGCATTCTTCTTTCTTTCCTTGGTCTTTCGTT > CCGTCATATTCCTAATAACTTATCCAATTACAGCGTATTAACCGCTAATGCACCTTTCTTTTATCAAATC > > I hope to keep only the gene name, which means the word behind "gene=", > like: >> cox1 > ATGACAAATCCGGTCCGATGGCTGTTCTCCACTAACCACAAGGATATAGGTACTCTATATTTCATCTTCG > GTGCCATTGCTGGAGTGATGGGCACATGCTTCTCAGTACTGATTCGTATGGAATTAGCACGACCCGGCGA > TCAAATTCTTGGTGGGAATCATCAACTTTATAATGTTTTAATAACGGCTCACGCTTTTTTAATGATCTTT > >> ccmFn > ATGTCAATAAATGCATTTTCTCATTATTCGTTCTTTCCGGGTCTTTTCGTTGCATTCACTTACAACAAGA > AAGAACCACCAGCGTTTGGTGCAGCCCCTGCATTTTGGTGCATTCTTCTTTCTTTCCTTGGTCTTTCGTT > CCGTCATATTCCTAATAACTTATCCAATTACAGCGTATTAACCGCTAATGCACCTTTCTTTTATCAAATC > > Any help would be appreciated. 
Thanks,
>
> Yang.

From jovel_juan at hotmail.com Mon May 21 09:32:22 2012
From: jovel_juan at hotmail.com (Juan Jovel)
Date: Mon, 21 May 2012 13:32:22 +0000
Subject: [Bioperl-l] Little question about Bio::SeqIO::fastq
In-Reply-To: <4FB98133.9010906@gmail.com>
References: , <4FB98133.9010906@gmail.com>
Message-ID:

Hello All!

When using 'Bio::SeqIO::fastq' how can I access each of the lines of a fastq entry?

I need to split multiplexed libraries from MiSeq runs (which output in fastq format), and therefore I want to access the 'id' line of each sequence (@) to be able to split the file according to Illumina indices. How can I do that?

Thanks a lot in advance.

Juan

From h.montenegro at gmail.com Mon May 21 13:28:12 2012
From: h.montenegro at gmail.com (Horácio Montenegro)
Date: Mon, 21 May 2012 14:28:12 -0300
Subject: [Bioperl-l] BioPerl 1.6.901 and prot4est
Message-ID:

Dear all,

I am trying to set up prot4est and ran into problems with the bioperl from debian testing repositories (1.6.901). It breaks one of the scripts (constructSMAT.pl) from prot4est:

~/bin/p4e3.1b/exampleData$ ../bin/constructSMAT.pl --access p4e_access.txt --config ALC_smat.config
CLEAN => 1
SPECIES => Ascaris lumbricoides
FSA_FILE => ~/bin/p4e3.1b/exampleData/./A.lumbricoides_sim.fsa
EMBL_SEARCH => 1
This dataset is from Ascaris lumbricoides (6252)
Can't call method "ancestor" on an undefined value at /usr/local/share/perl/5.12.4/Bio/Taxon.pm line 513, line 1.

The culprit is sub ancestor in Taxon.pm. Debugging a bit I found that a call to write_seq($emblO) in sub fsa2embl (from emblConnect.pl, another prot4est script) triggers the bug. Anyway, the workaround so far is to use bioperl 1.6.1. In fact, if I use bioperl 1.6.901, but manually replace sub ancestor with the one from bioperl 1.6.1, prot4est runs normally.
I do not know if this is a new bioperl bug, or if the changes in sub ancestor revealed some bug in emblConnect.pl.

best regards,
Horacio

From cjfields at illinois.edu Mon May 21 13:43:38 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 21 May 2012 17:43:38 +0000
Subject: [Bioperl-l] BioPerl 1.6.901 and prot4est
In-Reply-To:
References:
Message-ID: <67A7D548-E684-4C7D-82B7-59C834EE4747@illinois.edu>

Horácio,

Re: where the bug lies, it's hard to say. There have been several significant changes with bioperl's Taxonomy/Tree code over the last several years. Does the prot4est author (James Wasmuth) know about this?

    http://www.compsysbio.org/lab/?q=prot4EST

You are also more than welcome to submit a bug report on this so we can track it. The following page describes how and where to do so:

    http://www.bioperl.org/wiki/Bugs

(if you hear anything from the prot4est folks let us know)

chris

On May 21, 2012, at 12:28 PM, Horácio Montenegro wrote:

> Dear all,
>
> I am trying to set up prot4est and ran into problems with the
> bioperl from debian testing repositories (1.6.901). It breaks one of
> the scripts (contructSMAT.pl) from prot4est:
>
> ~/bin/p4e3.1b/exampleData$ ../bin/constructSMAT.pl --access
> p4e_access.txt --config ALC_smat.config
> CLEAN => 1
> SPECIES => Ascaris lumbricoides
> FSA_FILE => ~/bin/p4e3.1b/exampleData/./A.lumbricoides_sim.fsa
> EMBL_SEARCH => 1
> This dataset is from Ascaris lumbricoides (6252)
> Can't call method "ancestor" on an undefined value at
> /usr/local/share/perl/5.12.4/Bio/Taxon.pm line 513, line 1.
>
> The culprit is sub ancestor at Taxon.pm. Debugging a bit I found
> that a call to write_seq($emblO) on sub fsa2embl (from emblConnect.pl,
> another prot4est script) fires the bug. Anyway, the workaround so far
> is to use bioperl 1.6.1. In fact, if I use bioperl 1.6.901, but
> manually replace sub ancestor with the one from bioperl 1.6.1,
> prot4est runs normally.
> > I do not know if this is a new bioperl bug, or if the changes in > sub ancestor revealed some bug in emblConnect.pl. > > best regards, > Horacio > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jason.stajich at gmail.com Mon May 21 14:29:57 2012 From: jason.stajich at gmail.com (Jason Stajich) Date: Mon, 21 May 2012 11:29:57 -0700 Subject: [Bioperl-l] Bio::Seq->subseq documentation In-Reply-To: <3AE41ED1-9CB6-432C-ABFC-B01E6B30C1A8@tamu.edu> References: <3AE41ED1-9CB6-432C-ABFC-B01E6B30C1A8@tamu.edu> Message-ID: It should be changed in git, but I am not sure if the autogeneration of the pdoc is still going on on the main website host. Jason On May 18, 2012, at 1:56 PM, Jim Hu wrote: > Is the doc generated from bioperl-live or from the last stable version? If the former, I suppose I could change it in git (or ask Nathan in my lab to do it). > > On May 18, 2012, at 9:49 AM, Joel Martin wrote: > >> the documentation for Bio::Seq looks out of sync to me, the code for >> Bio::Seq is just >> sub subseq { >> return shift->primary_seq()->subseq(@_); >> } >> >> so, it takes what Bio::PrimarySeq takes. >> >> Joel >> >> On Thu, May 17, 2012 at 10:07 PM, Jim Hu wrote: >>> In the page for Bio::Seq, >>> >>> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html#POD5 >>> >>> I think the usage should match the documentation for Bio::PrimarySeq >>> >>> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/PrimarySeq.html#POD4 >>> >>> indicating that the arguments can be integers OR location objects. Is that correct? >>> >>> Jim >>> ===================================== >>> Jim Hu >>> Professor >>> Dept. of Biochemistry and Biophysics >>> 2128 TAMU >>> Texas A&M Univ. 
>>> College Station, TX 77843-2128 >>> 979-862-4054 >>> >>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ===================================== > Jim Hu > Professor > Dept. of Biochemistry and Biophysics > 2128 TAMU > Texas A&M Univ. > College Station, TX 77843-2128 > 979-862-4054 > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From florent.angly at gmail.com Mon May 21 19:14:11 2012 From: florent.angly at gmail.com (Florent Angly) Date: Tue, 22 May 2012 09:14:11 +1000 Subject: [Bioperl-l] Little question about Bio::SeqIO::fastq In-Reply-To: References: , <4FB98133.9010906@gmail.com> Message-ID: <4FBACC43.60508@gmail.com> Hi Juan, Have a look at the beginners howto and the SeqIO howto here: http://www.bioperl.org/wiki/HOWTOs Regardless of what sequence format you use (fastq, fasta, etc), SeqIO can be used in the same fashion to read, edit and write sequence ID, description and sequence string. Best, Florent On 21/05/12 23:32, Juan Jovel wrote: > Hello All! > When using 'Bio::SeqIO::fastq' how can I access each of the lines of a fastq entries? > I need to split multiplexed libraries from MiSeq runs (which outputs in fastq format), and therefore I want to access the 'id' line of each sequence (@) to be able to split the file according to Illumina indices. How to do that? > thanks a lot in advance. 
> Juan

From limericksean at gmail.com Wed May 16 14:05:28 2012
From: limericksean at gmail.com (Sean O'Keeffe)
Date: Wed, 16 May 2012 14:05:28 -0400
Subject: [Bioperl-l] fastq splitter - working but not before xmas!!
Message-ID:

So now I've got a bunch of fastq's all about 17GB in size. The script is puttering away but this is tediously slow.
I tried the fastq-dump tool from sra toolkit but it didn't like my commands (fastq-dump --split-files ) - my ignorance no doubt.
Any ideas out there on speeding up Bio::SeqIO::fastq output?
Thanks.

On 1 March 2012 03:16, Joel Martin wrote:

> Just a caution to double check that the read1 and read2 names match after
> splitting. I don't know if this thread jinxed me or what, but I just for
> the first time received a concatenated fastq file formatted as you
> describe, except the first read1 doesn't match the first read2. zut alores!
>
> came up with converting to scarf, /usr/bin/sort the scarf, then read that
> with tossing into single or paired files and reconverting to fastq in the
> process. it wasn't too bad, but I don't think bioperl has a scarf
> conversion, it's basically fastq with : substituted for \n. most
> delimiters that aren't : would work better but i already had a fastq2scarf
> from early solexa days ( i think ).
>
> # this was the last step, if it's handy for this plague of hideous files,
> # the fixed fields for : would need adjusting
> use strict;
>
> open( my $oph, '>', 'paired.fq' ) or die $!;
> open( my $osh, '>', 'single.fq' ) or die $!;
>
> my ( $pend, $pname, $pline );
>
> while ( <>) {
>     my ( $name, $end ) = /^(\S+)\s(\d)/;
>
>     if ( $end == 1 ) {
>         if ( $pend ) {
>             print_reads( $osh, $pline );
>         }
>         $pend = $end;
>         $pname = $name;
>         $pline = $_;
>     }
>     elsif ( $end == 2 ) {
>         my $fh = $pend == 1 && $pname eq $name ? $oph : $osh;
>         print_reads( $fh, $pline, $_ );
>         $pend = '';
>     }
>     else {
>         die "ERROR: can't interpret line $. $_";
>     }
> }
> sub print_reads {
>     my ( $fh, @reads ) = @_;
>     for my $scarf ( @reads ) {
>         my @stuff = split /:/,$scarf,12;
>         print $fh '@',join(':',@stuff[0..9]),"\n$stuff[10]\n+\n$stuff[11]";
>     }
> }
>
> Joel
>
> On Wed, Feb 29, 2012 at 11:52 AM, George Hartzell wrote:
>
>> Fields, Christopher J writes:
>> > Just want to say, if you can set up a local perl and local::lib it
>> > makes your life a LOT easier. Particularly if you are running jobs
>> > on older versions of RHEL, which notoriously stuck with
>> > outdated/broken versions of perl (as well as other tools).
>> > [...]
>>
>> And Perlbrew takes away your last excuse for not building perls and
>> setting up local::lib's.
>>
>> http://perlbrew.pl/
>>
>> g.
>>
>
>

From exceptlowang at gmail.com Thu May 17 07:07:50 2012
From: exceptlowang at gmail.com (Tim White)
Date: Thu, 17 May 2012 23:07:50 +1200
Subject: [Bioperl-l] Bio::SeqIO::tab deletes gap characters when reading sequences, which is inconvenient
In-Reply-To:
References: <4F8E03FE.7000506@gmail.com>
Message-ID:

Wonderful thanks Chris!
Tim

On Fri, May 11, 2012 at 8:56 AM, Fields, Christopher J <
cjfields at illinois.edu> wrote:

> Tim,
>
> This one got stuck in my drafts folder :P
>
> Easy enough to do. I've added this in to the master branch, commit
> eece9dd.
>
> chris
>
> On Apr 17, 2012, at 6:59 PM, Tim White wrote:
>
> > Hi,
> >
> > Bio::SeqIO::tab (what you get when specifying -format => 'tab' to
> Bio::SeqIO->new()) is perfect for converting sequences into a one-per-line
> format, so that standard line-oriented UNIX tools (grep, comm etc.) work as
> expected. Except...
> I just discovered that it deletes gap ("-")
> characters when reading sequences, so it can't be used to round-trip any
> files that contain these. This is a source of grief as I frequently work
> with FASTA files that contain aligned sequences, and thus gap characters.
> >
> > This is all because the next_seq() function in Bio::SeqIO::tab.pm contains the line:
> >
> >     $seq =~ s/\W//g;
> >
> > which removes all non-alphanumeric characters from the sequence data.
> IMHO it would be *much* better if this was changed to:
> >
> >     $seq =~ s/\s//g;
> >
> > which simply removes all whitespace characters (particularly including
> the \r that often appears at the ends of lines on text files that have
> visited Windows), enabling gap characters (and, for example, periods and
> asterisks) to be preserved. Alternatively, you could simply get rid of
> this line of code and allow whitespace characters through.
> >
> > I'm not sure whether this counts as a "bug", as a cursory search didn't
> turn up any docs explaining precisely what characters are and aren't
> preserved by classes implementing Bio::SeqIO, but it's certainly
> inconsistent (at least Bio::SeqIO::fasta, and Bio::SeqIO::table, with
> columns and delimiters set up appropriately, allow round-tripping of files
> containing gap characters) as well as extremely inconvenient for me
> personally, and I suspect for others. Assuming no harm would be done by
> making the above change, what's the best thing to do to get this changed?
> I've simply edited my own local copy of tab.pm to make the above change,
> but obviously if others agree I'd like to get the change done upstream.
> > > > Thanks, > > Tim > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From lincoln.stein at gmail.com Tue May 22 10:19:32 2012 From: lincoln.stein at gmail.com (Lincoln Stein) Date: Tue, 22 May 2012 10:19:32 -0400 Subject: [Bioperl-l] bioperl count sam reads In-Reply-To: References: Message-ID: You are doing it the hard way. The easy way is to use the pileup() method and provide a callback that counts the type of base that you wish (i.e. don't count gaps). Lincoln On Wed, May 2, 2012 at 9:39 PM, Mark Aquino wrote: > Hi all, > > I'm a little stumped as to how to successfully count the depths of all > reads at a specific locus in a sam/bam file. I know I can do this with > GATK DepthOfCoverage but I wanted to do some more customized things with my > script yet I haven't figured out how to get the right base. I was a bit > surprised there wasn't (or that it's not well documented) a method to get > the individuals read's base at a specific position while getting the > $refbase is quite easy. (I'm betting such a method exists and is just not > documented well) > > At any rate, gaps in the alignment are the cause for my problems, so if > anyone knows a simpler way to do call the bases correctly, or a clever > algorithm to deal with this issue, it would be much appreciated. Here's > what I have for code and it works except in cases where there are multiple > gaps in the reference sequence, e.g. the alignment below should be T-T here > not C-C but is shifted due to the second gap. 
>
>
> #!/progs/bin/perl
> use strict;
> use warnings;
> use Bio::DB::Sam;
> use Bio::DB::Bam::AlignWrapper;
> use Pod::Usage;
> use Getopt::Long;
> use Bio::DB::Bam::Pileup;
> use Term::ANSIColor;
>
> my $sam = Bio::DB::Sam->new(-bam =>$BAM,
>                             -fasta=> $FASTA);
> getBases($chr, $pos, $pos);
>
>
> sub getBases {
>     my $print = 1;
>     my ($chr, $start_query, $end_query) = @_;
>     my @alignments = $sam->get_features_by_location(-seq_id => $chr,
>                                                     -start => $start_query,
>                                                     -end => $end_query);
>     my $refbase;
>     my ($a_count, $t_count, $g_count, $c_count, $n_count, $del_count,
> $ins_count) = (0, 0, 0, 0, 0, 0, 0);
>     for my $a (@alignments) {
>
>         my $start = $a->start;
>         my $end = $a->end;
>
>         my $query_start = $a->query->start;
>         my $query_end = $a->query->end;
>         my $ref_dna = $a->dna; # reference sequence bases
>         my ($ref, $matches, $query) = $a->padded_alignment;
>         my $offset = 0;
>         if ($ref =~ /^([-]+)[ATCG]+/){
>             $offset = length($1);
>         }
>         #print "$offset\n";
>         $refbase = $sam->segment($chr,$start_query,$start_query)->dna;
>
>         printAlignment($ref, $matches, $query, $start_query, $start,
> $offset);
>         my $base = substr($query, $start_query-$start+$offset, 1);
>         if (!$base){
>             next;
>         }
>         $a_count++ if ($base eq "A");
>         $t_count++ if ($base eq "T");
>         $c_count++ if ($base eq "C");
>         $g_count++ if ($base eq "G");
>         $n_count++ if ($base eq "N");
>         $del_count++ if ($base eq "-");
>         my @scores = $a->qscore; # per-base quality scores
>         my $match_qual= $a->qual; # quality of the match
>     }
>     my $total_depth = $a_count + $t_count + $c_count + $g_count + $n_count
> + $del_count;
>     if ($print == 1){
>         # print "$start_query\tref base: $ref_base\n";
>         print "$chr:$start_query($refbase)\t";
>         print "A:$a_count\t";
>         print "T:$t_count\t";
>         print "C:$c_count\t";
>         print "G:$g_count\t";
>         print "N:$n_count\t";
>         print "D:$del_count\t";
>         print "Total:$total_depth\n";
>     }
>     return ($a_count, $t_count, $c_count, $g_count, $n_count, $del_count,
> $ins_count);
> }
> sub printAlignment{
>     my ($ref, $matches, $query, $start_query, $start, $offset) = @_;
>     print substr($ref, 0, $start_query-$start+$offset);
>     print (color("red"), substr($ref, $start_query-$start+$offset, 1),
> color("reset"));
>     print substr($ref, $start_query-$start+$offset+1),"\n";
>     print substr($matches, 0, $start_query-$start+$offset);
>     print (color("red"), substr($matches, $start_query-$start+$offset, 1),
> color("reset"));
>     print substr($matches, $start_query-$start+$offset+1),"\n";
>     print substr($query, 0, $start_query-$start+$offset);
>     print (color("red"), substr($query, $start_query-$start+$offset, 1),
> color("reset"));
>     print substr($query, $start_query-$start+$offset+1),"\n";
> }
>
>

--
Lincoln D. Stein
Director, Informatics and Biocomputing Platform
Ontario Institute for Cancer Research
101 College St., Suite 800
Toronto, ON, Canada M5G0A3
416 673-8514
Assistant: Renata Musa

From h.montenegro at gmail.com Tue May 22 12:04:57 2012
From: h.montenegro at gmail.com (Horácio Montenegro)
Date: Tue, 22 May 2012 13:04:57 -0300
Subject: [Bioperl-l] BioPerl 1.6.901 and prot4est
In-Reply-To: <67A7D548-E684-4C7D-82B7-59C834EE4747@illinois.edu>
References: <67A7D548-E684-4C7D-82B7-59C834EE4747@illinois.edu>
Message-ID:

hi Chris,

yes, I reported this behaviour to the prot4est author a couple of months ago, but got only an automated reply - it says he is the only developer and is not always able to answer. I also tried to contact Sendu Bala, the maintainer (as found on Bioperl 1.6.1 source), but got no reply either.

I will try to debug a bit more, and then I will submit a bug report.
thanks,
Horacio

On Mon, May 21, 2012 at 2:43 PM, Fields, Christopher J wrote:
> Horácio,
>
> Re: where the bug lies, it's hard to say. There have been several significant changes with bioperl's Taxonomy/Tree code over the last several years. Does the prot4est author (James Wasmuth) know about this?
>
>     http://www.compsysbio.org/lab/?q=prot4EST
>
> You are also more than welcome to submit a bug report on this so we can track it. The following page describes how and where to do so:
>
>     http://www.bioperl.org/wiki/Bugs
>
> (if you hear anything from the prot4est folks let us know)
>
> chris
>
>

From cjfields at illinois.edu Tue May 22 12:13:38 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 22 May 2012 16:13:38 +0000
Subject: [Bioperl-l] BioPerl 1.6.901 and prot4est
In-Reply-To:
References: <67A7D548-E684-4C7D-82B7-59C834EE4747@illinois.edu>
Message-ID:

Horacio,

Actually, for the 1.6.x releases I have been the release manager; Sendu released the later 1.5.x releases. If you can whittle down the problem to a manageably reproducible issue you can submit it along with any relevant code/data to the bug report system, just let me know when you do so I can have a look.

chris

On May 22, 2012, at 11:04 AM, Horácio Montenegro wrote:

> hi Chris,
>
> yes, I reported this behaviour to the prot4est author a couple of
> months ago, but got only an automated reply - it says he is the only
> developer and is not always able to answer. I also tried to contact
> Sendu Bala, the maintainer (as found on Bioperl 1.6.1 source), but got
> no reply either.
>
> I will try to debug a bit more, and then I will submit a bug report.
>
> thanks,
> Horacio
>
> On Mon, May 21, 2012 at 2:43 PM, Fields, Christopher J
> wrote:
>> Horácio,
>>
>> Re: where the bug lies, it's hard to say. There have been several significant changes with bioperl's Taxonomy/Tree code over the last several years. Does the prot4est author (James Wasmuth) know about this?
>>
>>     http://www.compsysbio.org/lab/?q=prot4EST
>>
>> You are also more than welcome to submit a bug report on this so we can track it. The following page describes how and where to do so:
>>
>>     http://www.bioperl.org/wiki/Bugs
>>
>> (if you hear anything from the prot4est folks let us know)
>>
>> chris
>>
>>

From tristan.lefebure at gmail.com Tue May 22 13:05:12 2012
From: tristan.lefebure at gmail.com (Tristan Lefebure)
Date: Tue, 22 May 2012 19:05:12 +0200
Subject: [Bioperl-l] Codon boostraping
In-Reply-To: <016136AC-2472-4FA4-8C3C-E193B0D3264E@gmail.com>
References: <2084056.yurtXCQ8PC@picodon> <016136AC-2472-4FA4-8C3C-E193B0D3264E@gmail.com>
Message-ID: <8110211.WKv5L8tyG6@picodon>

Hi Jason, so I am trying to find my way into the github world, but well I am stuck after doing this:

git clone git@github.com:TristanLefebure/bioperl-live.git
cd bioperl-live/
git remote add upstream git://github.com/bioperl/bioperl-live.git
git fetch upstream
git push origin master
#doing some editing to Bio/Align/Utilities.pm
git add Bio/Align/Utilities.pm
git commit -m "Little patch to Utilities.pm to allow codon bootstraping"
git show
#looks good, feeling quite happy, for the moment

Then I moved to github bioperl webpage -> pull request, but it says:
"Oops! The bioperl:master branch is already up-to-date with TristanLefebure:master - maybe you want to try something else?"
#now I feel like a cow with a smartphone...

What is the stupid thing I am missing?

Thanks!

--
Tristan

On Wednesday 09 May 2012 12:52:43 Jason Stajich wrote:
> Tristan - it looks pretty good -- have you had any luck forking code on
> github? this way your patch can be rolled in with your name as the
> developer if you submit the changes as a pull request.
>
> http://help.github.com/fork-a-repo/
>
> We can certainly commit things without this, but just trying to give you
> opportunity to be more closely involved if you would like.
>
> Jason
>
> On May 9, 2012, at 8:23 AM, Tristan Lefebure wrote:
> > Hi there,
> >
> > Just submitted the following patch to do codon bootstrapping:
> >
> > https://redmine.open-bio.org/issues/3350
> >
> > I'll appreciate your comments on this tiny proposed addition to
> > Bio::Align::Utilities
> >
> > Thanks!
> >
> > --
> > Tristan Lefebure
> >
>
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org

From p.j.a.cock at googlemail.com Tue May 22 13:15:53 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 May 2012 18:15:53 +0100
Subject: [Bioperl-l] Codon boostraping
In-Reply-To: <8110211.WKv5L8tyG6@picodon>
References: <2084056.yurtXCQ8PC@picodon> <016136AC-2472-4FA4-8C3C-E193B0D3264E@gmail.com> <8110211.WKv5L8tyG6@picodon>
Message-ID:

On Tue, May 22, 2012 at 6:05 PM, Tristan Lefebure wrote:
> Hi Jason, so I am trying to find my way into the github world, but well I am
> stuck after doing this:
>
> git clone git@github.com:TristanLefebure/bioperl-live.git
> cd bioperl-live/
> git remote add upstream git://github.com/bioperl/bioperl-live.git
> git fetch upstream
> git push origin master
> #doing some editing to Bio/Align/Utilities.pm
> git add Bio/Align/Utilities.pm
> git commit -m "Little patch to Utilities.pm to allow codon bootstraping"
> git show
> #looks good, feeling quite happy, for the moment

At that point the change only exists on your local hard drive.

As an aside, at this stage it is still safe to amend the 'unpublished' commits (which you should never do once they have been made public - aka rewriting history). e.g. To fix the spelling error in the message:

git commit --amend -m "Little patch to Utilities.pm to allow codon bootstrapping"

You need to push the change to your repository on github.
However, it would have been good to have made a new branch first... I think
this would do the trick:

git checkout -b codonbs
#Create new branch (from current code) called codonbs

git push origin codonbs
#'Copy' this branch to your repo on github

After that you should be able to see the new branch on github,
and from there issue a pull request. This is what we've written
for Biopython on using git and github - which we should probably
review now we've been using github for a while:
http://biopython.org/wiki/GitUsage

Peter

From cjfields at illinois.edu Tue May 22 13:25:48 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 22 May 2012 17:25:48 +0000
Subject: [Bioperl-l] Codon boostraping
In-Reply-To:
References: <2084056.yurtXCQ8PC@picodon> <016136AC-2472-4FA4-8C3C-E193B0D3264E@gmail.com> <8110211.WKv5L8tyG6@picodon>
Message-ID: <464BAA25-2526-4D05-821A-56CB8BDF9266@illinois.edu>

On May 22, 2012, at 12:15 PM, Peter Cock wrote:

> On Tue, May 22, 2012 at 6:05 PM, Tristan Lefebure
> wrote:
>> Hi Jason, so I am trying to find my way into the github world, but well I am
>> stuck after doing this:
>>
>> git clone git@github.com:TristanLefebure/bioperl-live.git
>> cd bioperl-live/
>> git remote add upstream git://github.com/bioperl/bioperl-live.git
>> git fetch upstream
>> git push origin master
>> #doing some editing to Bio/Align/Utilities.pm
>> git add Bio/Align/Utilities.pm
>> git commit -m "Little patch to Utilities.pm to allow codon bootstraping"
>> git show
>> #looks good, feeling quite happy, for the moment
>
> At that point the change only exists on your local hard drive.
>
> As an aside, at this stage it is still safe to amend the 'unpublished'
> commits (which you should never do once they have been made
> public - aka rewriting history). e.g. To fix the spelling error in the
> message:
>
> git commit --amend -m "Little patch to Utilities.pm to allow codon bootstrapping"
>
> You need to push the change to your repository on github. However,
> it would have been good to have made a new branch first... I think
> this would do the trick:
>
> git checkout -b codonbs
> #Create new branch (from current code) called codonbs
>
> git push origin codonbs
> #'Copy' this branch to your repo on github
>
> After that you should be able to see the new branch on github,
> and from there issue a pull request. This is what we've written
> for Biopython on using git and github - which we should probably
> review now we've been using github for a while:
> http://biopython.org/wiki/GitUsage
>
> Peter

Regarding git/github documentation, IMO it would be a good idea to
consolidate documentation where necessary to a common place we can all
refer to for the basics (maybe place it on the open-bio wiki), and then
leave project-specific stuff for the various Bio* wikis. The more eyes
on it the better overall documentation we'll all have.

Same could be said for a lot of the non-bioperl stuff (format/app/etc)
on bioperl.org though the final site should probably be wikipedia.

chris

From p.j.a.cock at googlemail.com Tue May 22 13:29:57 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 May 2012 18:29:57 +0100
Subject: [Bioperl-l] Codon boostraping
In-Reply-To: <464BAA25-2526-4D05-821A-56CB8BDF9266@illinois.edu>
References: <2084056.yurtXCQ8PC@picodon> <016136AC-2472-4FA4-8C3C-E193B0D3264E@gmail.com> <8110211.WKv5L8tyG6@picodon> <464BAA25-2526-4D05-821A-56CB8BDF9266@illinois.edu>
Message-ID:

On Tue, May 22, 2012 at 6:25 PM, Fields, Christopher J wrote:
>
> On May 22, 2012, at 12:15 PM, Peter Cock wrote:
>
>> ... This is what we've written
>> for Biopython on using git and github - which we should probably
>> review now we've been using github for a while:
>> http://biopython.org/wiki/GitUsage
>>
>> Peter
>
> Regarding git/github documentation, IMO it would be a good idea to
> consolidate documentation where necessary to a common place we
> can all refer to for the basics (maybe place it on the open-bio wiki),
> and then leave project-specific stuff for the various Bio* wikis. The
> more eyes on it the better overall documentation we'll all have.

Sounds good. Are there already any equivalent BioPerl (etc) pages?

> Same could be said for a lot of the non-bioperl stuff (format/app/etc)
> on bioperl.org though the final site should probably be wikipedia.

Well, if we link it in to the common file format names we (the OBF
Bio* projects) use, then again, maybe on open-bio.org.

Peter

From cjfields at illinois.edu Tue May 22 14:07:13 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 22 May 2012 18:07:13 +0000
Subject: [Bioperl-l] Codon boostraping
In-Reply-To:
References: <2084056.yurtXCQ8PC@picodon> <016136AC-2472-4FA4-8C3C-E193B0D3264E@gmail.com> <8110211.WKv5L8tyG6@picodon> <464BAA25-2526-4D05-821A-56CB8BDF9266@illinois.edu>
Message-ID: <5F330949-6B0A-4FA9-B057-9AD27CFE220A@illinois.edu>

On May 22, 2012, at 12:29 PM, Peter Cock wrote:

> On Tue, May 22, 2012 at 6:25 PM, Fields, Christopher J
> wrote:
>>
>> On May 22, 2012, at 12:15 PM, Peter Cock wrote:
>>
>>> ... This is what we've written
>>> for Biopython on using git and github - which we should probably
>>> review now we've been using github for a while:
>>> http://biopython.org/wiki/GitUsage
>>>
>>> Peter
>>
>> Regarding git/github documentation, IMO it would be a good idea to
>> consolidate documentation where necessary to a common place we
>> can all refer to for the basics (maybe place it on the open-bio wiki),
>> and then leave project-specific stuff for the various Bio* wikis. The
>> more eyes on it the better overall documentation we'll all have.
>
> Sounds good. Are there already any equivalent BioPerl (etc) pages?

Work in progress, but it has a few useful bits:

http://www.bioperl.org/wiki/Using_Git

>> Same could be said for a lot of the non-bioperl stuff (format/app/etc)
>> on bioperl.org though the final site should probably be wikipedia.
>
> Well, if we link it in to the common file format names we (the OBF
> Bio* projects) use, then again, maybe on open-bio.org.
>
> Peter

I'm open either way (no pun intended) but I think the need for
information on these might extend beyond OBF, hence my wikipedia
suggestion. Jason has mentioned this in the past as well.

chris

From tristan.lefebure at gmail.com Wed May 23 07:39:43 2012
From: tristan.lefebure at gmail.com (Tristan Lefebure)
Date: Wed, 23 May 2012 13:39:43 +0200
Subject: [Bioperl-l] Codon boostraping
In-Reply-To:
References: <2084056.yurtXCQ8PC@picodon> <016136AC-2472-4FA4-8C3C-E193B0D3264E@gmail.com> <8110211.WKv5L8tyG6@picodon>
Message-ID:

So I did what Peter suggested: see the commit c7b3f10220 in the branch
codonbs

Thanks for your help,

--
Tristan

On Tue, May 22, 2012 at 7:15 PM, Peter Cock wrote:
> On Tue, May 22, 2012 at 6:05 PM, Tristan Lefebure
> wrote:
>> Hi Jason, so I am trying to find my way into the github world, but well I am
>> stuck after doing this:
>>
>> git clone git@github.com:TristanLefebure/bioperl-live.git
>> cd bioperl-live/
>> git remote add upstream git://github.com/bioperl/bioperl-live.git
>> git fetch upstream
>> git push origin master
>> #doing some editing to Bio/Align/Utilities.pm
>> git add Bio/Align/Utilities.pm
>> git commit -m "Little patch to Utilities.pm to allow codon bootstraping"
>> git show
>> #looks good, feeling quite happy, for the moment
>
> At that point the change only exists on your local hard drive.
>
> As an aside, at this stage it is still safe to amend the 'unpublished'
> commits (which you should never do once they have been made
> public - aka rewriting history). e.g. To fix the spelling error in the
> message:
>
> git commit --amend -m "Little patch to Utilities.pm to allow codon bootstrapping"
>
> You need to push the change to your repository on github. However,
> it would have been good to have made a new branch first... I think
> this would do the trick:
>
> git checkout -b codonbs
> #Create new branch (from current code) called codonbs
>
> git push origin codonbs
> #'Copy' this branch to your repo on github
>
> After that you should be able to see the new branch on github,
> and from there issue a pull request. This is what we've written
> for Biopython on using git and github - which we should probably
> review now we've been using github for a while:
> http://biopython.org/wiki/GitUsage
>
> Peter

From b.m.forde at umail.ucc.ie Thu May 24 06:27:36 2012
From: b.m.forde at umail.ucc.ie (Brian Forde)
Date: Thu, 24 May 2012 11:27:36 +0100
Subject: [Bioperl-l] Extracting sequences from Genbank files
In-Reply-To:
References:
Message-ID:

Hello,

I have been modifying a script which extracts all the protein sequences
from a genbank file and saves them in a multi-fasta file.

I wish the fasta header to have both the locus_tag of the protein and the
product. However I cannot get the product tag to write to the fasta header

this is the relevant section of the script

$s->display_id($f->has_tag('locus_tag') ? join(',',sort $f->each_tag_value('locus_tag')) :
               $f->has_tag('product') ? join(',',$f->each_tag_value('product')):
               $s->display_id);

is "product" not an actual tag

regards

Brian

--
Brian Forde
Microbiology Dept.
Bioscience Institute.
Room 4.11
University College Cork
Cork
Ireland
tel:+353 21 4901306
email: b.m.forde at umail.ucc.ie

From roy.chaudhuri at gmail.com Thu May 24 06:49:08 2012
From: roy.chaudhuri at gmail.com (Roy Chaudhuri)
Date: Thu, 24 May 2012 11:49:08 +0100
Subject: [Bioperl-l] Extracting sequences from Genbank files
In-Reply-To:
References:
Message-ID: <4FBE1224.5030603@gmail.com>

Isn't it "get_tag_values" not "each_tag_value"? Maybe you are using an
old version, in which case you should probably upgrade to the most
recent BioPerl (1.6.901).

You could also look at "get_tagset_values", which does not throw an
error if the tag is not present, so saves having to call has_tag.

Cheers,
Roy.

On 24/05/2012 11:27, Brian Forde wrote:
> Hello,
>
> I have been modifying a script which extracts all the protein sequences
> from a genbank file and saves them in a multi-fasta file.
>
> I wish the fasta header to have both the locus_tag of the protein and the
> product. However I cannot get the product tag to write to the fasta header
>
> this is the relevant section of the script
>
> $s->display_id($f->has_tag('locus_tag') ? join(',',sort
> $f->each_tag_value('locus_tag')) :
> $f->has_tag('product') ?
> join(',',$f->each_tag_value('product')):
> $s->display_id);
>
> is "product" not an actual tag
>
> regards
>
> Brian
>
>
>

From liam.elbourne at mq.edu.au Thu May 24 06:57:58 2012
From: liam.elbourne at mq.edu.au (Liam Elbourne)
Date: Thu, 24 May 2012 20:57:58 +1000
Subject: [Bioperl-l] Extracting sequences from Genbank files
In-Reply-To:
References:
Message-ID: <84B23D8D-4D0E-46DD-9901-70ED2A0E2603@mq.edu.au>

Hi Brian,

Check out:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Getting_the_Features

But I think get_tag_values is the method you require.

Regards,
Liam Elbourne.

On 24/05/2012, at 8:27 PM, Brian Forde wrote:

> Hello,
>
> I have been modifying a script which extracts all the protein sequences
> from a genbank file and saves them in a multi-fasta file.
>
> I wish the fasta header to have both the locus_tag of the protein and the
> product. However I cannot get the product tag to write to the fasta header
>
> this is the relevant section of the script
>
> $s->display_id($f->has_tag('locus_tag') ? join(',',sort
> $f->each_tag_value('locus_tag')) :
> $f->has_tag('product') ?
> join(',',$f->each_tag_value('product')):
> $s->display_id);
>
> is "product" not an actual tag
>
> regards
>
> Brian
>
>
>
> --
> Brian Forde
> Microbiology Dept.
> Bioscience Institute. Room 4.11
> University College Cork
> Cork
> Ireland
> tel:+353 21 4901306
> email: b.m.forde at umail.ucc.ie

From cjfields at illinois.edu Thu May 24 15:05:40 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 24 May 2012 19:05:40 +0000
Subject: [Bioperl-l] fastq splitter - working but not before xmas!!
In-Reply-To:
References:
Message-ID:

Sean,

I have been working on a XS-based interface that basically just returns
hashrefs, this uses Heng Li's kseq.h library. I can probably push this
out to CPAN sometime in the next week or so. I did some initial (very
rough) benchmarks and when using a simple count it parsed 30M reads in
about 25-30 seconds.

chris

On May 16, 2012, at 1:05 PM, Sean O'Keeffe wrote:

> So now I've got a bunch of fastq's all about 17GB in size. The script is
> puttering away but this is tediously slow.
> I tried the fastq-dump tool from sra toolkit but it didn't like my
> commands (fastq-dump --split-files ) - my ignorance no
> doubt.
>
> Any ideas out there on speeding up Bio::SeqIO::fastq output?
> Thanks.
>
> On 1 March 2012 03:16, Joel Martin wrote:
>
>> Just a caution to double check that the read1 and read2 names match after
>> splitting. I don't know if this thread jinxed me or what, but I just for
>> the first time received a concatenated fastq file formatted as you
>> describe, except the first read1 doesn't match the first read2. zut alors!
>>
>> came up with converting to scarf, /usr/bin/sort the scarf, then read that
>> with tossing into single or paired files and reconverting to fastq in the
>> process. it wasn't too bad, but I don't think bioperl has a scarf
>> conversion, it's basically fastq with : substituted for \n. most
>> delimiters that aren't : would work better but i already had a fastq2scarf
>> from early solexa days ( i think ).
>>
>> # this was the last step, if it's handy for this plague of hideous files,
>> # the fixed fields for : would need adjusting
>> use strict;
>>
>> open( my $oph, '>', 'paired.fq' ) or die $!;
>> open( my $osh, '>', 'single.fq' ) or die $!;
>>
>> my ( $pend, $pname, $pline );
>>
>> while ( <>) {
>>   my ( $name, $end ) = /^(\S+)\s(\d)/;
>>
>>   if ( $end == 1 ) {
>>     if ( $pend ) {
>>       print_reads( $osh, $pline );
>>     }
>>     $pend = $end;
>>     $pname = $name;
>>     $pline = $_;
>>   }
>>   elsif ( $end == 2 ) {
>>     my $fh = $pend == 1 && $pname eq $name ? $oph : $osh;
>>     print_reads( $fh, $pline, $_ );
>>     $pend = '';
>>   }
>>   else {
>>     die "ERROR: can't interpret line $. $_";
>>   }
>> }
>> sub print_reads {
>>   my ( $fh, @reads ) = @_;
>>   for my $scarf ( @reads ) {
>>     my @stuff = split /:/,$scarf,12;
>>     print $fh '@',join(':',@stuff[0..9]),"\n$stuff[10]\n+\n$stuff[11]";
>>   }
>> }
>>
>> Joel
>>
>> On Wed, Feb 29, 2012 at 11:52 AM, George Hartzell wrote:
>>
>>> Fields, Christopher J writes:
>>>> Just want to say, if you can set up a local perl and local::lib it
>>>> makes your life a LOT easier. Particularly if you are running jobs
>>>> on older versions of RHEL, which notoriously stuck with
>>>> outdated/broken versions of perl (as well as other tools).
>>>> [...]
>>>
>>> And Perlbrew takes away your last excuse for not building perls and
>>> setting up local::lib's.
>>>
>>> http://perlbrew.pl/
>>>
>>> g.
>>>
>>
>>

From limericksean at gmail.com Thu May 24 15:23:54 2012
From: limericksean at gmail.com (Sean O'Keeffe)
Date: Thu, 24 May 2012 15:23:54 -0400
Subject: [Bioperl-l] fastq splitter - working but not before xmas!!
In-Reply-To:
References:
Message-ID:

Thanks Chris. I'd be happy to give it a whirl whenever it's ready.
I used a one liner grep command which got the job done:

grep -A 3 '1:[NY]:0' >
grep -A 3 '2:[NY]:0' >

Sean.

On 24 May 2012 15:05, Fields, Christopher J wrote:
> Sean,
>
> I have been working on a XS-based interface that basically just returns hashrefs, this uses Heng Li's kseq.h library. I can probably push this out to CPAN sometime in the next week or so. I did some initial (very rough) benchmarks and when using a simple count it parsed 30M reads in about 25-30 seconds.
>
> chris
>
> On May 16, 2012, at 1:05 PM, Sean O'Keeffe wrote:
>
>> So now I've got a bunch of fastq's all about 17GB in size. The script is
>> puttering away but this is tediously slow.
>> I tried the fastq-dump tool from sra toolkit but it didn't like my
>> commands (fastq-dump --split-files ) - my ignorance no
>> doubt.
>>
>> Any ideas out there on speeding up Bio::SeqIO::fastq output?
>> Thanks.
>>
>> On 1 March 2012 03:16, Joel Martin wrote:
>>
>>> Just a caution to double check that the read1 and read2 names match after
>>> splitting. I don't know if this thread jinxed me or what, but I just for
>>> the first time received a concatenated fastq file formatted as you
>>> describe, except the first read1 doesn't match the first read2. zut alors!
>>>
>>> came up with converting to scarf, /usr/bin/sort the scarf, then read that
>>> with tossing into single or paired files and reconverting to fastq in the
>>> process. it wasn't too bad, but I don't think bioperl has a scarf
>>> conversion, it's basically fastq with : substituted for \n. most
>>> delimiters that aren't : would work better but i already had a fastq2scarf
>>> from early solexa days ( i think ).
>>>
>>> # this was the last step, if it's handy for this plague of hideous files,
>>> # the fixed fields for : would need adjusting
>>> use strict;
>>>
>>> open( my $oph, '>', 'paired.fq' ) or die $!;
>>> open( my $osh, '>', 'single.fq' ) or die $!;
>>>
>>> my ( $pend, $pname, $pline );
>>>
>>> while ( <>) {
>>>   my ( $name, $end ) = /^(\S+)\s(\d)/;
>>>
>>>   if ( $end == 1 ) {
>>>     if ( $pend ) {
>>>       print_reads( $osh, $pline );
>>>     }
>>>     $pend = $end;
>>>     $pname = $name;
>>>     $pline = $_;
>>>   }
>>>   elsif ( $end == 2 ) {
>>>     my $fh = $pend == 1 && $pname eq $name ? $oph : $osh;
>>>     print_reads( $fh, $pline, $_ );
>>>     $pend = '';
>>>   }
>>>   else {
>>>     die "ERROR: can't interpret line $. $_";
>>>   }
>>> }
>>> sub print_reads {
>>>   my ( $fh, @reads ) = @_;
>>>   for my $scarf ( @reads ) {
>>>     my @stuff = split /:/,$scarf,12;
>>>     print $fh '@',join(':',@stuff[0..9]),"\n$stuff[10]\n+\n$stuff[11]";
>>>   }
>>> }
>>>
>>> Joel
>>>
>>> On Wed, Feb 29, 2012 at 11:52 AM, George Hartzell wrote:
>>>
>>>> Fields, Christopher J writes:
>>>>> Just want to say, if you can set up a local perl and local::lib it
>>>>> makes your life a LOT easier. Particularly if you are running jobs
>>>>> on older versions of RHEL, which notoriously stuck with
>>>>> outdated/broken versions of perl (as well as other tools).
>>>>> [...]
>>>>
>>>> And Perlbrew takes away your last excuse for not building perls and
>>>> setting up local::lib's.
>>>>
>>>> http://perlbrew.pl/
>>>>
>>>> g.
>>>>
>>>
>>>
>

From saladi1 at illinois.edu Tue May 29 14:11:23 2012
From: saladi1 at illinois.edu (Shyam Saladi)
Date: Tue, 29 May 2012 11:11:23 -0700
Subject: [Bioperl-l] Working with GenBank file
Message-ID:

Hi,

I want to extract certain genes from a genomic genbank file. I put
together the following, but get the error:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: asking for tag value that does not exist translation
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.1/Bio/Root/Root.pm:472
STACK: Bio::SeqFeature::Generic::get_tag_values /usr/local/share/perl/5.10.1/Bio/SeqFeature/Generic.pm:522
STACK: readfromgbk.pl:91
-----------------------------------------------------------

@cds_features = grep { $_->primary_tag eq 'CDS' }
    Bio::SeqIO->new(-file => $inFile)->next_seq->get_SeqFeatures;
my %gene_sequences = map { $_->get_tag_values('gene'),
                           $_->get_tag_values('translation') } @cds_features;

I think the error has to do with ->get_SeqFeatures, but don't understand
what exactly.

Could someone please advise?

Thanks very much,
Shyam

From saladi1 at illinois.edu Tue May 29 15:26:13 2012
From: saladi1 at illinois.edu (Shyam Saladi)
Date: Tue, 29 May 2012 12:26:13 -0700
Subject: [Bioperl-l] Working with GenBank file
In-Reply-To:
References:
Message-ID:

Hi,

As a followup, if I do the following:

print join(" ", $cds_features[0]->get_all_tags()) . "\n";

I get the following output:

GO_process codon_start db_xref function gene gene_synonym locus_tag product protein_id transl_table translation

which I think would suggest that doing $_->get_tag_values('translation')
should be valid.
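
[Editor's sketch, not part of the original message.] One possible reading, which the thread does not confirm: get_all_tags on $cds_features[0] only shows that the first CDS carries a translation tag, while the exception says get_tag_values was called on some feature without one (CDS features flagged as pseudogenes, for instance, usually have no /translation). Assuming that is the cause, guarding each feature with has_tag avoids the throw. MockFeature below is a hypothetical stand-in that mimics only has_tag and get_tag_values so the sketch runs without BioPerl; the gene names and sequence are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a BioPerl feature object. Only the two
# methods used below are mimicked; it is NOT the real implementation.
package MockFeature;

sub new {
    my ( $class, %tags ) = @_;
    return bless { tags => \%tags }, $class;
}

sub has_tag {
    my ( $self, $tag ) = @_;
    return exists $self->{tags}{$tag};
}

sub get_tag_values {
    my ( $self, $tag ) = @_;
    # Mirrors the behaviour reported in the thread: a missing tag is fatal.
    die "asking for tag value that does not exist $tag\n"
        unless exists $self->{tags}{$tag};
    return @{ $self->{tags}{$tag} };
}

package main;

# Build a gene => translation map, skipping features that lack either tag.
sub gene_translations {
    my @features = @_;
    my %seq_for;
    for my $f (@features) {
        next unless $f->has_tag('gene') && $f->has_tag('translation');
        my ($gene)        = $f->get_tag_values('gene');
        my ($translation) = $f->get_tag_values('translation');
        $seq_for{$gene} = $translation;
    }
    return %seq_for;
}

# Made-up example data: the second feature has no translation tag.
my @cds_features = (
    MockFeature->new( gene => ['thrA'],    translation => ['MRVLKF'] ),
    MockFeature->new( gene => ['pseudoX'] ),
);

my %gene_sequences = gene_translations(@cds_features);
print "$_ => $gene_sequences{$_}\n" for sort keys %gene_sequences;
```

With real BioPerl features the same loop should apply unchanged, since has_tag and get_tag_values are the method names already used in this thread; taking only the first value of each tag also keeps the hash from being mis-paired when a feature carries several values.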

Thanks,
Shyam

On Tue, May 29, 2012 at 11:11 AM, Shyam Saladi wrote:

> Hi,
>
> I want to extract certain genes from a genomic genbank file. I put
> together the following, but get the error:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: asking for tag value that does not exist translation
> STACK: Error::throw
> STACK: Bio::Root::Root::throw
> /usr/local/share/perl/5.10.1/Bio/Root/Root.pm:472
> STACK: Bio::SeqFeature::Generic::get_tag_values
> /usr/local/share/perl/5.10.1/Bio/SeqFeature/Generic.pm:522
> STACK: readfromgbk.pl:91
> -----------------------------------------------------------
>
> @cds_features = grep { $_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file =>
> $inFile)->next_seq->get_SeqFeatures;
> my %gene_sequences = map {$_->get_tag_values('gene'),
> $_->get_tag_values('translation')} @cds_features;
>
> I think the error has to do with ->get_SeqFeatures, but don't understand
> what exactly.
>
> Could someone please advise?
>
> Thanks very much,
> Shyam
>

From jason.stajich at gmail.com Tue May 29 17:14:32 2012
From: jason.stajich at gmail.com (Jason Stajich)
Date: Tue, 29 May 2012 14:14:32 -0700
Subject: [Bioperl-l] volunteers needed: HOWTO documentation improvements
Message-ID: <547710E5-9DAF-49E5-9FAA-BCDBD7880368@gmail.com>

Looking at some of the HOWTOs I think we could do a better job explaining
some things with more examples.

Anyone hit their head against things and wished there was more
descriptions or examples? Be great if you could help out the project and
contribute to this by suggesting places for more examples or providing
some of your own.

For example, I think more extensive description of how to use
Bio::DB::Fasta -- which is really the best module for sequence indexing
and retrieval could be added to this HOWTO:
http://bioperl.org/wiki/HOWTO:Local_Databases

More examples and explanations of problems like when the sequence lines
are uneven and how to fix it, how to use some of the dynamic callbacks to
extract the sequence IDs from complicated IDs e.g.
>gi|1234|gb|ABCD.1|ABCD

Being able to query on the GI number or the locus or the accession
number--- anyone want to put such a thing into the howto?

-Jason
--
Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org

From paolo.pavan at gmail.com Thu May 31 08:56:29 2012
From: paolo.pavan at gmail.com (Paolo Pavan)
Date: Thu, 31 May 2012 14:56:29 +0200
Subject: [Bioperl-l] Google groups bioperl-l mirror issue
Message-ID:

Hello everybody,

This message is just to point out that the google groups bioperl-l mirror
seems not to have received any threads for about a month.
Is there a reason for that which I have missed, or is it just an issue?

Regards,
Paolo

http://groups.google.com/group/bioperl-l

From cjfields at illinois.edu Thu May 31 10:12:47 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 31 May 2012 14:12:47 +0000
Subject: [Bioperl-l] volunteers needed: HOWTO documentation improvements
In-Reply-To: <547710E5-9DAF-49E5-9FAA-BCDBD7880368@gmail.com>
References: <547710E5-9DAF-49E5-9FAA-BCDBD7880368@gmail.com>
Message-ID:

+1. Much of this could be taken from the synopsis itself.

( note this is not me volunteering, my hands are full ATM :)

chris

On May 29, 2012, at 4:14 PM, Jason Stajich wrote:

> Looking at some of the HOWTOs I think we could do a better job explaining some things with more examples.
>
> Anyone hit their head against things and wished there was more descriptions or examples? Be great if you could help out the project and contribute to this by suggesting places for more examples or providing some of your own.
>
> For example, I think more extensive description of how to use Bio::DB::Fasta -- which is really the best module for sequence indexing and retrieval could be added to this HOWTO:
> http://bioperl.org/wiki/HOWTO:Local_Databases
>
> More examples and explanations of problems like when the sequence lines are uneven and how to fix it, how to use some of the dynamic callbacks to extract the sequence IDs from complicated IDs e.g.
> >gi|1234|gb|ABCD.1|ABCD
>
> Being able to query on the GI number or the locus or the accession number--- anyone want to put such a thing into the howto?
>
> -Jason
> --
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
>
>

From paolo.pavan at gmail.com Thu May 31 11:34:11 2012
From: paolo.pavan at gmail.com (Paolo Pavan)
Date: Thu, 31 May 2012 17:34:11 +0200
Subject: [Bioperl-l] Bio::Seq issue
In-Reply-To:
References:
Message-ID:

Hello everyone again,

By the way, I think I have encountered a minor issue in the method
Bio::Seq->is_circular(), which is defined in the file Bio/Seq.pm as a pure
getter:

sub is_circular { shift->primary_seq->is_circular }

while its counterpart in Bio/PrimarySeq.pm is defined as a getter/setter:

sub is_circular{
    my $self = shift;
    return $self->{'is_circular'} = shift if @_;
    return $self->{'is_circular'};
}

The result is that if you have a Bio::Seq object, for instance one read by
Bio::SeqIO, you can no longer change the property (well, unless you do
$seq->primary_seq->is_circular($is_circular) ).

Does anyone agree that they should have the same behaviour? If so, the
attached patch file applied to Bio/Seq.pm should do the job.

Best regards,
Paolo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Seq.pm.patch
Type: application/octet-stream
Size: 239 bytes
Desc: not available
URL:
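
[Editor's sketch, not part of the original message.] The attached patch was scrubbed from the archive, so its contents are unknown. The following is only one way the wrapper could forward its arguments so that both classes behave as getter/setters. MockPrimarySeq and MockSeq are hypothetical stand-ins for Bio::PrimarySeq and Bio::Seq, reduced to the methods quoted above so the sketch runs without BioPerl.

```perl
use strict;
use warnings;

# Stand-in for Bio::PrimarySeq: getter/setter as quoted in the message.
package MockPrimarySeq;

sub new {
    my $class = shift;
    return bless {}, $class;
}

sub is_circular {
    my $self = shift;
    return $self->{'is_circular'} = shift if @_;
    return $self->{'is_circular'};
}

# Stand-in for Bio::Seq: holds a primary sequence object and delegates.
package MockSeq;

sub new {
    my $class = shift;
    return bless { primary_seq => MockPrimarySeq->new }, $class;
}

sub primary_seq { return $_[0]->{primary_seq} }

# Forwarding @_ turns the pure getter into a getter/setter:
# with no argument it reads, with one argument it sets.
sub is_circular {
    my $self = shift;
    return $self->primary_seq->is_circular(@_);
}

package main;

my $seq = MockSeq->new;
$seq->is_circular(1);    # now reaches the wrapped object
print $seq->is_circular ? "circular\n" : "linear\n";
```

The design choice here is plain delegation: the wrapper keeps no state of its own, so the value set through either object stays consistent.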