From jdeuts01 at students.poly.edu Thu Dec 1 09:09:19 2011
From: jdeuts01 at students.poly.edu (jdeuts01 at students.poly.edu)
Date: Thu, 1 Dec 2011 14:09:19 +0000
Subject: [Bioperl-l] question
Message-ID:

Dear Bioperl,

This is my first experience with bioperl and I need help please.

1. The version of bioperl is 1.6.1 and have also installed Bundle-BioPerl 2.1.8 and mGen 1.03. I was unable to install Bribes and trouchelle DB. Will this prevent the BioPerl package from functioning correctly?

2. The operating platform is windows 7 - 64 bit using ActiveState - Perl v5.12.2

3. The script is as follows:

#!/usr/bin/perl
# Write a script using OOP to write protein sequences to the file fasta.txt.
use strict;
use warnings;
use Bio::SeqIO::fasta;

# Declare and initialize input and output files.
my $protein_fasta = "protein.fa";
my $protein_out = ">fasta.txt";

# Setup objects for input and output.
my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format => 'Fasta');
my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'fasta');

# Establish while loop using "next_seq" method
# to read in multiple sequences from "protein.fa" file
# one by one until none were remaining.
while(my $seq = $seq_in -> next_seq){
    $seq_out->write_seq($seq);
}

The information is successfully written to the file: fasta.txt.

4. Receiving the following error messages:

Replacement list is longer than search list at C:/Perl64/site/lib/Bio/Range.pm line 251.
Subroutine _initialize redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 92.
Subroutine next_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 112.
Subroutine write_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 193.
Subroutine width redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 272.
Subroutine preferred_id_type redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 295.
Thanks in advance for your help.
John Deutsch

From jboddu at illinois.edu Thu Dec 1 11:38:00 2011
From: jboddu at illinois.edu (Boddu, Jayanand)
Date: Thu, 1 Dec 2011 16:38:00 +0000
Subject: [Bioperl-l] Chromosome coordinates
Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>

Hello,

I am a newbie to Perl scripts. I have a file with short reads mapped to the MAIZE genome. The format is a simple BLASTN output.

READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2

As expected, the same read maps to multiple places on the same or different chromosomes. I have a GFF file with annotated coordinates. I would like to run a PERL script to find out which READS are within the GENES in the GFF file and which are not. The anticipated script should:

1. Take the READ coordinates on the genome (by chromosome);
2. Go to the GFF file;
3. Find the Chromosome;
4. Find the GENE (by coordinates);
5. and report the READ, its coordinates, the Chromosome, the GENE, and its coordinates.

It doesn't need to be in the same order. After this, I guess I could use a simple Microsoft ACCESS query to pull out READS that are not mapped to the GENEs. I would greatly appreciate it if anyone has a script that does more or less a similar job.
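The five steps listed above can be sketched in plain Perl, without any BioPerl modules. This is only a minimal illustration under stated assumptions: the GFF is taken to be the usual nine-column, tab-delimited layout with the feature type `gene` in column 3; "within" is taken to mean that the read lies entirely inside the gene; and the names `load_genes` and `genes_containing`, as well as the two GFF records, are made up for the example. Hits whose chromosome start is larger than the end (as in several rows of the table) are treated as minus-strand matches and their coordinates are swapped before comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse GFF text and index the 'gene' features by chromosome name.
sub load_genes {
    my ($gff_text) = @_;
    my %genes_by_chr;
    for my $line (split /\n/, $gff_text) {
        next if $line =~ /^#/ or $line !~ /\S/;    # skip comments and blank lines
        my ($chr, undef, $type, $start, $end, undef, undef, undef, $attr)
            = split /\t/, $line;
        next unless defined $type and $type eq 'gene';
        push @{ $genes_by_chr{$chr} },
            { start => $start, end => $end, attr => $attr // '' };
    }
    return \%genes_by_chr;
}

# Return every gene on $chr that fully contains the read interval.
sub genes_containing {
    my ($genes_by_chr, $chr, $s, $e) = @_;
    ($s, $e) = ($e, $s) if $s > $e;    # minus-strand hits list start > end
    return grep { $_->{start} <= $s and $e <= $_->{end} }
           @{ $genes_by_chr->{$chr} || [] };
}

# Hypothetical GFF records, invented for illustration only.
my $gff = join "\n",
    join("\t", 'chr5', 'example', 'gene', 32958000, 32959000, '.', '+', '.', 'ID=geneA'),
    join("\t", 'chr3', 'example', 'gene', 100,      200,      '.', '-', '.', 'ID=geneB');
my $genes = load_genes($gff);

# Two hits copied from the table above.
for my $hit (['READ2', 'chr5', 32958812, 32958828],
             ['READ2', 'chr3', 212451090, 212451074]) {
    my ($read, $chr, $s, $e) = @$hit;
    my @found = genes_containing($genes, $chr, $s, $e);
    if (@found) {
        print join("\t", $read, $chr, $s, $e, $_->{attr}, $_->{start}, $_->{end}), "\n"
            for @found;
    }
    else {
        print join("\t", $read, $chr, $s, $e, 'no_gene'), "\n";
    }
}
```

Reads that match no gene are reported with a `no_gene` marker, so a separate database query is not needed to find them. The linear scan over all genes on a chromosome is simple but slow; for a genome-sized GFF an indexed store or a sorted interval search is likely to be a better fit.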
Thanks
Jay

From scott at scottcain.net Thu Dec 1 11:59:56 2011
From: scott at scottcain.net (Scott Cain)
Date: Thu, 1 Dec 2011 11:59:56 -0500
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
Message-ID:

Hi Jay,

Since the maize GFF file is likely to be fairly large, I would consider putting it in a database, using either Bio::DB::GFF if it is GFF2 or Bio::DB::SeqFeature::Store if it is GFF3. Then you can use the methods that come along with either of those modules to search regions for genes. They both support a get_features_by_location method, so you could get the range for each of the regions you want to look at, and check the database with that method to see if anything is there.

Scott

On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand wrote:

> Hello
> I am newbie to Perl scripts.
> I have a file with short reads mapped to the MAIZE genome
> The format is a simple BLASTN output.
>
> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>
> As expected the same read maps to multiple places on the same/different
> chromosome.
> I have a GFF file with annotated coordinates.
> I would like to run a PERL script to find out READS that are within the
> GENES in the GFF file and that are not.
> The anticipated script should;
>
> 1. Take the READ coordinates on the genome (by chromosome);
>
> 2. Go the GFF file;
>
> 3. Find the Chromosome;
>
> 4. Find the GENE (by coordinates);
>
> 5.
and report READ-its coordinates-Chromosome-GENE-and its
> coordinates.
>
> It doesn't need to be in the same order.
> After this, I guess I could use simple Microsoft ACCESS query to pull out
> READS that are not mapped to the GENEs.
> I would greatly appreciate if anyone can has a script that more or less
> similar job.
>
> Thanks
> Jay
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                        scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)       216-392-3087
Ontario Institute for Cancer Research

From jason.stajich at gmail.com Thu Dec 1 12:31:29 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 1 Dec 2011 09:31:29 -0800
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To:
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
Message-ID: <470A08F7-E238-44E1-B0D1-69FDBC0BA2A3@gmail.com>

You might try BEDTools: its intersectBed program can do a lot of what you are describing as a simple command-line program.

Jason

On Dec 1, 2011, at 8:59 AM, Scott Cain wrote:

> Hi Jay,
>
> Since the maize GFF file is likely to be fairly large, I would consider
> putting it in a database, using either Bio::DB::GFF if it is GFF2 or
> Bio::DB::SeqFeature::Store if it is gff3. Then you can use the methods
> that come along with either of those modules to search regions for for
> genes. They both support a get_features_by_location method, so you could
> get the range for each of the regions you want to look at, and check the
> database with that method to see if anything is there.
>
> Scott
>
>
> On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand wrote:
>
>> Hello
>> I am newbie to Perl scripts.
>> I have a file with short reads mapped to the MAIZE genome
>> The format is a simple BLASTN output.
>>
>> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
>> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
>> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
>> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
>> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
>> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
>> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
>> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
>> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
>> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
>> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
>> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>>
>> As expected the same read maps to multiple places on the same/different
>> chromosome.
>> I have a GFF file with annotated coordinates.
>> I would like to run a PERL script to find out READS that are within the
>> GENES in the GFF file and that are not.
>> The anticipated script should;
>>
>> 1. Take the READ coordinates on the genome (by chromosome);
>>
>> 2. Go the GFF file;
>>
>> 3. Find the Chromosome;
>>
>> 4. Find the GENE (by coordinates);
>>
>> 5. and report READ-its coordinates-Chromosome-GENE-and its
>> coordinates.
>>
>> It doesn't need to be in the same order.
>> After this, I guess I could use simple Microsoft ACCESS query to pull out
>> READS that are not mapped to the GENEs.
>> I would greatly appreciate if anyone can has a script that more or less
>> similar job.
>>
>> Thanks
>> Jay
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>
>
> --
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                        scott at scottcain dot net
> GMOD Coordinator (http://gmod.org/)       216-392-3087
> Ontario Institute for Cancer Research
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From jovel_juan at hotmail.com Thu Dec 1 12:36:32 2011
From: jovel_juan at hotmail.com (Juan Jovel)
Date: Thu, 1 Dec 2011 17:36:32 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To:
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
Message-ID:

Hello Everybody!

I have been running endless standalone BLASTs these days, on two different clusters. The first one runs on Rock Linux, while the other one runs on Debian. On the Rock Linux one, when I use SearchIO to parse my blastp or blastx reports, I invariably get this message:

"Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"

What does it mean?
Would it have any effect on my parsing results?

Thanks,
JUAN

From cjfields at illinois.edu Thu Dec 1 14:03:45 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Dec 2011 19:03:45 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To:
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
Message-ID: <7233AC75-E03B-401A-8D0A-260ED76956A4@illinois.edu>

On Dec 1, 2011, at 11:36 AM, Juan Jovel wrote:

> Hello Everybody!
> I have been running endless standalone Blasts these days, in two different clusters. The first one runs on Rock Linux, while the other one on Devian. In the Rock Linux one, when I use SeachIO to parse my blastp or blastx reports, I get invariably a message:
> "Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
> What it does mean? Would it have any effect on my parsing results?
> Thanks,
> JUAN

This is a bug that has been fixed; I think the fix is in the latest BioPerl release (1.6.901). A transliteration error produced this warning, but otherwise it's harmless. The warning is Perl-version dependent; I think it pops up in Perl 5.12 and up.

This user post (about halfway down) runs into the same issue: http://www.sysarchitects.com/bioperl.

chris

From David.Messina at sbc.su.se Thu Dec 1 17:02:20 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Thu, 1 Dec 2011 23:02:20 +0100
Subject: [Bioperl-l] re trieving blast multiple alignment in fasta form
In-Reply-To: <32886592.post@talk.nabble.com>
References: <32886592.post@talk.nabble.com>
Message-ID:

Hi Eric,

Wait, do you want multiple pairwise alignments in your output FASTA file, or a single multiple alignment of your query and all the hits?

If the former, get_aln() will give you one pairwise alignment per HSP, but you'll need to move the output file creation statement (my $alnIO = ...) before the loops so it gets created only once.
Then, when you do the write statement ($alnIO->write_aln($aln);), all of the alignments will go to the same file.

If on the other hand you'd like to have a multiple alignment between a query and all of its hits, you'll have to take the IDs of the hits, pull the corresponding sequences out of the database, and then run a multiple alignment algorithm on them.

Dave

From scuoppo at gmail.com Fri Dec 2 17:50:28 2011
From: scuoppo at gmail.com (Claudio Scuoppo)
Date: Fri, 2 Dec 2011 17:50:28 -0500
Subject: [Bioperl-l] List of genes from genomic intervals
Message-ID:

Hi,

I am new to BioPerl. I was wondering what's the best strategy to get the genes contained in a series of human genomic intervals. Basically, I have a table with:

Chromosome Start End

Which module should I be looking at?

Thanks,
Claudio

From awitney at sgul.ac.uk Mon Dec 5 06:09:39 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 5 Dec 2011 11:09:39 +0000
Subject: [Bioperl-l] Bio::Graphics imagemap and padding
Message-ID: <44A27378-CC4B-4396-817E-AA31004847C7@sgul.ac.uk>

Hi,

Image maps seem to be out of position if you use padding in the Panel, like this:

my $panel = Bio::Graphics::Panel->new(
    ...
    -pad_left => 20,
    -pad_right => 20
    ...
);

Without these options, the image map is fine. Is this a known issue?

Also on a side note, I noticed that when using Bio::Graphics with Dancer, some of the CGI code was blocking somewhere (I found a reference to a similar problem with CGI and Catalyst); swapping CGI with HTML::Entities fixes it:

sub create_web_map {
    ...
    eval "require HTML::Entities" unless HTML::Entities->can('encode_entities');
    ...
    my $title = HTML::Entities::encode_entities($self->make_link($tr,$feature,1));
    my $target = HTML::Entities::encode_entities($self->make_link($tgr,$feature,1));
    ...
}

Thanks

Adam

From momin.amin at gmail.com Mon Dec 5 18:00:23 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Mon, 5 Dec 2011 15:00:23 -0800 (PST)
Subject: [Bioperl-l] SimpleAlign and consensus_string
Message-ID:

Hi,

I am generating a consensus sequence by aligning two protein homologs using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to understand the criteria that the consensus_string() method of SimpleAlign uses to determine the consensus at positions with dissimilar amino acids/nucleotides. Also, how would the % cutoffs provided to consensus_string() affect the outcome?

Thanks,
Amin

From jason.stajich at gmail.com Mon Dec 5 18:58:59 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Mon, 5 Dec 2011 15:58:59 -0800
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To:
References:
Message-ID: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>

There are several methods that do related things.

Below is the documentation for the method. It allows stricter calling than majority-rule voting for the position: if you set the cutoff at 50%, then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string, which is useful for generating consensus patterns for DNA.

If you want a summary of whether or not there are only conservative amino acid changes in a column, use the match_line method, which generates a string similar to the one CLUSTALW reports as a summary for each column.

=head2 consensus_string

 Title     : consensus_string
 Usage     : $str = $ali->consensus_string($threshold_percent)
 Function  : Makes a strict consensus
 Returns   : Consensus string
 Argument  : Optional threshold ranging from 0 to 100.
             The consensus residue has to appear at least threshold %
             of the sequences at a given location, otherwise a '?'
             character will be placed at that location.
             (Default value = 0%)

=cut

On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:

> Hi ,
>
> I am generating a consensus sequence by aligning two protein homologs
> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
> understand the criteria consensus_string() method of simpleAlign uses
> to determine the consensus at position with dissimilar aminoacids/
> nucleotide. Also how would the % cutoffs provided to
> consensus_string() affect the outcome.
>
>
> Thanks,
> Amin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From wenbinmei at gmail.com Tue Dec 6 11:09:35 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 11:09:35 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
Message-ID:

Hi,

I have a question about reverse-complementing (revcom) a multiple sequence alignment. One way I can do it is to convert the format into FASTA and revcom the individual sequences. I wonder whether there is an easy way to revcom the multiple sequence alignment as a whole. Thank you for your help.

-best,
wenbin

From jason.stajich at gmail.com Tue Dec 6 12:40:37 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Tue, 6 Dec 2011 09:40:37 -0800
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To:
References:
Message-ID:

I think this would work to update it in place, though I haven't tried it myself:

for my $seq ( $aln->each_seq ) {
    $seq->seq( $seq->revcom->seq );
}
$out->write_aln($aln);

This may also work - not entirely sure if there is any extra work done on the metadata (start/end) of the Seq object when this is done. You may want to flip start/end for the sequences (the seqs are Bio::LocatableSeq objects) explicitly if not. Or you may not care about those data and can ignore them.

$seq = $seq->revcom

Jason

On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:

> Hi,
>
> I have a question about revcom the multiple sequence alignment.
One way I
> can do convert the format into fasta and revcom individual sequences. I
> wonder is there a easy way to convert the multiple sequence alignment as a
> whole. Thank you for help.
>
> -best,
> wenbin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From wenbinmei at gmail.com Tue Dec 6 12:51:18 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 12:51:18 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To:
References:
Message-ID:

I think I might not have explained my question clearly. I extract the individual gene alignments from the whole-genome alignment. Since some genes are on the reverse strand, I want to revcom those gene alignments. Here is part of my script; I can read the strand information from another file.

my $newstart = $refseq->column_from_residue_number($start);
my $newend = $refseq->column_from_residue_number($end);
$seq{$genename} = $aln->slice($newstart, $newend);

Any suggestion to help me revcom the gene alignments on the minus strand would be helpful. Thank you.

-best,
wenbin

On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich wrote:

> I think this would work to update it in place though I haven't tried it
> myself
>
> for my $seq ( $aln->each_seq ) {
> $seq->seq( $seq->revcom->seq );
> }
> $out->write_aln($aln);
>
> This may also work - not entirely sure if there is any extra work done on
> the meta data (start/end) of the Seq object when this is done. You may
> want to flip start/end for the sequences (the seqs are Bio::LocatableSeq
> objects) explicitly if not. Or you may not care about those data and can
> ignore.
>
> $seq = $seq->revcom
>
> Jason
> On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:
>
> > Hi,
> >
> > I have a question about revcom the multiple sequence alignment. One way I
> > can do convert the format into fasta and revcom individual sequences.
I
> > wonder is there a easy way to convert the multiple sequence alignment as a
> > whole. Thank you for help.
> >
> > -best,
> > wenbin
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

--
Wenbin Mei
Ph.D. Student
Dr. Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu

From kellert at ohsu.edu Tue Dec 6 13:21:39 2011
From: kellert at ohsu.edu (Tom Keller)
Date: Tue, 6 Dec 2011 10:21:39 -0800
Subject: [Bioperl-l] Bioperl-l Digest, Vol 104, Issue 3
In-Reply-To:
References:
Message-ID:

I'd start by looking for the section "Searching for genes in genomic DNA" in the HOWTO:Beginners - BioPerl website.

Thomas (Tom) Keller, PhD
kellert at ohsu.edu
503.494.2442
6588 R Jones Hall (BSc/CROET)
MMI DNA Services
Member of OHSU Shared Resources

On Dec 3, 2011, at 9:00 AM, wrote:

> Send Bioperl-l mailing list submissions to
> bioperl-l at lists.open-bio.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> or, via email, send a message with subject or body 'help' to
> bioperl-l-request at lists.open-bio.org
>
> You can reach the person managing the list at
> bioperl-l-owner at lists.open-bio.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Bioperl-l digest..."
>
>
> Today's Topics:
>
> 1. List of genes from genomic intervals (Claudio Scuoppo)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 2 Dec 2011 17:50:28 -0500
> From: Claudio Scuoppo
> Subject: [Bioperl-l] List of genes from genomic intervals
> To: bioperl-l at lists.open-bio.org
> Message-ID:
>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi,
>
> I am new to BioPerl. I was wondering what`s the best strategy to get
> the genes contained in a a series of human genomic interval.
> Basically, I have a table with:
>
> Chromosome Start End
>
> Which module should I be looking at?
> Thanks,
> Claudio
>
>
> ------------------------------
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> End of Bioperl-l Digest, Vol 104, Issue 3
> *****************************************

From wenbinmei at gmail.com Tue Dec 6 17:54:51 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 17:54:51 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To:
References:
Message-ID:

Figured out! Thanks for help.

-best,
wenbin

On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich wrote:

> I think this would work to update it in place though I haven't tried it
> myself
>
> for my $seq ( $aln->each_seq ) {
> $seq->seq( $seq->revcom->seq );
> }
> $out->write_aln($aln);
>
> This may also work - not entirely sure if there is any extra work done on
> the meta data (start/end) of the Seq object when this is done. You may
> want to flip start/end for the sequences (the seqs are Bio::LocatableSeq
> objects) explicitly if not. Or you may not care about those data and can
> ignore.
>
> $seq = $seq->revcom
>
> Jason
> On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:
>
> > Hi,
> >
> > I have a question about revcom the multiple sequence alignment. One way I
> > can do convert the format into fasta and revcom individual sequences. I
> > wonder is there a easy way to convert the multiple sequence alignment as a
> > whole. Thank you for help.
> >
> > -best,
> > wenbin
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

--
Wenbin Mei
Ph.D. Student
Dr.
Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu

From momin.amin at gmail.com Tue Dec 6 12:37:16 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Tue, 6 Dec 2011 11:37:16 -0600
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
References: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
Message-ID:

Thanks Jason

On Mon, Dec 5, 2011 at 5:58 PM, Jason Stajich wrote:

> There are several methods that do related things.
>
> Below is the documentation for the method - it allows a more strict calling besides majority rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string which is useful for DNA consensus generating patterns.
>
> If you want a summary as to whether or not there are only conservative amino acid changes in the column use match_line method which generates the similar string that CLUSTALW reports as a summary string for each column.
>
> =head2 consensus_string
>
>  Title     : consensus_string
>  Usage     : $str = $ali->consensus_string($threshold_percent)
>  Function  : Makes a strict consensus
>  Returns   : Consensus string
>  Argument  : Optional treshold ranging from 0 to 100.
>              The consensus residue has to appear at least threshold %
>              of the sequences at a given location, otherwise a '?'
>              character will be placed at that location.
>              (Default value = 0%)
>
> =cut
>
> On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:
>
>> Hi ,
>>
>> I am generating a consensus sequence by aligning two protein homologs
>> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
>> understand the criteria consensus_string() method of simpleAlign uses
>> to determine the consensus at position with dissimilar aminoacids/
>> nucleotide.
Also how would the % cutoffs provided to
>> consensus_string() affect the outcome.
>>
>>
>> Thanks,
>> Amin
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From sunwukong at potc.net Wed Dec 7 14:05:20 2011
From: sunwukong at potc.net (sunwukong)
Date: Wed, 07 Dec 2011 11:05:20 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
Message-ID: <4EDFB8F0.8080001@potc.net>

I am not a medical professional but I have two DNA related questions.

A year or so ago I realized that if the standard building blocks of life were the amino acids GATC then they could be represented as a base 4 number system (e.g., 0,1,2 and 3). Then any life form could be represented by a number (it would be very long). So I set out on a quest to do this with a small life form. For fun I chose the Spanish Flu which I believe I found on an NIH site. Then I set out and realized that there was no standard. And I did not know if the number would be built with the most significant digit on the left or right.

1. Is there a standard method for representing the ATCD molecules as numbers
   g = 0 a = 1 t = 2 c = 3

2. is the sequence read left to right or right to left?

note: It may be biologically significant if the right values are assigned to the letters GATC, there could be a pattern somewhere that holds significant information. One idea might be to look at DNA sequences in bases other than 4 to see if something jumps out.
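The base-4 idea described above can be tried in a few lines of Perl. This is only a sketch under stated assumptions: the digit assignment g = 0, a = 1, t = 2, c = 3 is the one proposed in this message and is not a standard (the four letters are nucleotide bases, and any assignment of digits to them is arbitrary); the leftmost base is treated as the most significant digit, which matches the convention of writing sequences left to right from the 5' end but is otherwise an arbitrary choice; and the function name `dna_to_number` is invented for the example. Math::BigInt is used because the number outgrows a native integer after a few dozen bases.

```perl
use strict;
use warnings;
use Math::BigInt;

# Digit assignment taken from the message above; it is not a standard.
my %digit = (g => 0, a => 1, t => 2, c => 3);

# Read the sequence left to right, treating the leftmost base as the
# most significant base-4 digit.
sub dna_to_number {
    my ($dna) = @_;
    my $n = Math::BigInt->new(0);
    for my $base (split //, lc $dna) {
        die "unexpected base '$base'\n" unless exists $digit{$base};
        $n->bmul(4)->badd($digit{$base});    # n = n * 4 + digit
    }
    return $n;
}

print dna_to_number('GATC'), "\n";    # prints 27 (0*64 + 1*16 + 2*4 + 3)
```

One consequence of this scheme is that leading bases assigned to 0 vanish: 'GATC' and 'GGATC' both give 27, so the sequence length would have to be stored alongside the number to recover the original sequence.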
http://www.insectscience.org/2.10/ref/fig5a.gif

VR
Pat Kirol
509 442-2214

From Russell.Smithies at agresearch.co.nz Wed Dec 7 16:59:18 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 10:59:18 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <4EDFB8F0.8080001@potc.net>
References: <4EDFB8F0.8080001@potc.net>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>

I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve.

I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions? Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png

Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes?

But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery.

But don't let this stop you uncovering the great secret hidden in our genes :-)

On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of sunwukong
> Sent: Thursday, 8 December 2011 8:05 a.m.
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] DNA Sequencing two questions
>
> I am not a medical professional but I have two DNA related questions.
>
> A year or so ago I realized that if the standard building blocks of life were the
> amino acids GATC then they could be represented as a base 4 number
> system (e.g., 0,1,2 and 3). Then any life form could be represented by a
> number (it would be very long). So I set out on a quest to do this with a small
> life form. For fun I chose the Spanish Flu which I believe I found on an NIH
> site. Then I set out and realized that there was no standard. And I did not
> know if the number would be built with the most significant digit on the left
> or right.
>
> 1. Is there a standard method for representing the ATCD molecules as
> numbers g = 0 a = 1 t = 2 c = 3
>
> 2. is the sequence read left to right or right to left?
>
> note: It may be biologically significant if the right values are assigned to the
> letters GATC, there could be a pattern somewhere that holds significant
> information. One idea might be to look at DNA sequences in bases other
> than 4 to see if something jumps out.
>
> http://www.insectscience.org/2.10/ref/fig5a.gif
>
> VR
> Pat Kirol
> 509 442-2214
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
======================================================================= From jason.stajich at gmail.com Wed Dec 7 17:53:10 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 7 Dec 2011 14:53:10 -0800 Subject: [Bioperl-l] DNA Sequencing two questions In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> References: <4EDFB8F0.8080001@potc.net> <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> Message-ID: <9BDA2BFA-264F-45B2-9234-A6BC9402FBA8@gmail.com> For other fun picture games -- You can look at patterns of motifs/words in a chaos game representation of genomes. http://mbe.oxfordjournals.org/content/16/10/1391.long http://mbe.oxfordjournals.org/content/20/6/901.long On Dec 7, 2011, at 1:59 PM, Smithies, Russell wrote: > I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve. > I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions? Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png > Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes? > > But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery. 
> > But don't let this stop you uncovering the great secret hidden in our genes :-) > > On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html > > --Russell > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of sunwukong >> Sent: Thursday, 8 December 2011 8:05 a.m. >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] DNA Sequencing two questions >> >> I am not a medical professional but I have two DNA related questions. >> >> A year or so ago I realized that if the standard building blocks of life were the >> amino acids GATC then they could be represented as a base 4 number >> system (e.g., 0,1,2 and 3). Then any life form could be represented by a >> number (it would be very long). So I set out on a quest to do this with a small >> life form. For fun I chose the Spanish Flu which I believe I found on an NIH >> site. Then I set out and realized that there was no standard. And I did not >> know if the number would be built with the most significant digit on the left >> or right. >> >> 1. Is there a standard method for representing the ATCD molecules as >> numbers g = 0 a = 1 t = 2 c = 3 >> >> 2. is the sequence read left to right or right to left? >> >> note: It may be biologically significant if the right values are assigned to the >> letters GATC, there could be a pattern somewhere that holds significant >> information. One idea might be to look at DNA sequences in bases other >> than 4 to see if something jumps out. 
>> >> http://www.insectscience.org/2.10/ref/fig5a.gif >> >> VR >> Pat Kirol >> 509 442-2214 >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. > ======================================================================= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Russell.Smithies at agresearch.co.nz Wed Dec 7 19:29:47 2011 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Thu, 8 Dec 2011 13:29:47 +1300 Subject: [Bioperl-l] DNA Sequencing two questions In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> References: <4EDFB8F0.8080001@potc.net> <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF245@exchsth.agresearch.co.nz> I tried again and came up with this: http://www.bioperl.org/w/images/7/7a/Autostereogram.png If you look carefully, you can see the answer to life, the universe, and everything!! --Russell > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Smithies, Russell > Sent: Thursday, 8 December 2011 10:59 a.m. 
> To: 'sunwukong'; bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] DNA Sequencing two questions > > I did something similar a few years ago (after watching the movie "Contact" I > think) and encoded codons as RGB values and drew an image of a genome. > Looked much like random noise but I might try it again and draw as a space > filling curve. > I guess if you're looking for "hidden messages", why restrict yourself to 2 > dimensions? Perhaps something pops out as a single-image stereogram eg. > http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Ra > ndom_Dot_Shark.png > Perhaps it's a 3D "object" represented by slices drawn in a series of 2D > planes? > > But you need a bit of biological background as there will be patterns simply > because of the way genes "work" and are laid out in chromosomes. You > need to remember that DNA is effectively a 2D representation of a 3D > protein structure and there is already much hidden information we know we > don't understand - a "simple" task like how proteins fold is barely understood > and why some become prions is still a mystery. > > But don't let this stop you uncovering the great secret hidden in our genes :-) > > On a similar note, have a look at http://medgadget.com/2011/10/send-your- > secret-message-hidden-in-bacteria.html > > --Russell > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of sunwukong > > Sent: Thursday, 8 December 2011 8:05 a.m. > > To: bioperl-l at bioperl.org > > Subject: [Bioperl-l] DNA Sequencing two questions > > > > I am not a medical professional but I have two DNA related questions. > > > > A year or so ago I realized that if the standard building blocks of > > life were the amino acids GATC then they could be represented as a > > base 4 number system (e.g., 0,1,2 and 3). Then any life form could be > > represented by a number (it would be very long). 
So I set out on a > > quest to do this with a small life form. For fun I chose the Spanish > > Flu which I believe I found on an NIH site. Then I set out and > > realized that there was no standard. And I did not know if the number > > would be built with the most significant digit on the left or right. > > > > 1. Is there a standard method for representing the ATCD molecules as > > numbers g = 0 a = 1 t = 2 c = 3 > > > > 2. is the sequence read left to right or right to left? > > > > note: It may be biologically significant if the right values are > > assigned to the letters GATC, there could be a pattern somewhere that > > holds significant information. One idea might be to look at DNA > > sequences in bases other than 4 to see if something jumps out. > > > > http://www.insectscience.org/2.10/ref/fig5a.gif > > > > VR > > Pat Kirol > > 509 442-2214 > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > ========================================================== > ============= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities to which > it is addressed and may contain confidential and/or privileged material. Any > review, retransmission, dissemination or other use of, or taking of any action > in reliance upon, this information by persons or entities other than the > intended recipients is prohibited by AgResearch Limited. If you have received > this message in error, please notify the sender immediately. 
> ========================================================== > ============= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From lumos.lumos.lumos at gmail.com Fri Dec 9 11:47:36 2011 From: lumos.lumos.lumos at gmail.com (lumos lumos) Date: Fri, 9 Dec 2011 08:47:36 -0800 Subject: [Bioperl-l] Mouse->Human homologues ? Message-ID: Hello, Is there a way to get human homologues for a mouse gene list where I get all human genes(symbols) as text output ? Thank you LM From cjfields at illinois.edu Fri Dec 9 12:17:20 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 9 Dec 2011 17:17:20 +0000 Subject: [Bioperl-l] Mouse->Human homologues ? In-Reply-To: References: Message-ID: There are lots of databases that have this capability (ensembl, orthodb, homologene, oma, to name only a few). Have you tried a simple search for this, or did you want expert opinion on the matter? chris PS - Just to note, there is a lot of controversy swirling about re: the ortholog conjecture and some recently published papers calling it into question using human-mouse data, worth a look if you're trotting this path to know the current situation. If you have access to F1000, see the following (paper itself is open :) Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. F1000.com/12462957 On Dec 9, 2011, at 10:47 AM, lumos lumos wrote: > Hello, > > Is there a way to get human homologues for a mouse gene list where I get > all human genes(symbols) as text output ? 
> > Thank you > LM > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From lumos.lumos.lumos at gmail.com Fri Dec 9 12:29:24 2011 From: lumos.lumos.lumos at gmail.com (lumos lumos) Date: Fri, 9 Dec 2011 09:29:24 -0800 Subject: [Bioperl-l] Mouse->Human homologues ? In-Reply-To: References: Message-ID: Hi Chris, Thanks for your reply. I wanted to know if there is anyway you can do it via script/automatically in perl for a list of mouse genes whose human homologues I require. LM On Fri, Dec 9, 2011 at 9:17 AM, Fields, Christopher J wrote: > There are lots of databases that have this capability (ensembl, orthodb, > homologene, oma, to name only a few). Have you tried a simple search for > this, or did you want expert opinion on the matter? > > chris > > PS - Just to note, there is a lot of controversy swirling about re: the > ortholog conjecture and some recently published papers calling it into > question using human-mouse data, worth a look if you're trotting this path > to know the current situation. If you have access to F1000, see the > following (paper itself is open :) > > Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. > Testing the ortholog conjecture with comparative functional genomic data > from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: > 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. > F1000.com/12462957 > > On Dec 9, 2011, at 10:47 AM, lumos lumos wrote: > > > Hello, > > > > Is there a way to get human homologues for a mouse gene list where I get > > all human genes(symbols) as text output ? 
> > > > Thank you > > LM > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From lumos.lumos.lumos at gmail.com Wed Dec 7 23:47:19 2011 From: lumos.lumos.lumos at gmail.com (lumos lumos) Date: Wed, 7 Dec 2011 20:47:19 -0800 Subject: [Bioperl-l] Perl parsing Message-ID: Hello, I have a text file(tab-delim) with some gene names as shown below. *BRCA1: breast cancer 1, early onset TNF: tumor necrosis factor OMG: oligodendrocyte myelin glycoprotein* I would like to get the list of gene name BRCA1,TNF,OMG that is before the colon(:) . How do I parse in perl this text file with this list of genes? Thanks in advance. LM From b.m.forde at umail.ucc.ie Fri Dec 9 11:52:56 2011 From: b.m.forde at umail.ucc.ie (BForde) Date: Fri, 9 Dec 2011 08:52:56 -0800 (PST) Subject: [Bioperl-l] Genbank files Message-ID: <32941955.post@talk.nabble.com> Hello all, I am new to Bioperl so I apologise if this is stupid question. For CDS features I which to add additional qualifiers e.g. /colour and /note qualifiers. I have looked at the BioPerl wiki but am still unsure as how to do this? regards Brian -- View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From jboddu at illinois.edu Fri Dec 9 14:59:39 2011 From: jboddu at illinois.edu (Boddu, Jayanand) Date: Fri, 9 Dec 2011 19:59:39 +0000 Subject: [Bioperl-l] Batch processing of Data Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu> Hi Anyone: Please let me know if the following is practical with PERL. My data output can be described as following. 1. Hundreds of samples are run. 2. A batch output sends data from each sample to its own "folder". Output is in the form of few text files, spreadsheets and PDF files. 3. One of the spreadsheet has the data of most interest. 4. 
This means I end up having hundreds of folders. 5. The spreadsheet with the data has multiple worksheets out of which a couple have the interesting data to be processed (Please find attached a spreadsheet output in which the data is organized and the worksheets of my interest are named as "Compound" and "Peak". Yellow high-lighted columns in each worksheet has the data to be processed). OK. That's long description. NOW. Is it practical to write a PERL/or any script to; 1. Enter each folder. 2. Look for the spreadsheet of interest. 3. Look for worksheets named "Compound" and "Peak". 4. Look for the specific columns of interest. 5. Copy paste the columns of interest into a new spreadsheet/text file with data from each folder next to each other. This final spreadsheet will pass through a bunch of other calculations. I apologize for this long and painful description. However, it would be great if this can be done. Thanks Jay -------------- next part -------------- A non-text attachment was scrubbed... Name: REPORT01.xls Type: application/vnd.ms-excel Size: 93696 bytes Desc: REPORT01.xls URL: From cjfields at illinois.edu Fri Dec 9 15:37:48 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 9 Dec 2011 20:37:48 +0000 Subject: [Bioperl-l] Perl parsing In-Reply-To: References: Message-ID: On Dec 7, 2011, at 10:47 PM, lumos lumos wrote: > Hello, > > I have a text file(tab-delim) with some gene names as shown below. > > *BRCA1: breast cancer 1, early onset > > TNF: tumor necrosis factor > > OMG: oligodendrocyte myelin glycoprotein* > > I would like to get the list of gene name BRCA1,TNF,OMG that is before the > colon(:) . > How do I parse in perl this text file with this list of genes? 'Very carefully?' Okay, I'll try to refrain from further sarcasm, but I'm confused, what does this have to do with BioPerl (*the toolkit*) specifically? That is what this mailing list is for. Just to note, this is a very common perl task. 
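For the three sample lines quoted above, a minimal Perl sketch of the task might look like the following. It assumes each gene symbol is a single whitespace-free token sitting before the first colon, and that the stray asterisks in the sample are formatting residue to be skipped. The sub name gene_symbols is made up for illustration and is not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper: return the token before the first colon on each line.
sub gene_symbols {
    my @symbols;
    for my $line (@_) {
        # Skip leading whitespace and asterisks, then capture everything
        # up to (but not including) the first colon or whitespace.
        if ($line =~ /^[\s*]*([^:\s]+)\s*:/) {
            push @symbols, $1;
        }
    }
    return @symbols;
}

my @lines = (
    "*BRCA1: breast cancer 1, early onset",
    "",
    "TNF: tumor necrosis factor",
    "OMG: oligodendrocyte myelin glycoprotein*",
);

# prints BRCA1,TNF,OMG
print join(",", gene_symbols(@lines)), "\n";
```

To work from a file rather than a hard-coded list, the same sub could be passed the lines read from an opened filehandle.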
The answer is attainable by searching for it (not to mention taking the time to learn basic perl). For instance: http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings One of the many links found by simply using Google: http://lmgtfy.com/?q=perl+parse+tab+file I'll leave the regex munging to you. (okay, I failed at refraining from sarcasm, ah well it's friday). chris > Thanks in advance. > LM From jason.stajich at gmail.com Fri Dec 9 16:18:38 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Fri, 9 Dec 2011 13:18:38 -0800 Subject: [Bioperl-l] Genbank files In-Reply-To: <32941955.post@talk.nabble.com> References: <32941955.post@talk.nabble.com> Message-ID: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> $feature->add_tag_value('color','blue'); On Dec 9, 2011, at 8:52 AM, BForde wrote: > > Hello all, > > I am new to Bioperl so I apologise if this is stupid question. > > For CDS features I which to add additional qualifiers e.g. /colour and /note > qualifiers. I have looked at the BioPerl wiki but am still unsure as how to > do this? > > regards > > Brian > -- > View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From bosborne11 at verizon.net Fri Dec 9 15:31:15 2011 From: bosborne11 at verizon.net (Brian Osborne) Date: Fri, 09 Dec 2011 15:31:15 -0500 Subject: [Bioperl-l] Genbank files In-Reply-To: <32941955.post@talk.nabble.com> References: <32941955.post@talk.nabble.com> Message-ID: <3AB0716B-EC07-4BD6-9FC8-0C47A29FC0BA@verizon.net> Brian, Reasonable question. 
Start here:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

If you've never used Bioperl then:

http://www.bioperl.org/wiki/HOWTO:Beginners

Brian

On Dec 9, 2011, at 11:52 AM, BForde wrote:

>
> Hello all,
>
> I am new to Bioperl so I apologise if this is stupid question.
>
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
>
> regards
>
> Brian
> --
> View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From asjo at koldfront.dk Fri Dec 9 17:25:00 2011
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Fri, 09 Dec 2011 23:25:00 +0100
Subject: [Bioperl-l] Batch processing of Data
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <871usdpemb.fsf@topper.koldfront.dk>

On Fri, 9 Dec 2011 19:59:39 +0000, Boddu, wrote:

> Please let me know if the following is practical with PERL.

It might very well be, yes.

Modules you might be interested in include Spreadsheet::ParseExcel,
Spreadsheet::XLSX, Spreadsheet::WriteExcel and Excel::Writer::XLSX [1].

A big help in finding interesting CPAN modules is the search engine on
https://metacpan.org/

Depending on your platform and preference using find(1) might also be
helpful to traverse the folders, rather than doing so in Perl.

Note that none of this has anything to do with BioPerl as such, though,
and you'll need to do some actual programming to get the job done.

Best regards,

Adam

[1] http://blogs.perl.org/users/john_mcnamara/2011/10/spreadsheetwriteexcel-is-dead-long-live-excelwriterxlsx.html

--
"Angels can fly because they take themselves lightly."
Adam Sjøgren
asjo at koldfront.dk

From David.Messina at sbc.su.se Fri Dec 9 17:30:23 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Fri, 9 Dec 2011 23:30:23 +0100
Subject: [Bioperl-l] Batch processing of Data
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID:

Yes, it can be done. However, it has nothing to do with this mailing list.

Steps 1 and 2 are basic Perl. For steps 3 through 5, try googling "perl parse excel".

Dave

On Fri, Dec 9, 2011 at 20:59, Boddu, Jayanand wrote:

> Hi Anyone:
> Please let me know if the following is practical with PERL.
> My data output can be described as following.
>
> 1. Hundreds of samples are run.
>
> 2. A batch output sends data from each sample to its own "folder".
> Output is in the form of few text files, spreadsheets and PDF files.
>
> 3. One of the spreadsheet has the data of most interest.
>
> 4. This means I end up having hundreds of folders.
>
> 5. The spreadsheet with the data has multiple worksheets out of
> which a couple have the interesting data to be processed (Please find
> attached a spreadsheet output in which the data is organized and the
> worksheets of my interest are named as "Compound" and "Peak". Yellow
> high-lighted columns in each worksheet has the data to be processed).
> OK. That's long description.
> NOW. Is it practical to write a PERL/or any script to;
>
> 1. Enter each folder.
>
> 2. Look for the spreadsheet of interest.
>
> 3. Look for worksheets named "Compound" and "Peak".
>
> 4. Look for the specific columns of interest.
>
> 5. Copy paste the columns of interest into a new spreadsheet/text
> file with data from each folder next to each other.
>
> This final spreadsheet will pass through a bunch of other calculations.
>
> I apologize for this long and painful description.
> However, it would be great if this can be done.
> Thanks > Jay > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From lsbrath at gmail.com Sat Dec 10 16:39:44 2011 From: lsbrath at gmail.com (Mgavi Brathwaite) Date: Sat, 10 Dec 2011 16:39:44 -0500 Subject: [Bioperl-l] Perl parsing In-Reply-To: References: Message-ID: Yes grasshopper you have to suffer a little bit. Learn Perl first, then step up to BioPerl. Chris I feel you concerning the power of Regex, and the sarcasm. Lom On Fri, Dec 9, 2011 at 3:37 PM, Fields, Christopher J wrote: > On Dec 7, 2011, at 10:47 PM, lumos lumos wrote: > > > Hello, > > > > I have a text file(tab-delim) with some gene names as shown below. > > > > *BRCA1: breast cancer 1, early onset > > > > TNF: tumor necrosis factor > > > > OMG: oligodendrocyte myelin glycoprotein* > > > > I would like to get the list of gene name BRCA1,TNF,OMG that is before > the > > colon(:) . > > How do I parse in perl this text file with this list of genes? > > 'Very carefully?' > > Okay, I'll try to refrain from further sarcasm, but I'm confused, what > does this have to do with BioPerl (*the toolkit*) specifically? That is > what this mailing list is for. > > Just to note, this is a very common perl task. The answer is attainable by > searching for it (not to mention taking the time to learn basic perl). For > instance: > > > http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings > > One of the many links found by simply using Google: > > http://lmgtfy.com/?q=perl+parse+tab+file > > I'll leave the regex munging to you. > > (okay, I failed at refraining from sarcasm, ah well it's friday). > > chris > > > > Thanks in advance. 
> > LM
> >
> >
> > _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From pawan.mani2 at gmail.com Mon Dec 5 17:00:09 2011
From: pawan.mani2 at gmail.com (pawan.mani2 at gmail.com)
Date: Tue, 6 Dec 2011 03:30:09 +0530
Subject: [Bioperl-l] bioperl in cygwin
Message-ID: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>

Hi

I gave the following commands in a Cygwin terminal:

perl -MCPAN -e shell

Then I typed:

o conf prerequisites_policy follow
o conf commit
install Bundle::CPAN
install Module::Build
d /bioperl/

This gives a list of different versions.
I selected CJFIELDS/BioPerl-1.6.1.96:

install CJFIELDS/BioPerl-1.6.1.96.tar.gz

but build.install was not ok.

Kindly let me know the complete steps for installing BioPerl in Cygwin under Windows 7.

Thanks in advance.

with best regards,
Pawan

From cjfields at illinois.edu Sun Dec 11 13:22:01 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 11 Dec 2011 18:22:01 +0000
Subject: [Bioperl-l] bioperl in cygwin
In-Reply-To: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
References: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
Message-ID:

Pawan,

Hard to say what the problem is w/o supplying warnings/errors. Prior to doing this, you should try installing BioPerl-1.6.901 (the latest CPAN release). You can try a direct installation of the distribution, but the easiest way to get the latest version is to try installing Bio::Perl.

(I'm not sure what BioPerl-1.6.1.96 is, but this seems wrong)

chris

On Dec 5, 2011, at 4:00 PM, wrote:

> Hi
> I would like to after the givibg following commands in cgwin terminal:
>
>
> perl -MCPAN -e shell
>
> then I type
>
> o conf prerequisites_policy follow
> o conf commit
> install Bundle::CPAN
> install Module::Build
> d /bioperl/
> then we you get a list of different versions.
> I selected CJFIELDS/BioPerl-1.6.1.96
> install CJFIELDS/BioPerl-1.6.1.96.tar.gz
>
>
> but build.install was not ok.
>
> Kindly allow me to know the step of complete bioprl installatin cygwin undre windows 7.
>
> thanks in advanced.
>
> with best regards,
> Pawan
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From b.m.forde at umail.ucc.ie Tue Dec 13 06:03:50 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Tue, 13 Dec 2011 03:03:50 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32965574.post@talk.nabble.com>

Thank you for the replies.

My script (below) reads in a list of locus_tags from a tab-delimited text
file. It compares these locus_tags to the locus_tags in a GenBank file and,
where they are equal, should add new features.
The line
$feat->add_tag_value()
needs to be defined. In the BioPerl wiki this variable appears to be defined
by giving it coordinates etc. (creating a new feature). I wish to add
features to the CDS key when the locus_tags are identical. Is this possible?

use strict;
use Bio::SeqIO;

# Read the first column (locus_tag) of each line of the tab-delimited list.
my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

# Compare each CDS locus_tag against every entry in the list.
for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        # ADD NEW FEATURES
                    }
                }
            }
        }
    }
}

The script works as far as the comparison point, where locus_tags in the
GenBank file "Contig100.gb" are compared against a list of locus_tags from a
delimited text file.
regards Brian Jason Stajich-5 wrote: > > $feature->add_tag_value('color','blue'); > > On Dec 9, 2011, at 8:52 AM, BForde wrote: > >> >> Hello all, >> >> I am new to Bioperl so I apologise if this is stupid question. >> >> For CDS features I which to add additional qualifiers e.g. /colour and >> /note >> qualifiers. I have looked at the BioPerl wiki but am still unsure as how >> to >> do this? >> >> regards >> >> Brian >> -- >> View this message in context: >> http://old.nabble.com/Genbank-files-tp32941955p32941955.html >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Jason Stajich > jason.stajich at gmail.com > jason at bioperl.org > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32965574.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From roy.chaudhuri at gmail.com Tue Dec 13 06:52:05 2011 From: roy.chaudhuri at gmail.com (Roy Chaudhuri) Date: Tue, 13 Dec 2011 11:52:05 +0000 Subject: [Bioperl-l] Genbank files In-Reply-To: <32965574.post@talk.nabble.com> References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32965574.post@talk.nabble.com> Message-ID: <4EE73C65.1080101@gmail.com> Hi Brian, Just to check I have understood you, you want to read through a genbank file and add additional tags to features which are listed in a tab-delimited file of locus tags? Your code is on the right lines, but it would be much more efficient to read your tab-delimited locus_tags into a hash, and check using exists, rather than ploughing through the (potentially very long) list of locus tags every time. 
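A minimal sketch of that hash-based lookup on its own, without any BioPerl calls. The sub name load_locus_tags and the sample tags are hypothetical, and the first tab-separated column is assumed to hold the locus_tag.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup hash keyed on the first
# tab-separated column of each line, and return a reference to it.
sub load_locus_tags {
    my %wanted;
    for my $line (@_) {
        chomp(my $copy = $line);      # drop the trailing newline
        next unless length $copy;     # ignore empty lines
        my ($tag) = split /\t/, $copy;
        $wanted{$tag} = 1;
    }
    return \%wanted;
}

my $wanted = load_locus_tags("ABC_0001\tfoo\n", "ABC_0002\tbar\n");

# Each membership check is a single hash lookup rather than a scan
# through the whole list.
for my $tag (qw(ABC_0001 ABC_0003)) {
    print "${tag}: ", (exists $wanted->{$tag} ? "listed" : "not listed"), "\n";
}
```

With a plain array, every CDS feature would be compared against every entry in the list; with the hash, the cost of each check does not grow with the length of the list.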
Also, be careful with new lines in your tab file (you can safely get rid of
them using "chomp"). You can miss out the "has_tag" check by using
"get_tagset_values" instead of "get_tag_values", since the former does not
complain if the tag is not present. Once you have modified your sequence
object, you need to write it out to a new file (or STDOUT) using Bio::SeqIO.

Also, just a couple of general points, you should always "use warnings" (or
even better "use warnings FATAL=>qw(all)") since that can help solve many
problems, and your code may be easier to read if you don't include the word
"object" in all your variable names (after all you wouldn't say you write on
a paper object using a pen object).

use strict;
use warnings FATAL=>qw(all);
use Bio::SeqIO;
open (my $list, 'list') or die $!;
my %V;
while (<$list>){
    chomp;
    $V{(split(/\t/, $_))[0]}=1;
}
my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;
for my $feat_object ($seq_object->remove_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        for my $V3 ($feat_object->get_tagset_values('locus_tag')){
            if (exists $V{$V3}){
                $feat_object->add_tag_value(listed_in_tab_file=>'yes');
                next;
            }
        }
    }
    $seq_object->add_SeqFeature($feat_object);
}
Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);

Hope this helps.
Cheers,
Roy.

On 13/12/2011 11:03, BForde wrote:
>
> Than you for the replies.
>
> My script (below) reads in a list of locus_tags from a tab delimited text
> file. Compares these locus_tags to the locus_tags in a genbank file and
> where they are equal adds new features.
> the line
> $feat->add_tag_value()
> needs to be defined. In the bioperl wiki this variable appears to be defined
> by giving it coordinates etc (creating a new feature). I wish to add
> features to CDS key when the locus_tags are identical. Is this possible?
>
> use strict;
> use Bio::SeqIO;
>
> my @V;
> open (LIST1, 'list') ||die;
> while (<LIST1>){
>     push @V, (split(/\t/, $_))[0];
> }
> close(LIST1);
>
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
>
> for my $feat_object ($seq_object->get_SeqFeatures){
>     if ($feat_object->primary_tag eq "CDS"){
>         if ($feat_object->has_tag('locus_tag')){
>             for my $V3 ($feat_object->get_tag_values('locus_tag')){
>                 for my $V1 (@V) {
>                     if ($V1 eq $V3){
>                         ADD NEW FEATURES
>
>                     }
>                 }
>             }
>         }
>     }
> }
>
> The script works down as far as the comparison point where locus_tags in the
> genbankfile "Contig100.gb" are compared against a list of locus_tags from a
> delimited txt file.
>
>
> regards
>
> Brian
>
> Jason Stajich-5 wrote:
>>
>> $feature->add_tag_value('color','blue');
>>
>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>
>>>
>>> Hello all,
>>>
>>> I am new to Bioperl so I apologise if this is stupid question.
>>>
>>> For CDS features I which to add additional qualifiers e.g. /colour and
>>> /note
>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>>> to
>>> do this?
>>>
>>> regards
>>>
>>> Brian
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> Jason Stajich >> jason.stajich at gmail.com >> jason at bioperl.org >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > From b.m.forde at umail.ucc.ie Tue Dec 13 09:22:01 2011 From: b.m.forde at umail.ucc.ie (Brian Forde) Date: Tue, 13 Dec 2011 14:22:01 +0000 Subject: [Bioperl-l] Genbank files In-Reply-To: <4EE73C65.1080101@gmail.com> References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32965574.post@talk.nabble.com> <4EE73C65.1080101@gmail.com> Message-ID: Hi Roy, Thank you. That works perfectly. I have to confess that someone else told me to use hashes but I could not get them to work.. Thanks again regards Brian On Tue, Dec 13, 2011 at 11:52 AM, Roy Chaudhuri wrote: > Hi Brian, > > Just to check I have understood you, you want to read through a genbank > file and add additional tags to features which are listed in a > tab-delimited file of locus tags? > > Your code is on the right lines, but it would be much more efficient to > read your tab-delimited locus_tags into a hash, and check using exists, > rather than ploughing through the (potentially very long) list of locus > tags every time. Also, be careful with new lines in your tab file (you can > safely get rid of them using "chomp"). You can miss out the "has_tag" check > by using "get_tagset_values" instead of "get_tag_values", since the former > does not complain if the tag is not present. Once you have modified your > sequence object, you need to write it out to a new file (or STDOUT) using > Bio::SeqIO. 
> > Also, just a couple of general points, you should always "use warnings" > (or even better "use warnings FATAL=>qw(all)") since that can help solve > many problems, and your code may be easier to read if you don't include the > word "object" in all your variable names (after all you wouldn't say you > write on a paper object using a pen object). > > use strict; > use warnings FATAL=>qw(all); > use Bio::SeqIO; > open (my $list, 'list') or die $!; > my %V; > while (<$list>){ > chomp; > $V{(split(/\t/, $_))[0]}=1; > > } > my $seqio_object = Bio::SeqIO->new(-file=>"**Contig100.gb"); > my $seq_object = $seqio_object->next_seq; > for my $feat_object ($seq_object->remove_**SeqFeatures){ > > if ($feat_object->primary_tag eq "CDS"){ > for my $V3 ($feat_object->get_tagset_**values('locus_tag')){ > if (exists $V{$V3}){ > $feat_object->add_tag_value(**listed_in_tab_file=>'yes'); > next; > } > } > } > $seq_object->add_SeqFeature($**feat_object); > } > Bio::SeqIO->new(-format=>'**genbank')->write_seq($seq_**object); > > Hope this helps. > Cheers, > Roy. > > > On 13/12/2011 11:03, BForde wrote: > >> >> Than you for the replies. >> >> My script (below) reads in a list of locus_tags from a tab delimited text >> file. Compares these locus_tags to the locus_tags in a genbank file and >> where they are equal adds new features. >> the line >> $feat->add_tag_value() >> needs to be defined. In the bioperl wiki this variable appears to be >> defined >> by giving it coordinates etc (creating a new feature). I wish to add >> features to CDS key when the locus_tags are identical. Is this possible? 
>> >> use strict; >> use Bio::SeqIO; >> >> my @V; >> open (LIST1, 'list') ||die; >> while (){ >> push @V, (split(/\t/, $_))[0]; >> } >> close(LIST1); >> >> my $seqio_object = Bio::SeqIO->new(-file=>"**Contig100.gb"); >> my $seq_object = $seqio_object->next_seq; >> >> for my $feat_object ($seq_object->get_SeqFeatures)**{ >> if ($feat_object->primary_tag eq "CDS"){ >> if ($feat_object->has_tag('locus_**tag')){ >> for my $V3 ($feat_object->get_tag_values(**'locus_tag')){ >> for my $V1 (@V) { >> if ($V1 eq $V3){ >> ADD NEW FEATURES >> >> } >> } >> } >> } >> } >> } >> >> The script works down as far as the comparison point where locus_tags in >> the >> genbankfile "Contig100.gb" are compared against a list of locus_tags from >> a >> delimited txt file. >> >> >> regards >> >> Brian >> >> Jason Stajich-5 wrote: >> >>> >>> $feature->add_tag_value('**color','blue'); >>> >>> On Dec 9, 2011, at 8:52 AM, BForde wrote: >>> >>> >>>> Hello all, >>>> >>>> I am new to Bioperl so I apologise if this is stupid question. >>>> >>>> For CDS features I which to add additional qualifiers e.g. /colour and >>>> /note >>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how >>>> to >>>> do this? >>>> >>>> regards >>>> >>>> Brian >>>> -- >>>> View this message in context: >>>> http://old.nabble.com/Genbank-**files-tp32941955p32941955.html >>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >>>> >>>> ______________________________**_________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/**mailman/listinfo/bioperl-l >>>> >>> >>> Jason Stajich >>> jason.stajich at gmail.com >>> jason at bioperl.org >>> >>> >>> ______________________________**_________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/**mailman/listinfo/bioperl-l >>> >>> >>> >> > -- Brian Forde Microbiology Dept. Bioscience Institute. 
Room 4.11
University College Cork
Cork
Ireland
tel:+353 21 4901306
email: b.m.forde at umail.ucc.ie

From b.m.forde at umail.ucc.ie Mon Dec 12 12:20:53 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Mon, 12 Dec 2011 09:20:53 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32959999.post@talk.nabble.com>

Thank you for the replies.

I am unsure as to how to use the line below with my script. My script so far reads

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        # ADD NEW FEATURES

                    }
                }
            }
        }
    }
}

The script works down as far as the comparison point where locus_tags in the genbank file "Contig100.gb" are compared against a list of locus_tags from a delimited txt file.
If possible, could you show me how to amend my script so I can add new features?

regards

Brian

Jason Stajich-5 wrote:
>
> $feature->add_tag_value('color','blue');
>
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>
>>
>> Hello all,
>>
>> I am new to Bioperl so I apologise if this is stupid question.
>>
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>>
>> regards
>>
>> Brian
>> --
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

--
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32959999.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From Russell.Smithies at agresearch.co.nz Tue Dec 13 22:17:02 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Wed, 14 Dec 2011 16:17:02 +1300
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32959999.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32959999.post@talk.nabble.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF27A@exchsth.agresearch.co.nz>

Something like this:

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        #ADD NEW FEATURES
                        $feat_object->add_tag_value('color','blue');
                    }
                }
            }
        }
    }
}

#write the new annotations
my $io = Bio::SeqIO->new(-format => "genbank", -file => ">new.gb" );
$io->write_seq($seq_object);

Take another look at http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Building_Your_Own_Sequences

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of BForde
> Sent: Tuesday, 13 December 2011 6:21
a.m. > To: Bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Genbank files > > > Than you for the replies. > > I am unsure as to how to use the line below with my script. My script so far > reads > > use strict; > use Bio::SeqIO; > > my @V; > open (LIST1, 'list') ||die; > while (){ > push @V, (split(/\t/, $_))[0]; > } > close(LIST1); > > my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb"); > my $seq_object = $seqio_object->next_seq; > > for my $feat_object ($seq_object->get_SeqFeatures){ > if ($feat_object->primary_tag eq "CDS"){ > if ($feat_object->has_tag('locus_tag')){ > for my $V3 ($feat_object->get_tag_values('locus_tag')){ > for my $V1 (@V) { > if ($V1 eq $V3){ > ADD NEW FEATURES > > } > } > } > } > } > } > > The script works down as far as the comparison point where locus_tags in the > genbankfile "Contig100.gb" are compared against a list of locus_tags from a > delimited txt file. > I possbile could you show me how to amend my script so I can add new > features > > regards > > Brian > > Jason Stajich-5 wrote: > > > > $feature->add_tag_value('color','blue'); > > > > On Dec 9, 2011, at 8:52 AM, BForde wrote: > > > >> > >> Hello all, > >> > >> I am new to Bioperl so I apologise if this is stupid question. > >> > >> For CDS features I which to add additional qualifiers e.g. /colour > >> and /note qualifiers. I have looked at the BioPerl wiki but am still > >> unsure as how to do this? > >> > >> regards > >> > >> Brian > >> -- > >> View this message in context: > >> http://old.nabble.com/Genbank-files-tp32941955p32941955.html > >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Jason Stajich
> > jason.stajich at gmail.com
> > jason at bioperl.org
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> >
>
> --
> View this message in context: http://old.nabble.com/Genbank-files-
> tp32941955p32959999.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From l.m.timmermans at students.uu.nl Wed Dec 14 10:43:24 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Wed, 14 Dec 2011 16:43:24 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
Message-ID:

Hi all,

As already mentioned on IRC, I recently wrote an SFF parser and uploaded it to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF entries don't map 1:1 with Bio::Seq objects), but if anyone has the time to write one I'd be most grateful.

Leon From p.j.a.cock at googlemail.com Wed Dec 14 11:03:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 14 Dec 2011 16:03:05 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans wrote: > Hi all, > > As already mentioned on IRC, I recently wrote a SFF parser and uploaded it > to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF > entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time > to write one I'd be most grateful. > > Leon Hi Leon, Have you looked at the index block at all, in order to offer random access by read ID, or to access the Roche XML manifest? Please ask if you need more information about this - or if you can read Python: https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py Is this building on Miguel Pignatelli's work? I don't recall seeing any follow up posts from him after this one: http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html Peter From cjfields at illinois.edu Wed Dec 14 11:12:58 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 14 Dec 2011 16:12:58 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: , Message-ID: <3BBC1132-E768-45D9-A107-ACD51791722D@illinois.edu> Leon, Nice! Definitely a good idea to have the lower-level parser and the BioPerl-bridging code separate, one of my concerns with the various parsers we have right now which hard-wire BioPerl classes in with the parser (makes it hard for optimization). Chris PS- Peter, I don't think the two projects are related, but I suppose Leon is the best to answer that. Sent from my stupid iPad, now my laptop's on the fritz On Dec 14, 2011, at 10:04 AM, "Peter Cock" wrote: > On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans > wrote: >> Hi all, >> >> As already mentioned on IRC, I recently wrote a SFF parser and uploaded it >> to CPAN as Bio::SFF. 
I haven't written a Bio::SeqIO::sff wrapper yet (SFF
>> entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time
>> to write one I'd be most grateful.
>>
>> Leon
>
> Hi Leon,
>
> Have you looked at the index block at all, in order to offer random
> access by read ID, or to access the Roche XML manifest? Please
> ask if you need more information about this - or if you can read Python:
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>
> Is this building on Miguel Pignatelli's work? I don't recall seeing
> any follow up posts from him after this one:
> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>
> Peter
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From l.m.timmermans at students.uu.nl Wed Dec 14 11:27:58 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Wed, 14 Dec 2011 17:27:58 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock wrote:
> Hi Leon,
>
> Have you looked at the index block at all, in order to offer random
> access by read ID, or to access the Roche XML manifest? Please
> ask if you need more information about this - or if you can read Python:
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>

I have looked at it, but not implemented it yet. There is no standardized index, and the ones that are in common use either seem stupid (the Roche index, which is essentially just a weirdly formatted sequential list, though that should still be faster than a table scan) or undocumented (hash based index).

> Is this building on Miguel Pignatelli's work? I don't recall seeing
> any follow up posts from him after this one:
> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>

It isn't. I like his idea for reusing BioPython's test files though.

Leon From p.j.a.cock at googlemail.com Wed Dec 14 11:44:28 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 14 Dec 2011 16:44:28 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 4:27 PM, Leon Timmermans wrote: > On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock > wrote: >> >> Hi Leon, >> >> Have you looked at the index block at all, in order to offer random >> access by read ID, or to access the Roche XML manifest? Please >> ask if you need more information about this - or if you can read Python: >> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py > > I have looked at it, but not implemented it yet. There is no standardized > index, and the ones that are in common use either seem stupid (the Roche > index, which is essentially just a weirdly formatted sequential list, though > that should still be faster than a table scan) or undocumented (hash based > index). There are two widely used indexes, both from Roche (one with and one without an XML manifest, magic bytes .mft and .srt). They are both just a simple table of the reads names and offsets, sorted alphabetically. This works pretty well for rapid lookup for SFF files (because the read count is not so high), and is pretty easy. I don't think anyone used the hash table style indexes (.hsh), which I assume was a proof of principle or trial in the early days of SFF. One thing to check is what Ion Torrent's SFF files use. I would guess they've followed Roche, but I don't know. After all, the index structure is not defined in the SFF specification - it was left extensible on purpose. >> Is this building on Miguel Pignatelli's work? I don't recall seeing >> any follow up posts from him after this one: >> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html > > It isn't. I like his idea for reusing BioPython's test files though. Yes, please do. 
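A minimal sketch of the kind of lookup the sorted name/offset table allows, assuming the index has already been loaded into memory as a list sorted by read name. The read names, offsets and the find_offset helper are made up for illustration and are not part of Bio::SFF or Biopython.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: [read name, byte offset] pairs,
# sorted alphabetically by read name (values invented for the example).
my @index = (
    [ 'readA', 840  ],
    [ 'readC', 2096 ],
    [ 'readF', 3350 ],
    [ 'readK', 4612 ],
);

# Binary search over the sorted table; returns the offset, or undef
# when the read name is not present.
sub find_offset {
    my ($index, $name) = @_;
    my ($lo, $hi) = (0, $#$index);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $name cmp $index->[$mid][0];
        return $index->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 }
        else          { $lo = $mid + 1 }
    }
    return;
}

for my $name (qw(readF readZ)) {
    my $offset = find_offset(\@index, $name);
    print defined $offset
        ? "$name is at byte $offset\n"
        : "$name is not in the index\n";
}
```

Once the table is in RAM, a plain hash keyed by read name would do the same job with less code; the binary search only matters if the sorted order is to be used directly.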
Peter From gingerplum at gmail.com Wed Dec 14 00:18:55 2011 From: gingerplum at gmail.com (plum ginger) Date: Tue, 13 Dec 2011 21:18:55 -0800 (PST) Subject: [Bioperl-l] a problem about BLAST Message-ID: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com> Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I need run BLAST on more than one sequences. However the blast outfile only store the result of last sequence. How to make the outfile store all results? Wish your help. Thanks very much! Best regards From jason.stajich at gmail.com Thu Dec 15 12:02:47 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Thu, 15 Dec 2011 11:02:47 -0600 Subject: [Bioperl-l] a problem about BLAST In-Reply-To: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com> References: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com> Message-ID: <58E5B487-7FF0-4018-B109-D6595DC2E493@gmail.com> you are probably setting the outfile in each parsing iteration -- you need to show your code if you want someone to help you debug the problem. On Dec 13, 2011, at 11:18 PM, plum ginger wrote: > Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I > need run BLAST on more than one sequences. However the blast outfile > only store the result of last sequence. How to make the outfile store > all results? > > Wish your help. Thanks very much! > > > Best regards > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From pengyu.ut at gmail.com Fri Dec 16 17:10:27 2011 From: pengyu.ut at gmail.com (Peng Yu) Date: Fri, 16 Dec 2011 16:10:27 -0600 Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment? 
Message-ID: Hi, Bio::Das::segment can give me the following warnings without stopping the whole program when the position for the query doesn't exist. I could test the return result and quit when it is []. But this would cause my program have an test whenever I call segment. I'm wondering if there is an automatic way to let Bio::Das::segment stop in such cases. --------------------- WARNING --------------------- MSG: Sequence is not dna or rna, but []. Attempting to revcom, but unsure if this is right --------------------------------------------------- -- Regards, Peng From cjfields at illinois.edu Fri Dec 16 21:48:07 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sat, 17 Dec 2011 02:48:07 +0000 Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment? In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0881CF8E@CITESMBX4.ad.uillinois.edu> Setting verbosity to 2 should convert warnings to exceptions. IIRC, set '-verbose => 2' in the Bio::Das constructor, set '$das->verbose(2)' explicitly, or set the env variable BIOPERLDEBUG=2. chris ________________________________________ From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of Peng Yu [pengyu.ut at gmail.com] Sent: Friday, December 16, 2011 4:10 PM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment? Hi, Bio::Das::segment can give me the following warnings without stopping the whole program when the position for the query doesn't exist. I could test the return result and quit when it is []. But this would cause my program have an test whenever I call segment. I'm wondering if there is an automatic way to let Bio::Das::segment stop in such cases. --------------------- WARNING --------------------- MSG: Sequence is not dna or rna, but []. 
Attempting to revcom, but unsure if this is right --------------------------------------------------- -- Regards, Peng _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From anna.fr at gmail.com Mon Dec 19 02:09:15 2011 From: anna.fr at gmail.com (Anna Friedlander) Date: Mon, 19 Dec 2011 20:09:15 +1300 Subject: [Bioperl-l] StandAloneBlastPlus blastdbcmd question Message-ID: Hi all I have a question about using blastdbcmd via Bio::Tools::Run::StandAloneBlastPlus I have some Blast+ search results that I am manipulating in a perl programme, and I would like to retrieve some sequence information for some results using subject sequence IDs, and associated subject start and end indices. If I was using blastdbcmd directly, I would do so using the -entry and -range options. My question is, can I use all the blastdbcmd options (or more specifically, just the -entry and -range options) from within the StandAloneBlastPlus module? My apologies if I don't properly understand how this "wrapper" works! Thanks in advance for your help Anna Friedlander From l.m.timmermans at students.uu.nl Mon Dec 19 09:19:14 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 15:19:14 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock wrote: > There are two widely used indexes, both from Roche (one with and > one without an XML manifest, magic bytes .mft and .srt). They are > both just a simple table of the reads names and offsets, sorted > alphabetically. Yeah, that's what I got from the BioPython code. I didn't know it was sorted though (it doesn't make much sense either, unless they wanted to do a binary search or something). This works pretty well for rapid lookup for SFF files > (because the read count is not so high), and is pretty easy. > It's implemented in Bio::SFF 0.003. 
I did restructure my code into two readers though, since doing sequential and random-access in the class didn't make much sense code-wise. I don't think anyone used the hash table style indexes (.hsh), which > I assume was a proof of principle or trial in the early days of SFF. > I see, too bad. > One thing to check is what Ion Torrent's SFF files use. I would > guess they've followed Roche, but I don't know. After all, the > index structure is not defined in the SFF specification - it was > left extensible on purpose. > Yeah, we should check that too. Yes, please do. > It's added to 0.003. The lack of tests was bothering me, but the SFFs I had at hand were not suitable. Leon From p.j.a.cock at googlemail.com Mon Dec 19 09:31:18 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 14:31:18 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 2:19 PM, Leon Timmermans wrote: > On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock wrote: > >> There are two widely used indexes, both from Roche (one with and >> one without an XML manifest, magic bytes .mft and .srt). They are >> both just a simple table of the reads names and offsets, sorted >> alphabetically. > > Yeah, that's what I got from the BioPython code. I didn't know it > was sorted though (it doesn't make much sense either, unless they > wanted to do a binary search or something). I presume that's what Roche uses if they keep the index on disk. The alternative is to load the index into RAM, which is really fast. You just open the SFF, read the header, seek to the index, load the index. Without the index, you have to scan the entire SFF file to find each record and its offset - which is much slower. >> This works pretty well for rapid lookup for SFF files >> (because the read count is not so high), and is pretty easy. > > It's implemented in Bio::SFF 0.003. 
I did restructure my code into two > readers though, since doing sequential and random-access in the class > didn't make much sense code-wise. > >> I don't think anyone used the hash table style indexes (.hsh), which >> I assume was a proof of principle or trial in the early days of SFF. > > I see, too bad. > >> One thing to check is what Ion Torrent's SFF files use. I would >> guess they've followed Roche, but I don't know. After all, the >> index structure is not defined in the SFF specification - it was >> left extensible on purpose. > > Yeah, we should check that too. I don't have any Ion Torrent data first hand, and the public samples I've seen were FASTQ not SFF. But I know a few people with Ion Torrent machines that might be able to help... > It's added to 0.003. The lack of tests was bothering me, but the > SFFs I had at hand were not suitable. Have you looked at the sample SFF data in Biopython? Please use them for the BioPerl unit tests (we're been talking about a cross project collection of test data files like this), the README file should be self-explanatory: https://github.com/biopython/biopython/tree/master/Tests/Roche Peter From p.j.a.cock at googlemail.com Mon Dec 19 10:13:53 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 15:13:53 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> References: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> Message-ID: On Mon, Dec 19, 2011 at 3:03 PM, Adam Witney wrote: >> I don't have any Ion Torrent data first hand, and the public >> samples I've seen were FASTQ not SFF. But I know a few >> people with Ion Torrent machines that might be able to help? > > I can you let you have some Ion Torrent SFF files if it helps > > adam Hi Adam, I've just had a quick look at a file from an IonTorrent 314 chip that a colleague kindly sent me, and that SFF file had no index (but only 50k reads so this isn't so important). 
If you can send me (and Leon?) one of two original SFF files that would be useful, even if just to confirm that Ion Torrent's SFF files do indeed typically lack an index. If that is the case, I may need to remove the warning message Biopython currently prints when indexing these files: No SFF index, doing it the slow way Off list is fine if you'd like to keep the data private, use dropbox or something if you don't have an FTP server. Thanks, Peter From awitney at sgul.ac.uk Mon Dec 19 10:03:16 2011 From: awitney at sgul.ac.uk (Adam Witney) Date: Mon, 19 Dec 2011 15:03:16 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> >>> One thing to check is what Ion Torrent's SFF files use. I would >>> guess they've followed Roche, but I don't know. After all, the >>> index structure is not defined in the SFF specification - it was >>> left extensible on purpose. >> >> Yeah, we should check that too. > > I don't have any Ion Torrent data first hand, and the public > samples I've seen were FASTQ not SFF. But I know a few > people with Ion Torrent machines that might be able to help? I can you let you have some Ion Torrent SFF files if it helps adam From l.m.timmermans at students.uu.nl Mon Dec 19 10:48:34 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 16:48:34 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote: > I presume that's what Roche uses if they keep the index on disk. > > The alternative is to load the index into RAM, which is really fast. > You just open the SFF, read the header, seek to the index, load > the index. Without the index, you have to scan the entire SFF file > to find each record and its offset - which is much slower. > That's what I'm doing now. It's much faster, but it still takes a noticeable amount of time on large files. 
Have you looked at the sample SFF data in Biopython? Please > use them for the BioPerl unit tests (we're been talking about a > cross project collection of test data files like this), the README > file should be self-explanatory: > https://github.com/biopython/biopython/tree/master/Tests/Roche > Yeah, I'm using those now ( https://github.com/Leont/bio-sff/blob/master/t/reader.t). I must say there were some interesting corner cases in it. Leon From p.j.a.cock at googlemail.com Mon Dec 19 11:15:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 16:15:15 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 3:48 PM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote: > >> Have you looked at the sample SFF data in Biopython? Please >> use them for the BioPerl unit tests (we're been talking about a >> cross project collection of test data files like this), the README >> file should be self-explanatory: >> https://github.com/biopython/biopython/tree/master/Tests/Roche > > Yeah, I'm using those now > (https://github.com/Leont/bio-sff/blob/master/t/reader.t). Could you a link to your /corpus/README.txt file pointing back to the Biopython original for acknowledgement and future reference? > > I must say there were some interesting corner cases in it. > I'm glad you agree - and if you can think of any more special cases to verify that would be great. Are you doing just SFF parsing for now? Not writing? Now, as to Bio::SeqIO integration, Biopython's SeqIO uses format name "sff" to mean the full read sequence (with mixed case, upper case for the good sequence, lower cases for any left/right clipping - as in the Roche tools), and "sff-trim" to mean the trimmed sequences. I would encourage you to do the same, as part of the general aim of having consistent sequence format names between BioPerl, Biopython, and EMBOSS, where possible. 
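As a rough illustration of the two format names, here is a sketch that derives both forms from one raw read. It assumes 1-based, inclusive clip positions; the helper names and the example read are hypothetical and not taken from Bio::SFF or Biopython.

```perl
use strict;
use warnings;

# "sff" style: the full read, upper case for the kept region and
# lower case for the clipped flanks.
sub full_read_mixed_case {
    my ($seq, $clip_left, $clip_right) = @_;
    my $start = $clip_left - 1;                # 0-based offset of first kept base
    my $len   = $clip_right - $clip_left + 1;  # number of kept bases
    return lc(substr($seq, 0, $start))
         . uc(substr($seq, $start, $len))
         . lc(substr($seq, $start + $len));
}

# "sff-trim" style: only the kept region.
sub trimmed_read {
    my ($seq, $clip_left, $clip_right) = @_;
    return uc(substr($seq, $clip_left - 1, $clip_right - $clip_left + 1));
}

my $raw = 'TCAGACGTTGCANNN';    # invented 15-base read
my ($left, $right) = (5, 12);   # invented clip positions

print full_read_mixed_case($raw, $left, $right), "\n";
print trimmed_read($raw, $left, $right), "\n";
```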
Peter From l.m.timmermans at students.uu.nl Mon Dec 19 11:47:41 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 17:47:41 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock wrote: > Could you a link to your /corpus/README.txt file pointing > back to the Biopython original for acknowledgement and > future reference? > I forgot about that, I will add it to the next release. Are you doing just SFF parsing for now? Not writing? > I haven't written the writer yet (haven't needed it so far). I'd rather release working code early instead of waiting until everything is complete. Now, as to Bio::SeqIO integration, Biopython's SeqIO uses > format name "sff" to mean the full read sequence (with mixed > case, upper case for the good sequence, lower cases for any > left/right clipping - as in the Roche tools), and "sff-trim" to mean > the trimmed sequences. I would encourage you to do the > same, as part of the general aim of having consistent > sequence format names between BioPerl, Biopython, and > EMBOSS, where possible. > I agree, consistency is good. Leon From p.j.a.cock at googlemail.com Mon Dec 19 12:00:03 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 17:00:03 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock > wrote: >> >> Could you a link to your /corpus/README.txt file pointing >> back to the Biopython original for acknowledgement and >> future reference? > > I forgot about that, I will add it to the next release. Thanks. >> Are you doing just SFF parsing for now? Not writing? > > > I haven't written the writer yet (haven't needed it so far). I'd rather > release working code early instead of waiting until everything is complete. 
I understand - but make sure you've designed the data structures in the parser so as to allow the original record to be re-built as SFF. >> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses >> format name "sff" to mean the full read sequence (with mixed >> case, upper case for the good sequence, lower cases for any >> left/right clipping - as in the Roche tools), and "sff-trim" to mean >> the trimmed sequences. I would encourage you to do the >> same, as part of the general aim of having consistent >> sequence format names between BioPerl, Biopython, and >> EMBOSS, where possible. > > I agree, consistency is good. Great. I'd guess Bio::SeqIO integration would be more important that SFF output initially. Peter From cjfields at illinois.edu Mon Dec 19 14:44:22 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 19 Dec 2011 19:44:22 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: , Message-ID: Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best. Barring that, a very simple class for storing data. We've found BioPerl objects/classes pretty heavy. (for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing). Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used. For instance, if someone wanted a faster parser, use the low level, otherwise use the higher level (possibly BioPerl-specific) API. Lincoln does this do a certain degree with Bio-samtools; I would go further and make the bp- and non-bp code in separate dists. 
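As an illustration of the separation suggested above, here is a minimal sketch with FASTA standing in for any format: a low-level iterator that returns plain hashes, and an optional thin class on top. The names fasta_iterator and My::SimpleSeq are hypothetical; this is not how any existing BioPerl module is organised.

```perl
use strict;
use warnings;

# Low-level layer: returns an iterator yielding plain hashrefs
# ({ id, desc, seq }) from a FASTA filehandle. No objects involved.
sub fasta_iterator {
    my ($fh) = @_;
    my $pending;    # header line already read for the next record
    return sub {
        my $header = defined $pending ? $pending : <$fh>;
        undef $pending;
        return unless defined $header;
        chomp $header;
        my ($id, $desc) = $header =~ /^>(\S*)\s*(.*)$/
            or die "expected FASTA header, got: $header\n";
        my $seq = '';
        while (defined(my $line = <$fh>)) {
            if ($line =~ /^>/) { $pending = $line; last }
            $line =~ s/\s+//g;
            $seq .= $line;
        }
        return { id => $id, desc => $desc, seq => $seq };
    };
}

# Higher-level layer: a thin class wrapped around the same hash.
{
    package My::SimpleSeq;
    sub new { my ($class, $rec) = @_; return bless { %$rec }, $class }
    sub id  { $_[0]{id} }
    sub seq { $_[0]{seq} }
}

my $data = ">seq1 first record\nACGT\nACGT\n>seq2\nTTTT\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my $next = fasta_iterator($fh);
my @records;
while (my $rec = $next->()) {
    push @records, $rec;                     # fast path: plain hashes
    my $obj = My::SimpleSeq->new($rec);      # optional object layer
    printf "%s\t%d\n", $obj->id, length $obj->seq;
}
```

Callers who only need speed can stop at the hashrefs; the object layer can then be optimised or swapped without touching the parser.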
Chris Sent from my iPad On Dec 19, 2011, at 11:05 AM, "Peter Cock" wrote: > On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans > wrote: >> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock >> wrote: >>> >>> Could you a link to your /corpus/README.txt file pointing >>> back to the Biopython original for acknowledgement and >>> future reference? >> >> I forgot about that, I will add it to the next release. > > Thanks. > >>> Are you doing just SFF parsing for now? Not writing? >> >> >> I haven't written the writer yet (haven't needed it so far). I'd rather >> release working code early instead of waiting until everything is complete. > > I understand - but make sure you've designed the data structures > in the parser so as to allow the original record to be re-built as SFF. > >>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses >>> format name "sff" to mean the full read sequence (with mixed >>> case, upper case for the good sequence, lower cases for any >>> left/right clipping - as in the Roche tools), and "sff-trim" to mean >>> the trimmed sequences. I would encourage you to do the >>> same, as part of the general aim of having consistent >>> sequence format names between BioPerl, Biopython, and >>> EMBOSS, where possible. >> >> I agree, consistency is good. > > Great. I'd guess Bio::SeqIO integration would be more important > that SFF output initially. 
> > Peter > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Mon Dec 19 19:28:25 2011 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 19 Dec 2011 18:28:25 -0600 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: <4EEFD6A9.3010303@illinois.edu> On 12/19/2011 10:47 AM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 5:15 PM, Peter Cockwrote: > >> Could you a link to your /corpus/README.txt file pointing >> back to the Biopython original for acknowledgement and >> future reference? >> > I forgot about that, I will add it to the next release. > > Are you doing just SFF parsing for now? Not writing? > I haven't written the writer yet (haven't needed it so far). I'd rather > release working code early instead of waiting until everything is complete. > > Now, as to Bio::SeqIO integration, Biopython's SeqIO uses >> format name "sff" to mean the full read sequence (with mixed >> case, upper case for the good sequence, lower cases for any >> left/right clipping - as in the Roche tools), and "sff-trim" to mean >> the trimmed sequences. I would encourage you to do the >> same, as part of the general aim of having consistent >> sequence format names between BioPerl, Biopython, and >> EMBOSS, where possible. >> > I agree, consistency is good. > > Leon This is already implemented in Bio::SeqIO I believe. This is the same line of thinking with the FASTQ format, that one can have a 'format-variant' combination that (as one might guess) indicates to the parser any variation of the parser so logic within the parser can deal with it. You can also pass the '-variant => "foo"' parameter as well IIRC. You would just check the variant with the variant() method. 
chris From l.m.timmermans at students.uu.nl Tue Dec 20 10:25:13 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Tue, 20 Dec 2011 16:25:13 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock wrote: > I understand - but make sure you've designed the data structures > in the parser so as to allow the original record to be re-built as SFF. > I did, though currently it's rather hard to make new entries from scratch. That said, I can hardly imagine anyone wanting to do this. > Great. I'd guess Bio::SeqIO integration would be more important > that SFF output initially. > Probably. It looks like it's quite easy, it's just rather underdocumented. Leon From l.m.timmermans at students.uu.nl Tue Dec 20 10:26:11 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Tue, 20 Dec 2011 16:26:11 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J < cjfields at illinois.edu> wrote: > Kinda joining this a little late, but I think if there is a way to have a > low-level parser/writer that generically parses the data into simple > (possibly hash-tagged) data structures, that would be best. Barring that, > a very simple class for storing data. We've found BioPerl objects/classes > pretty heavy. > > (for an example of this, see Heng Li's readfq parser on github, which has > some stats for Fastq/fasta parsing). > > Any way we can separate the parser from object instantiation would enable > us to optimize the object/class layer and parser/writer layers separately, > with the possible nice side effect of making the parser more broadly used. > > For instance, if someone wanted a faster parser, use the low level, > otherwise use the higher level (possibly BioPerl-specific) API.
Lincoln > does this do a certain degree with Bio-samtools; I would go further and > make the bp- and non-bp code in separate dists. > A good OO system can actually help make things faster. For example, I'm unpacking the flowspace and quality data lazily, which made scanning through an SFF file 2.5-3 times as fast while having marginal extra costs when you do need them. Leon From l.m.timmermans at students.uu.nl Tue Dec 20 10:30:54 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Tue, 20 Dec 2011 16:30:54 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: <4EEFD6A9.3010303@illinois.edu> References: <4EEFD6A9.3010303@illinois.edu> Message-ID: On Tue, Dec 20, 2011 at 1:28 AM, Chris Fields wrote: > This is already implemented in Bio::SeqIO I believe. This is the same > line of thinking with the FASTQ format, that one can have a > 'format-variant' combination that (as one might guess) indicates to the > parser any variation of the parser so logic within the parser can deal with > it. You can also pass the '-variant => "foo"' parameter as well IIRC. You > would just check the variant with the variant() method. > Great. That makes life much easier :-) Leon From p.j.a.cock at googlemail.com Tue Dec 20 10:31:59 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 20 Dec 2011 15:31:59 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Tue, Dec 20, 2011 at 3:25 PM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock > wrote: >> >> I understand - but make sure you've designed the data structures >> in the parser so as to allow the original record to be re-built as SFF. > > I did, though currently it's rather hard to make new entries from scratch. > That said, I can hardly imagine anyone wanting to do this. Typical use cases I've found in using the Biopython SFF code are filtering an SFF file (taking some records only), and modifying the clipping values.
In both cases, the user isn't creating the SFF records from scratch. Peter From cjfields at illinois.edu Tue Dec 20 17:40:31 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 20 Dec 2011 22:40:31 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: , Message-ID: On Dec 20, 2011, at 9:26 AM, "Leon Timmermans" > wrote: On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J > wrote: Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best. Barring that, a very simple class for storing data. We've found BioPerl objects/classes pretty heavy. (for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing). Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used. For insn Sance, if someone wanted a faster parser, use the low level, otherwise use the higher level (possibly BioPerl-specific) API. Lincoln does this do a certain degree with Bio-samtools; I would go further and make the bp- and non-bp code in separate dists. A good OO system can actually help make things faster. For example, I'm unpacking the flowspace and quality data lazily, which made scanning through an SFF file 2.5-3 times as fast while having marginal extra costs when you do need them. Leon Yep, thinking about using the same approach for the Fastq variants. 
Chris Sent from my ancient iPad b/c my laptop's borked From dgacquer at ulb.ac.be Wed Dec 21 08:26:07 2011 From: dgacquer at ulb.ac.be (David Gacquer) Date: Wed, 21 Dec 2011 14:26:07 +0100 Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta Message-ID: <4EF1DE6F.4070508@ulb.ac.be> Dear BioPerl users/developers, I am facing a strange issue with the $seq_out->write_seq function when using large fasta files I have downloaded the hg19 chromosome 1, and applied the following code (basically I wanted to mask some regions in it but the problem also appears when copying the sequence without modifications): sub main{ my $seq_in = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0]); my $seq_out = Bio::SeqIO->new( -format => 'largefasta', -file => '>'.$ARGV[1]); my $seq_obj_in = $seq_in->next_seq(); my $modified_seq = $seq_obj_in->seq(); my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq => $modified_seq, -id => $seq_obj_in->id, -desc => $seq_obj_in->desc); $seq_out->write_seq($seq_obj_out); } when checking the output fasta file, the sequence of chr1 is 1-bp shorter. I have noticed that in the original fasta file, each line contains exactly 50 nucleotides, while the output of the $seq_out->write_seq function contains always 60 characters per line. chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that the very last base was missing, I created the following fasta files: chr121.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAG chr122.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAG They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last character being a G. 
When running the above code: chr121.out.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA chr122.out.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AG The output for the 122 bp chromosome is correct (2 lines of 60 bp and the last line with 2 bp, AG) but for the 121 bp chromosome, the last character is missing (2 lines of 60 bp only, last G is missing). When replacing -format => 'largefasta' by -format => 'fasta' or writing the output without the write_seq function however, the problem is solved. Am I missing something or is there a problem with the write_seq function used with large fasta files? (I am running BioPerl on a Mac under OS X Snow Leopard) Best regards David -- David Gacquer, Ph. D. IRIBHM - Universite Libre de Bruxelles Bldg C, room C.4.117 ULB, Campus Erasme, CP602 808 route de Lennik B-1070 Brussels Belgium Phone: +32-2-555 4187 Fax: +32-2-555 4655 E-mail: dgacquer at ulb.ac.be From koraydogankaya at gmail.com Sat Dec 24 03:44:43 2011 From: koraydogankaya at gmail.com (Koray) Date: Sat, 24 Dec 2011 00:44:43 -0800 (PST) Subject: [Bioperl-l] exons Message-ID: <8454dd72-4fed-41c5-977e-83d9300cd68b@z25g2000vbs.googlegroups.com> I need an explicit code for getting exon sequences of an mrna or gene fetched by get_Seq_by_acc or id. in ensembl it is easy but here it is not easy many ios exists. for example: here how can i get such a $gene object from DBs (GeneBank or EntrezGene) by acc numberor ids? exons code prev next Top Title : exons() Usage : @exons = $gene->exons(); @inital_exons = $gene->exons('Initial'); Function: Get all exon features or all exons of a specified type of this gene structure. Exon type is treated as a case-insensitive regular expression and optional. For consistency, use only the following types: initial, internal, terminal, utr, utr5prime, and utr3prime. 
A special and virtual type is 'coding', which refers to all types except utr. This method basically merges the exons returned by transcripts. Returns : An array of Bio::SeqFeature::Gene::ExonI implementing objects. Args : An optional string specifying the type of exon. From challa_ghanashyam at yahoo.com Sat Dec 24 15:09:09 2011 From: challa_ghanashyam at yahoo.com (GSC) Date: Sat, 24 Dec 2011 12:09:09 -0800 (PST) Subject: [Bioperl-l] re trieve description for a list of gi ids.. Message-ID: <33034438.post@talk.nabble.com> Hi all: I am new to perl. I am working on a script to retrieve the record description (name given for a sequence record in genbank) for a list of gi ids. the script works fine for 1000 ids but my list is about 250,000 ids long and it is not working for me. Any suggestions on this. GS -- View this message in context: http://old.nabble.com/retrieve-description-for-a-list-of-gi-ids..-tp33034438p33034438.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From cjfields at illinois.edu Tue Dec 27 10:03:28 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 27 Dec 2011 15:03:28 +0000 Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta In-Reply-To: <4EF1DE6F.4070508@ulb.ac.be> References: <4EF1DE6F.4070508@ulb.ac.be> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0BB29547@CITESMBX4.ad.uillinois.edu> This is a strange one. Personally I haven't seen this behavior, but that maybe it's OS-dependent? We'll need more information, particularly what version of BioPerl you are using, the OS, version of perl, etc. Also, in general to make sure we don't lose track of this issue it is best to submit a bug report: https://redmine.open-bio.org/projects/bioperl I'm planning on triaging bugs next week, I could take a look then. 
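For anyone triaging this, a BioPerl-free check of the boundary case from the report may help to state what the output should look like. The sketch below only wraps a sequence at 60 characters per line and counts what would be written; it does not reproduce the internals of the largefasta writer, and wrap_sequence is a hypothetical helper name.

```perl
use strict;
use warnings;

# Split a sequence into lines of at most $width characters.
sub wrap_sequence {
    my ($seq, $width) = @_;
    my @lines;
    for (my $pos = 0; $pos < length($seq); $pos += $width) {
        push @lines, substr($seq, $pos, $width);
    }
    return @lines;
}

# Same test sequences as in the report: a run of A with a final G.
for my $n (120, 121, 122) {
    my $seq   = ('A' x ($n - 1)) . 'G';
    my @lines = wrap_sequence($seq, 60);
    my $total = 0;
    $total += length($_) for @lines;
    printf "%d bp in -> %d lines, last line %d bp, %d bp out\n",
        $n, scalar @lines, length($lines[-1]), $total;
}
```

A correct writer should emit three lines for the 121 bp input, the last one holding the single G. Comparing that with the largefasta output on a length of 60*n + 1 isolates the case described in the report.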
chris ________________________________________ From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of David Gacquer [dgacquer at ulb.ac.be] Sent: Wednesday, December 21, 2011 7:26 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta Dear BioPerl users/developers, I am facing a strange issue with the $seq_out->write_seq function when using large fasta files I have downloaded the hg19 chromosome 1, and applied the following code (basically I wanted to mask some regions in it but the problem also appears when copying the sequence without modifications): sub main{ my $seq_in = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0]); my $seq_out = Bio::SeqIO->new( -format => 'largefasta', -file => '>'.$ARGV[1]); my $seq_obj_in = $seq_in->next_seq(); my $modified_seq = $seq_obj_in->seq(); my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq => $modified_seq, -id => $seq_obj_in->id, -desc => $seq_obj_in->desc); $seq_out->write_seq($seq_obj_out); } when checking the output fasta file, the sequence of chr1 is 1-bp shorter. I have noticed that in the original fasta file, each line contains exactly 50 nucleotides, while the output of the $seq_out->write_seq function contains always 60 characters per line. chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that the very last base was missing, I created the following fasta files: chr121.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAG chr122.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAG They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last character being a G. 
When running the above code: chr121.out.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA chr122.out.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AG The output for the 122 bp chromosome is correct (2 lines of 60 bp and the last line with 2 bp, AG) but for the 121 bp chromosome, the last character is missing (2 lines of 60 bp only, last G is missing). When replacing -format => 'largefasta' by -format => 'fasta' or writing the output without the write_seq function however, the problem is solved. Am I missing something or is there a problem with the write_seq function used with large fasta files? (I am running BioPerl on a Mac under OS X Snow Leopard) Best regards David -- David Gacquer, Ph. D. IRIBHM - Universite Libre de Bruxelles Bldg C, room C.4.117 ULB, Campus Erasme, CP602 808 route de Lennik B-1070 Brussels Belgium Phone: +32-2-555 4187 Fax: +32-2-555 4655 E-mail: dgacquer at ulb.ac.be _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From jdeuts01 at students.poly.edu Thu Dec 1 09:09:19 2011 From: jdeuts01 at students.poly.edu (jdeuts01 at students.poly.edu) Date: Thu, 1 Dec 2011 14:09:19 +0000 Subject: [Bioperl-l] question Message-ID: Dear Bioperl, This is my first experience with bioperl and I need help please. 1. The version of bioperl is 1.6.1 and have also installed Bundle-BioPerl 2.1.8 and mGen 1.03. I was unable to install Bribes and trouchelle DB. Will this prevent the BioPerl package from functioning correctly? 2. The operating platform is windows 7 - 64 bit using ActiveState - Perl v5.12.2 3. 
The script is as follows: #!/usr/bin/perl # Write a script using OOP to write protein sequences to the file fasta.txt.use strict;use warnings;use Bio::SeqIO::fasta; # Declare and initialize input and output files.my $protein_fasta = "protein.fa";my $protein_out = ">fasta.txt"; # Setup objects for input and output.my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format => 'Fasta');my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'fasta'); # Establish while loop using "next_seq" method # to read in multiple sequences from "protein.fa" file# one by one until none were remaining.while(my $seq = $seq_in -> next_seq){ $seq_out->write_seq($seq);} The information is successfully written to the file: fasta.txt. 4. Receiving the following error messages: Replacement list is longer than search list at C:/Perl64/site/lib/Bio/Range.pm line 251.Subroutine _initialize redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 92.Subroutine next_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 112.Subroutine write_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 193.Subroutine width redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 272.Subroutine preferred_id_type redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 295. Thanks in advance for your help.John Deutsch From jboddu at illinois.edu Thu Dec 1 11:38:00 2011 From: jboddu at illinois.edu (Boddu, Jayanand) Date: Thu, 1 Dec 2011 16:38:00 +0000 Subject: [Bioperl-l] Chromosome coordinates Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu> Hello I am newbie to Perl scripts. I have a file with short reads mapped to the MAIZE genome The format is a simple BLASTN output. 
READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2

As expected the same read maps to multiple places on the same/different chromosome.
I have a GFF file with annotated coordinates.
I would like to run a PERL script to find out READS that are within the GENES in the GFF file and that are not.
The anticipated script should;

1. Take the READ coordinates on the genome (by chromosome);
2. Go the GFF file;
3. Find the Chromosome;
4. Find the GENE (by coordinates);
5. and report READ-its coordinates-Chromosome-GENE-and its coordinates.

It doesn't need to be in the same order.
After this, I guess I could use simple Microsoft ACCESS query to pull out READS that are not mapped to the GENEs.
I would greatly appreciate if anyone can has a script that more or less similar job.
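The steps listed above can be sketched in plain Perl as follows. This is only an illustration under stated assumptions: the gene table is hypothetical (the names geneA and geneB and their coordinates are made up, standing in for what would be read from the GFF file), "within" is taken to mean that the hit lies entirely inside the gene, and the linear scan per chromosome would be slow on a whole genome, where an indexed store is the better choice.

```perl
use strict;
use warnings;

# Hypothetical gene table, as it might be pulled from the GFF file:
# chromosome => list of [start, end, name], with start <= end.
my %genes = (
    chr6  => [ [160769000, 160770000, 'geneA'] ],
    chr10 => [ [128000000, 128001000, 'geneB'] ],
);

# Return the genes on $chr that fully contain the hit. BLAST reports
# minus-strand hits with start > end, so the pair is sorted first.
sub genes_containing {
    my ($genes, $chr, $start, $end) = @_;
    ($start, $end) = ($end, $start) if $start > $end;
    return grep { $_->[0] <= $start && $end <= $_->[1] }
           @{ $genes->{$chr} || [] };
}

# Two of the hits from the table above: read, chromosome, start, end.
my @hits = (
    [ 'READ1', 'chr10', 128587356, 128587372 ],
    [ 'READ1', 'chr6',  160769803, 160769787 ],
);

my @report;
for my $hit (@hits) {
    my ($read, $chr, $start, $end) = @$hit;
    my @found = genes_containing(\%genes, $chr, $start, $end);
    if (@found) {
        for my $gene (@found) {
            my ($g_start, $g_end, $name) = @$gene;
            push @report, join("\t", $read, $chr, $start, $end,
                               $name, $g_start, $g_end);
        }
    }
    else {
        push @report, join("\t", $read, $chr, $start, $end, 'no_gene');
    }
}
print "$_\n" for @report;
```

Hits tagged no_gene are the reads outside annotated genes, so the final filtering step would not need a separate database query.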
Thanks Jay From scott at scottcain.net Thu Dec 1 11:59:56 2011 From: scott at scottcain.net (Scott Cain) Date: Thu, 1 Dec 2011 11:59:56 -0500 Subject: [Bioperl-l] Chromosome coordinates In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu> References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu> Message-ID: Hi Jay, Since the maize GFF file is likely to be fairly large, I would consider putting it in a database, using either Bio::DB::GFF if it is GFF2 or Bio::DB::SeqFeature::Store if it is gff3. Then you can use the methods that come along with either of those modules to search regions for for genes. They both support a get_features_by_location method, so you could get the range for each of the regions you want to look at, and check the database with that method to see if anything is there. Scott On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand wrote: > Hello > I am newbie to Perl scripts. > I have a file with short reads mapped to the MAIZE genome > The format is a simple BLASTN output. 
> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>
> As expected the same read maps to multiple places on the same/different
> chromosome.
> I have a GFF file with annotated coordinates.
> I would like to run a PERL script to find out READS that are within the
> GENES in the GFF file and that are not.
> The anticipated script should;
>
> 1. Take the READ coordinates on the genome (by chromosome);
>
> 2. Go the GFF file;
>
> 3. Find the Chromosome;
>
> 4. Find the GENE (by coordinates);
>
> 5.
and report READ-its coordinates-Chromosome-GENE-and its > coordinates. > > It doesn't need to be in the same order. > After this, I guess I could use simple Microsoft ACCESS query to pull out > READS that are not mapped to the GENEs. > I would greatly appreciate if anyone can has a script that more or less > similar job. > > Thanks > Jay > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- ------------------------------------------------------------------------ Scott Cain, Ph. D. scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research From jason.stajich at gmail.com Thu Dec 1 12:31:29 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Thu, 1 Dec 2011 09:31:29 -0800 Subject: [Bioperl-l] Chromosome coordinates In-Reply-To: References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu> Message-ID: <470A08F7-E238-44E1-B0D1-69FDBC0BA2A3@gmail.com> You might try using BEDtools, intersectBED which can do a lot of what you are doing in simple command line program. Jason On Dec 1, 2011, at 8:59 AM, Scott Cain wrote: > Hi Jay, > > Since the maize GFF file is likely to be fairly large, I would consider > putting it in a database, using either Bio::DB::GFF if it is GFF2 or > Bio::DB::SeqFeature::Store if it is gff3. Then you can use the methods > that come along with either of those modules to search regions for for > genes. They both support a get_features_by_location method, so you could > get the range for each of the regions you want to look at, and check the > database with that method to see if anything is there. > > Scott > > > On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand wrote: > >> Hello >> I am newbie to Perl scripts. >> I have a file with short reads mapped to the MAIZE genome >> The format is a simple BLASTN output. 
>> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
>> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
>> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
>> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
>> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
>> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
>> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
>> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
>> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
>> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
>> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
>> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>>
>> As expected the same read maps to multiple places on the same/different
>> chromosome.
>> I have a GFF file with annotated coordinates.
>> I would like to run a PERL script to find out READS that are within the
>> GENES in the GFF file and that are not.
>> The anticipated script should;
>>
>> 1. Take the READ coordinates on the genome (by chromosome);
>>
>> 2. Go the GFF file;
>>
>> 3. Find the Chromosome;
>>
>> 4. Find the GENE (by coordinates);
>>
>> 5. and report READ-its coordinates-Chromosome-GENE-and its
>> coordinates.
>>
>> It doesn't need to be in the same order.
>> After this, I guess I could use simple Microsoft ACCESS query to pull out
>> READS that are not mapped to the GENEs.
>> I would greatly appreciate if anyone can has a script that more or less
>> similar job.
>>
>> Thanks
>> Jay
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>
>
> --
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                       scott at scottcain dot net
> GMOD Coordinator (http://gmod.org/)      216-392-3087
> Ontario Institute for Cancer Research
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From jovel_juan at hotmail.com Thu Dec 1 12:36:32 2011
From: jovel_juan at hotmail.com (Juan Jovel)
Date: Thu, 1 Dec 2011 17:36:32 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To:
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
Message-ID:

Hello Everybody!

I have been running endless standalone Blasts these days, in two different clusters. The first one runs on Rock Linux, while the other one runs on Debian. On the Rock Linux one, when I use SearchIO to parse my blastp or blastx reports, I invariably get this message:

"Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"

What does it mean?
Would it have any effect on my parsing results?

Thanks,
JUAN

From cjfields at illinois.edu Thu Dec 1 14:03:45 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Dec 2011 19:03:45 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To:
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
Message-ID: <7233AC75-E03B-401A-8D0A-260ED76956A4@illinois.edu>

On Dec 1, 2011, at 11:36 AM, Juan Jovel wrote:

> Hello Everybody!
> I have been running endless standalone Blasts these days, in two different clusters. The first one runs on Rock Linux, while the other one on Devian. In the Rock Linux one, when I use SeachIO to parse my blastp or blastx reports, I get invariably a message:
> "Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
> What it does mean? Would it have any effect on my parsing results?
> Thanks,
> JUAN

This is a bug that was fixed; I think the fix was in the latest BioPerl release (1.6.901). There was a transliteration error that produced this warning, but otherwise it's harmless. The warning is perl version dependent; I think it pops up in perl 5.12 and up.

This user post (about halfway down) runs into the same issue:

http://www.sysarchitects.com/bioperl.

chris

From David.Messina at sbc.su.se Thu Dec 1 17:02:20 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Thu, 1 Dec 2011 23:02:20 +0100
Subject: [Bioperl-l] re trieving blast multiple alignment in fasta form
In-Reply-To: <32886592.post@talk.nabble.com>
References: <32886592.post@talk.nabble.com>
Message-ID:

Hi Eric,

Wait, do you want multiple pairwise alignments in your output FASTA file, or a single multiple alignment of your query and all the hits?

If the former, get_aln() will give you one pairwise alignment per hsp, but you'll need to move the output file creation statement (my $alnIO = ...) before the loops so it gets created only once.
Then, when you do the write statement ($alnIO->write_aln($aln);), all of the alignments will go to the same file.

If on the other hand you'd like to have a multiple alignment between a query and all of its hits, you'll have to take the IDs of the hits, pull the corresponding sequences out of the database, and then run a multiple alignment algorithm on them.

Dave

From scuoppo at gmail.com Fri Dec 2 17:50:28 2011
From: scuoppo at gmail.com (Claudio Scuoppo)
Date: Fri, 2 Dec 2011 17:50:28 -0500
Subject: [Bioperl-l] List of genes from genomic intervals
Message-ID:

Hi,

I am new to BioPerl. I was wondering what's the best strategy to get the genes contained in a series of human genomic intervals.
Basically, I have a table with:

Chromosome Start End

Which module should I be looking at?
Thanks,
Claudio

From awitney at sgul.ac.uk Mon Dec 5 06:09:39 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 5 Dec 2011 11:09:39 +0000
Subject: [Bioperl-l] Bio::Graphics imagemap and padding
Message-ID: <44A27378-CC4B-4396-817E-AA31004847C7@sgul.ac.uk>

Hi,

Image maps seem to be out of position if you use padding in the Panel, like this:

my $panel = Bio::Graphics::Panel->new(
    ...
    -pad_left  => 20,
    -pad_right => 20
    ...
);

Without these options, the image map is fine. Is this a known issue?

Also on a side note, I noticed that when using Bio::Graphics with Dancer, some of the CGI code was blocking somewhere (I found a reference to a similar problem with CGI and Catalyst). Swapping CGI with HTML::Entities fixes it:

sub create_web_map {
    ...
    eval "require HTML::Entities" unless HTML::Entities->can('encode_entities');
    ...
    my $title  = HTML::Entities::encode_entities($self->make_link($tr,$feature,1));
    my $target = HTML::Entities::encode_entities($self->make_link($tgr,$feature,1));
    ...
}

Thanks

Adam

From momin.amin at gmail.com Mon Dec 5 18:00:23 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Mon, 5 Dec 2011 15:00:23 -0800 (PST)
Subject: [Bioperl-l] SimpleAlign and consensus_string
Message-ID:

Hi,

I am generating a consensus sequence by aligning two protein homologs using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to understand the criteria that the consensus_string() method of SimpleAlign uses to determine the consensus at positions with dissimilar amino acids/nucleotides. Also, how would the % cutoffs provided to consensus_string() affect the outcome?

Thanks,
Amin

From jason.stajich at gmail.com Mon Dec 5 18:58:59 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Mon, 5 Dec 2011 15:58:59 -0800
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To:
References:
Message-ID: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>

There are several methods that do related things.

Below is the documentation for the method - it allows a stricter calling than majority-rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string, which is useful for generating DNA consensus patterns.

If you want a summary as to whether or not there are only conservative amino acid changes in the column, use the match_line method, which generates a string similar to the one CLUSTALW reports as a summary for each column.

=head2 consensus_string

 Title     : consensus_string
 Usage     : $str = $ali->consensus_string($threshold_percent)
 Function  : Makes a strict consensus
 Returns   : Consensus string
 Argument  : Optional threshold ranging from 0 to 100.
             The consensus residue has to appear in at least threshold %
             of the sequences at a given location, otherwise a '?'
             character will be placed at that location.
             (Default value = 0%)

=cut

On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:

> Hi ,
>
> I am generating a consensus sequence by aligning two protein homologs
> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
> understand the criteria consensus_string() method of simpleAlign uses
> to determine the consensus at position with dissimilar aminoacids/
> nucleotide. Also how would the % cutoffs provided to
> consensus_string() affect the outcome.
>
>
> Thanks,
> Amin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From wenbinmei at gmail.com Tue Dec 6 11:09:35 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 11:09:35 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
Message-ID:

Hi,

I have a question about reverse-complementing (revcom) a multiple sequence alignment. One way I can do it is to convert the format into fasta and revcom the individual sequences. I wonder whether there is an easy way to convert the multiple sequence alignment as a whole. Thank you for your help.

-best,
wenbin

From jason.stajich at gmail.com Tue Dec 6 12:40:37 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Tue, 6 Dec 2011 09:40:37 -0800
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To:
References:
Message-ID:

I think this would work to update it in place, though I haven't tried it myself:

for my $seq ( $aln->each_seq ) {
    $seq->seq( $seq->revcom->seq );
}
$out->write_aln($aln);

This may also work - not entirely sure if there is any extra work done on the meta data (start/end) of the Seq object when this is done. You may want to flip start/end for the sequences (the seqs are Bio::LocatableSeq objects) explicitly if not. Or you may not care about those data and can ignore them.

$seq = $seq->revcom

Jason

On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:

> Hi,
>
> I have a question about revcom the multiple sequence alignment.
One way I > can do convert the format into fasta and revcom individual sequences. I > wonder is there a easy way to convert the multiple sequence alignment as a > whole. Thank you for help. > > -best, > wenbin > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From wenbinmei at gmail.com Tue Dec 6 12:51:18 2011 From: wenbinmei at gmail.com (wenbin mei) Date: Tue, 6 Dec 2011 12:51:18 -0500 Subject: [Bioperl-l] revcom the multiple sequence alignment In-Reply-To: References: Message-ID: I think I might not explain clearly my questions. I extract the individual gene alignment from the whole genome alignment. Since some gene are on the reverse strand, I want to revcom the gene alignment. There is part of my scripts. I can read the strand information from another file. my $newstart = $refseq->column_from_residue_number($start); my $newend = $refseq->column_from_residue_number($end); $seq{$genename} = $aln->slice($newstart, $newend); Any suggestion to help me revcom some gene alignment on the minus strand is helpful. Thank you. -best, wenbin On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich wrote: > I think this would work to update it in place though I haven't tried it > myself > > for my $seq ( $aln->each_seq ) { > $seq->seq( $seq->revcom->seq ); > } > $out->write_aln($aln); > > This may also work - not entirely sure if there is any extra work done on > the meta data (start/end) of the Seq object when this is done. You may > want to flip start/end for the sequences (the seqs are Bio::LocatableSeq > objects) explicitly if not. Or you may not care about those data and can > ignore. > > $seq = $seq->revcom > > Jason > On Dec 6, 2011, at 8:09 AM, wenbin mei wrote: > > > Hi, > > > > I have a question about revcom the multiple sequence alignment. One way I > > can do convert the format into fasta and revcom individual sequences. 
I > > wonder is there a easy way to convert the multiple sequence alignment as > a > > whole. Thank you for help. > > > > -best, > > wenbin > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Wenbin Mei Ph.D. Student Dr. Brad Barbazuk's Lab Department of Biology University of Florida 509-899-3067 wmei at ufl.edu From kellert at ohsu.edu Tue Dec 6 13:21:39 2011 From: kellert at ohsu.edu (Tom Keller) Date: Tue, 6 Dec 2011 10:21:39 -0800 Subject: [Bioperl-l] Bioperl-l Digest, Vol 104, Issue 3 In-Reply-To: References: Message-ID: I'd start by looking for the section "Searching for genes in genomic DNA" in the HOWTO:Beginners - BioPerl website. Thomas (Tom) Keller, PhD kellert at ohsu.edu 503.494.2442 6588 R Jones Hall (BSc/CROET) MMI DNA Services Member of OHSU Shared Resources On Dec 3, 2011, at 9:00 AM, wrote: > Send Bioperl-l mailing list submissions to > bioperl-l at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.open-bio.org/mailman/listinfo/bioperl-l > or, via email, send a message with subject or body 'help' to > bioperl-l-request at lists.open-bio.org > > You can reach the person managing the list at > bioperl-l-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Bioperl-l digest..." > > > Today's Topics: > > 1. List of genes from genomic intervals (Claudio Scuoppo) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 2 Dec 2011 17:50:28 -0500 > From: Claudio Scuoppo > Subject: [Bioperl-l] List of genes from genomic intervals > To: bioperl-l at lists.open-bio.org > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Hi, > > I am new to BioPerl. I was wondering what`s the best strategy to get > the genes contained in a a series of human genomic interval. 
> Basically, I have a table with: > > Chromosome Start End > > Which module should I be looking at? > Thanks, > Claudio > > > ------------------------------ > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > End of Bioperl-l Digest, Vol 104, Issue 3 > ***************************************** From wenbinmei at gmail.com Tue Dec 6 17:54:51 2011 From: wenbinmei at gmail.com (wenbin mei) Date: Tue, 6 Dec 2011 17:54:51 -0500 Subject: [Bioperl-l] revcom the multiple sequence alignment In-Reply-To: References: Message-ID: Figured out! Thanks for help. -best, wenbin On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich wrote: > I think this would work to update it in place though I haven't tried it > myself > > for my $seq ( $aln->each_seq ) { > $seq->seq( $seq->revcom->seq ); > } > $out->write_aln($aln); > > This may also work - not entirely sure if there is any extra work done on > the meta data (start/end) of the Seq object when this is done. You may > want to flip start/end for the sequences (the seqs are Bio::LocatableSeq > objects) explicitly if not. Or you may not care about those data and can > ignore. > > $seq = $seq->revcom > > Jason > On Dec 6, 2011, at 8:09 AM, wenbin mei wrote: > > > Hi, > > > > I have a question about revcom the multiple sequence alignment. One way I > > can do convert the format into fasta and revcom individual sequences. I > > wonder is there a easy way to convert the multiple sequence alignment as > a > > whole. Thank you for help. > > > > -best, > > wenbin > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Wenbin Mei Ph.D. Student Dr. 
Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu

From momin.amin at gmail.com Tue Dec 6 12:37:16 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Tue, 6 Dec 2011 11:37:16 -0600
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
References: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
Message-ID:

Thanks Jason

On Mon, Dec 5, 2011 at 5:58 PM, Jason Stajich wrote:
> There are several methods that do related things.
>
> Below is the documentation for the method - it allows a more strict calling besides majority rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string which is useful for DNA consensus generating patterns.
>
> If you want a summary as to whether or not there are only conservative amino acid changes in the column use match_line method which generates the similar string that CLUSTALW reports as a summary string for each column.
>
> =head2 consensus_string
>
>  Title     : consensus_string
>  Usage     : $str = $ali->consensus_string($threshold_percent)
>  Function  : Makes a strict consensus
>  Returns   : Consensus string
>  Argument  : Optional treshold ranging from 0 to 100.
>              The consensus residue has to appear at least threshold %
>              of the sequences at a given location, otherwise a '?'
>              character will be placed at that location.
>              (Default value = 0%)
>
> =cut
>
> On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:
>
>> Hi ,
>>
>> I am generating a consensus sequence by aligning two protein homologs
>> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
>> understand the criteria consensus_string() method of simpleAlign uses
>> to determine the consensus at position with dissimilar aminoacids/
>> nucleotide.
Also how would the % cutoffs provided to >> consensus_string() affect the outcome. >> >> >> Thanks, >> Amin >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > From sunwukong at potc.net Wed Dec 7 14:05:20 2011 From: sunwukong at potc.net (sunwukong) Date: Wed, 07 Dec 2011 11:05:20 -0800 Subject: [Bioperl-l] DNA Sequencing two questions Message-ID: <4EDFB8F0.8080001@potc.net> I am not a medical professional but I have two DNA related questions. A year or so ago I realized that if the standard building blocks of life were the amino acids GATC then they could be represented as a base 4 number system (e.g., 0,1,2 and 3). Then any life form could be represented by a number (it would be very long). So I set out on a quest to do this with a small life form. For fun I chose the Spanish Flu which I believe I found on an NIH site. Then I set out and realized that there was no standard. And I did not know if the number would be built with the most significant digit on the left or right. 1. Is there a standard method for representing the ATCD molecules as numbers g = 0 a = 1 t = 2 c = 3 2. is the sequence read left to right or right to left? note: It may be biologically significant if the right values are assigned to the letters GATC, there could be a pattern somewhere that holds significant information. One idea might be to look at DNA sequences in bases other than 4 to see if something jumps out. 
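
As a rough illustration of the base-4 idea described above, here is a minimal Perl sketch. It assumes the digit assignment proposed in the question (g = 0, a = 1, t = 2, c = 3), which is an arbitrary choice and not any standard, and it reads the sequence left to right so that the leftmost base becomes the most significant digit. The subroutine name dna_to_base4 and the test string GATTACA are made up for this example.

```perl
use strict;
use warnings;
use Math::BigInt;    # core module; genomes overflow native integers quickly

# Digit assignment taken from the question above (assumption, not a standard).
my %digit = ( G => 0, A => 1, T => 2, C => 3 );

# Hypothetical helper: returns the base-4 digit string and the same value
# as one big integer, with the leftmost base as the most significant digit.
sub dna_to_base4 {
    my ($dna) = @_;
    my $digits = '';
    my $value  = Math::BigInt->new(0);
    for my $base ( split //, uc $dna ) {
        die "Unexpected character '$base'\n" unless exists $digit{$base};
        $digits .= $digit{$base};
        # Shift the running value one base-4 place and add the new digit.
        $value->bmul(4)->badd( $digit{$base} );
    }
    return ( $digits, $value );
}

my ( $digits, $value ) = dna_to_base4('GATTACA');
print "base-4 digits: $digits\n";    # 0122131
print "as an integer: $value\n";     # 1693
```

Note that a leading G maps to digit 0 and so vanishes from the integer form; the digit string is the safer representation if the sequence has to be recovered later.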
http://www.insectscience.org/2.10/ref/fig5a.gif VR Pat Kirol 509 442-2214 From Russell.Smithies at agresearch.co.nz Wed Dec 7 16:59:18 2011 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Thu, 8 Dec 2011 10:59:18 +1300 Subject: [Bioperl-l] DNA Sequencing two questions In-Reply-To: <4EDFB8F0.8080001@potc.net> References: <4EDFB8F0.8080001@potc.net> Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve. I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions? Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes? But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery. But don't let this stop you uncovering the great secret hidden in our genes :-) On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html --Russell > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of sunwukong > Sent: Thursday, 8 December 2011 8:05 a.m. > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] DNA Sequencing two questions > > I am not a medical professional but I have two DNA related questions. 
> > A year or so ago I realized that if the standard building blocks of life were the > amino acids GATC then they could be represented as a base 4 number > system (e.g., 0,1,2 and 3). Then any life form could be represented by a > number (it would be very long). So I set out on a quest to do this with a small > life form. For fun I chose the Spanish Flu which I believe I found on an NIH > site. Then I set out and realized that there was no standard. And I did not > know if the number would be built with the most significant digit on the left > or right. > > 1. Is there a standard method for representing the ATCD molecules as > numbers g = 0 a = 1 t = 2 c = 3 > > 2. is the sequence read left to right or right to left? > > note: It may be biologically significant if the right values are assigned to the > letters GATC, there could be a pattern somewhere that holds significant > information. One idea might be to look at DNA sequences in bases other > than 4 to see if something jumps out. > > http://www.insectscience.org/2.10/ref/fig5a.gif > > VR > Pat Kirol > 509 442-2214 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. 
======================================================================= From jason.stajich at gmail.com Wed Dec 7 17:53:10 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 7 Dec 2011 14:53:10 -0800 Subject: [Bioperl-l] DNA Sequencing two questions In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> References: <4EDFB8F0.8080001@potc.net> <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> Message-ID: <9BDA2BFA-264F-45B2-9234-A6BC9402FBA8@gmail.com> For other fun picture games -- You can look at patterns of motifs/words in a chaos game representation of genomes. http://mbe.oxfordjournals.org/content/16/10/1391.long http://mbe.oxfordjournals.org/content/20/6/901.long On Dec 7, 2011, at 1:59 PM, Smithies, Russell wrote: > I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve. > I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions? Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png > Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes? > > But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery. 
> > But don't let this stop you uncovering the great secret hidden in our genes :-) > > On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html > > --Russell > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of sunwukong >> Sent: Thursday, 8 December 2011 8:05 a.m. >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] DNA Sequencing two questions >> >> I am not a medical professional but I have two DNA related questions. >> >> A year or so ago I realized that if the standard building blocks of life were the >> amino acids GATC then they could be represented as a base 4 number >> system (e.g., 0,1,2 and 3). Then any life form could be represented by a >> number (it would be very long). So I set out on a quest to do this with a small >> life form. For fun I chose the Spanish Flu which I believe I found on an NIH >> site. Then I set out and realized that there was no standard. And I did not >> know if the number would be built with the most significant digit on the left >> or right. >> >> 1. Is there a standard method for representing the ATCD molecules as >> numbers g = 0 a = 1 t = 2 c = 3 >> >> 2. is the sequence read left to right or right to left? >> >> note: It may be biologically significant if the right values are assigned to the >> letters GATC, there could be a pattern somewhere that holds significant >> information. One idea might be to look at DNA sequences in bases other >> than 4 to see if something jumps out. 
>> >> http://www.insectscience.org/2.10/ref/fig5a.gif >> >> VR >> Pat Kirol >> 509 442-2214 >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. > ======================================================================= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Russell.Smithies at agresearch.co.nz Wed Dec 7 19:29:47 2011 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Thu, 8 Dec 2011 13:29:47 +1300 Subject: [Bioperl-l] DNA Sequencing two questions In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> References: <4EDFB8F0.8080001@potc.net> <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF245@exchsth.agresearch.co.nz> I tried again and came up with this: http://www.bioperl.org/w/images/7/7a/Autostereogram.png If you look carefully, you can see the answer to life, the universe, and everything!! --Russell > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Smithies, Russell > Sent: Thursday, 8 December 2011 10:59 a.m. 
> To: 'sunwukong'; bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] DNA Sequencing two questions > > I did something similar a few years ago (after watching the movie "Contact" I > think) and encoded codons as RGB values and drew an image of a genome. > Looked much like random noise but I might try it again and draw as a space > filling curve. > I guess if you're looking for "hidden messages", why restrict yourself to 2 > dimensions? Perhaps something pops out as a single-image stereogram eg. > http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Ra > ndom_Dot_Shark.png > Perhaps it's a 3D "object" represented by slices drawn in a series of 2D > planes? > > But you need a bit of biological background as there will be patterns simply > because of the way genes "work" and are laid out in chromosomes. You > need to remember that DNA is effectively a 2D representation of a 3D > protein structure and there is already much hidden information we know we > don't understand - a "simple" task like how proteins fold is barely understood > and why some become prions is still a mystery. > > But don't let this stop you uncovering the great secret hidden in our genes :-) > > On a similar note, have a look at http://medgadget.com/2011/10/send-your- > secret-message-hidden-in-bacteria.html > > --Russell > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of sunwukong > > Sent: Thursday, 8 December 2011 8:05 a.m. > > To: bioperl-l at bioperl.org > > Subject: [Bioperl-l] DNA Sequencing two questions > > > > I am not a medical professional but I have two DNA related questions. > > > > A year or so ago I realized that if the standard building blocks of > > life were the amino acids GATC then they could be represented as a > > base 4 number system (e.g., 0,1,2 and 3). Then any life form could be > > represented by a number (it would be very long). 
So I set out on a > > quest to do this with a small life form. For fun I chose the Spanish > > Flu which I believe I found on an NIH site. Then I set out and > > realized that there was no standard. And I did not know if the number > > would be built with the most significant digit on the left or right. > > > > 1. Is there a standard method for representing the ATCD molecules as > > numbers g = 0 a = 1 t = 2 c = 3 > > > > 2. is the sequence read left to right or right to left? > > > > note: It may be biologically significant if the right values are > > assigned to the letters GATC, there could be a pattern somewhere that > > holds significant information. One idea might be to look at DNA > > sequences in bases other than 4 to see if something jumps out. > > > > http://www.insectscience.org/2.10/ref/fig5a.gif > > > > VR > > Pat Kirol > > 509 442-2214 > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > ========================================================== > ============= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities to which > it is addressed and may contain confidential and/or privileged material. Any > review, retransmission, dissemination or other use of, or taking of any action > in reliance upon, this information by persons or entities other than the > intended recipients is prohibited by AgResearch Limited. If you have received > this message in error, please notify the sender immediately. 
> ========================================================== > ============= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From lumos.lumos.lumos at gmail.com Fri Dec 9 11:47:36 2011 From: lumos.lumos.lumos at gmail.com (lumos lumos) Date: Fri, 9 Dec 2011 08:47:36 -0800 Subject: [Bioperl-l] Mouse->Human homologues ? Message-ID: Hello, Is there a way to get human homologues for a mouse gene list where I get all human genes(symbols) as text output ? Thank you LM From cjfields at illinois.edu Fri Dec 9 12:17:20 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 9 Dec 2011 17:17:20 +0000 Subject: [Bioperl-l] Mouse->Human homologues ? In-Reply-To: References: Message-ID: There are lots of databases that have this capability (ensembl, orthodb, homologene, oma, to name only a few). Have you tried a simple search for this, or did you want expert opinion on the matter? chris PS - Just to note, there is a lot of controversy swirling about re: the ortholog conjecture and some recently published papers calling it into question using human-mouse data, worth a look if you're trotting this path to know the current situation. If you have access to F1000, see the following (paper itself is open :) Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. F1000.com/12462957 On Dec 9, 2011, at 10:47 AM, lumos lumos wrote: > Hello, > > Is there a way to get human homologues for a mouse gene list where I get > all human genes(symbols) as text output ? 
> > Thank you > LM > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From lumos.lumos.lumos at gmail.com Fri Dec 9 12:29:24 2011 From: lumos.lumos.lumos at gmail.com (lumos lumos) Date: Fri, 9 Dec 2011 09:29:24 -0800 Subject: [Bioperl-l] Mouse->Human homologues ? In-Reply-To: References: Message-ID: Hi Chris, Thanks for your reply. I wanted to know if there is anyway you can do it via script/automatically in perl for a list of mouse genes whose human homologues I require. LM On Fri, Dec 9, 2011 at 9:17 AM, Fields, Christopher J wrote: > There are lots of databases that have this capability (ensembl, orthodb, > homologene, oma, to name only a few). Have you tried a simple search for > this, or did you want expert opinion on the matter? > > chris > > PS - Just to note, there is a lot of controversy swirling about re: the > ortholog conjecture and some recently published papers calling it into > question using human-mouse data, worth a look if you're trotting this path > to know the current situation. If you have access to F1000, see the > following (paper itself is open :) > > Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. > Testing the ortholog conjecture with comparative functional genomic data > from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: > 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. > F1000.com/12462957 > > On Dec 9, 2011, at 10:47 AM, lumos lumos wrote: > > > Hello, > > > > Is there a way to get human homologues for a mouse gene list where I get > > all human genes(symbols) as text output ? 
> > > > Thank you > > LM > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From lumos.lumos.lumos at gmail.com Wed Dec 7 23:47:19 2011 From: lumos.lumos.lumos at gmail.com (lumos lumos) Date: Wed, 7 Dec 2011 20:47:19 -0800 Subject: [Bioperl-l] Perl parsing Message-ID: Hello, I have a text file(tab-delim) with some gene names as shown below. *BRCA1: breast cancer 1, early onset TNF: tumor necrosis factor OMG: oligodendrocyte myelin glycoprotein* I would like to get the list of gene name BRCA1,TNF,OMG that is before the colon(:) . How do I parse in perl this text file with this list of genes? Thanks in advance. LM From b.m.forde at umail.ucc.ie Fri Dec 9 11:52:56 2011 From: b.m.forde at umail.ucc.ie (BForde) Date: Fri, 9 Dec 2011 08:52:56 -0800 (PST) Subject: [Bioperl-l] Genbank files Message-ID: <32941955.post@talk.nabble.com> Hello all, I am new to Bioperl so I apologise if this is stupid question. For CDS features I which to add additional qualifiers e.g. /colour and /note qualifiers. I have looked at the BioPerl wiki but am still unsure as how to do this? regards Brian -- View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From jboddu at illinois.edu Fri Dec 9 14:59:39 2011 From: jboddu at illinois.edu (Boddu, Jayanand) Date: Fri, 9 Dec 2011 19:59:39 +0000 Subject: [Bioperl-l] Batch processing of Data Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu> Hi Anyone: Please let me know if the following is practical with PERL. My data output can be described as following. 1. Hundreds of samples are run. 2. A batch output sends data from each sample to its own "folder". Output is in the form of few text files, spreadsheets and PDF files. 3. One of the spreadsheet has the data of most interest. 4. 
This means I end up having hundreds of folders. 5. The spreadsheet with the data has multiple worksheets out of which a couple have the interesting data to be processed (Please find attached a spreadsheet output in which the data is organized and the worksheets of my interest are named as "Compound" and "Peak". Yellow high-lighted columns in each worksheet has the data to be processed). OK. That's long description. NOW. Is it practical to write a PERL/or any script to; 1. Enter each folder. 2. Look for the spreadsheet of interest. 3. Look for worksheets named "Compound" and "Peak". 4. Look for the specific columns of interest. 5. Copy paste the columns of interest into a new spreadsheet/text file with data from each folder next to each other. This final spreadsheet will pass through a bunch of other calculations. I apologize for this long and painful description. However, it would be great if this can be done. Thanks Jay -------------- next part -------------- A non-text attachment was scrubbed... Name: REPORT01.xls Type: application/vnd.ms-excel Size: 93696 bytes Desc: REPORT01.xls URL: From cjfields at illinois.edu Fri Dec 9 15:37:48 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 9 Dec 2011 20:37:48 +0000 Subject: [Bioperl-l] Perl parsing In-Reply-To: References: Message-ID: On Dec 7, 2011, at 10:47 PM, lumos lumos wrote: > Hello, > > I have a text file(tab-delim) with some gene names as shown below. > > *BRCA1: breast cancer 1, early onset > > TNF: tumor necrosis factor > > OMG: oligodendrocyte myelin glycoprotein* > > I would like to get the list of gene name BRCA1,TNF,OMG that is before the > colon(:) . > How do I parse in perl this text file with this list of genes? 'Very carefully?' Okay, I'll try to refrain from further sarcasm, but I'm confused, what does this have to do with BioPerl (*the toolkit*) specifically? That is what this mailing list is for. Just to note, this is a very common perl task. 
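[Editorial sketch, not part of the original thread.] As a rough illustration of the "gene name before the colon" task asked about above, here is a minimal sketch in plain Perl. It assumes the input looks like the three sample lines in the question, with one entry per line and an optional leading "*". The helper name gene_name is hypothetical and does not come from BioPerl or any other library.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text before the first colon on a line,
# or undef if the line has no colon. A leading '*' and surrounding
# whitespace are dropped.
sub gene_name {
    my ($line) = @_;
    return unless defined $line;
    my ($name) = $line =~ /^\s*\*?\s*([^:\s][^:]*?)\s*:/;
    return $name;
}

# Sample lines copied from the question.
my @lines = (
    '*BRCA1: breast cancer 1, early onset',
    'TNF: tumor necrosis factor',
    'OMG: oligodendrocyte myelin glycoprotein*',
);

# Keep only the lines that actually yielded a name.
my @genes = grep { defined } map { gene_name($_) } @lines;
print join(',', @genes), "\n";    # prints BRCA1,TNF,OMG
```

For a real file, the same helper could be called on each line read from a filehandle (after chomp). Lines without a colon are skipped rather than treated as errors, which is a design choice of this sketch only.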
The answer is attainable by searching for it (not to mention taking the time to learn basic perl). For instance: http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings One of the many links found by simply using Google: http://lmgtfy.com/?q=perl+parse+tab+file I'll leave the regex munging to you. (okay, I failed at refraining from sarcasm, ah well it's friday). chris > Thanks in advance. > LM From jason.stajich at gmail.com Fri Dec 9 16:18:38 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Fri, 9 Dec 2011 13:18:38 -0800 Subject: [Bioperl-l] Genbank files In-Reply-To: <32941955.post@talk.nabble.com> References: <32941955.post@talk.nabble.com> Message-ID: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> $feature->add_tag_value('color','blue'); On Dec 9, 2011, at 8:52 AM, BForde wrote: > > Hello all, > > I am new to Bioperl so I apologise if this is stupid question. > > For CDS features I which to add additional qualifiers e.g. /colour and /note > qualifiers. I have looked at the BioPerl wiki but am still unsure as how to > do this? > > regards > > Brian > -- > View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From bosborne11 at verizon.net Fri Dec 9 15:31:15 2011 From: bosborne11 at verizon.net (Brian Osborne) Date: Fri, 09 Dec 2011 15:31:15 -0500 Subject: [Bioperl-l] Genbank files In-Reply-To: <32941955.post@talk.nabble.com> References: <32941955.post@talk.nabble.com> Message-ID: <3AB0716B-EC07-4BD6-9FC8-0C47A29FC0BA@verizon.net> Brian, Reasonable question. 
Start here:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

If you've never used Bioperl then:

http://www.bioperl.org/wiki/HOWTO:Beginners

Brian

On Dec 9, 2011, at 11:52 AM, BForde wrote:

>
> Hello all,
>
> I am new to Bioperl so I apologise if this is stupid question.
>
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
>
> regards
>
> Brian
> --
> View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From asjo at koldfront.dk Fri Dec 9 17:25:00 2011
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Fri, 09 Dec 2011 23:25:00 +0100
Subject: [Bioperl-l] Batch processing of Data
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <871usdpemb.fsf@topper.koldfront.dk>

On Fri, 9 Dec 2011 19:59:39 +0000, Boddu, wrote:

> Please let me know if the following is practical with PERL.

It might very well be, yes.

Modules you might be interested in include Spreadsheet::ParseExcel,
Spreadsheet::XLSX, Spreadsheet::WriteExcel and Excel::Writer::XLSX[1].

A big help in finding interesting CPAN modules is the search engine on
https://metacpan.org/

Depending on your platform and preference using find(1) might also be
helpful to traverse the folders, rather than doing so in Perl.

Note that none of this has anything to do with BioPerl as such, though,
and you'll need to do some actual programming to get the job done.

Best regards,

Adam

[1] http://blogs.perl.org/users/john_mcnamara/2011/10/spreadsheetwriteexcel-is-dead-long-live-excelwriterxlsx.html

--
"Angels can fly because they take themselves lightly."
Adam Sjøgren
asjo at koldfront.dk

From David.Messina at sbc.su.se Fri Dec 9 17:30:23 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Fri, 9 Dec 2011 23:30:23 +0100
Subject: [Bioperl-l] Batch processing of Data
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID:

Yes, it can be done. However, it has nothing to do with this mailing list.

Steps 1 and 2 are basic Perl. For steps 3 through 5, try googling "perl parse excel".

Dave

On Fri, Dec 9, 2011 at 20:59, Boddu, Jayanand wrote:

> Hi Anyone:
> Please let me know if the following is practical with PERL.
> My data output can be described as following.
>
> 1. Hundreds of samples are run.
>
> 2. A batch output sends data from each sample to its own "folder".
> Output is in the form of few text files, spreadsheets and PDF files.
>
> 3. One of the spreadsheet has the data of most interest.
>
> 4. This means I end up having hundreds of folders.
>
> 5. The spreadsheet with the data has multiple worksheets out of
> which a couple have the interesting data to be processed (Please find
> attached a spreadsheet output in which the data is organized and the
> worksheets of my interest are named as "Compound" and "Peak". Yellow
> high-lighted columns in each worksheet has the data to be processed).
> OK. That's long description.
> NOW. Is it practical to write a PERL/or any script to;
>
> 1. Enter each folder.
>
> 2. Look for the spreadsheet of interest.
>
> 3. Look for worksheets named "Compound" and "Peak".
>
> 4. Look for the specific columns of interest.
>
> 5. Copy paste the columns of interest into a new spreadsheet/text
> file with data from each folder next to each other.
>
> This final spreadsheet will pass through a bunch of other calculations.
>
> I apologize for this long and painful description.
> However, it would be great if this can be done.
> Thanks > Jay > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From lsbrath at gmail.com Sat Dec 10 16:39:44 2011 From: lsbrath at gmail.com (Mgavi Brathwaite) Date: Sat, 10 Dec 2011 16:39:44 -0500 Subject: [Bioperl-l] Perl parsing In-Reply-To: References: Message-ID: Yes grasshopper you have to suffer a little bit. Learn Perl first, then step up to BioPerl. Chris I feel you concerning the power of Regex, and the sarcasm. Lom On Fri, Dec 9, 2011 at 3:37 PM, Fields, Christopher J wrote: > On Dec 7, 2011, at 10:47 PM, lumos lumos wrote: > > > Hello, > > > > I have a text file(tab-delim) with some gene names as shown below. > > > > *BRCA1: breast cancer 1, early onset > > > > TNF: tumor necrosis factor > > > > OMG: oligodendrocyte myelin glycoprotein* > > > > I would like to get the list of gene name BRCA1,TNF,OMG that is before > the > > colon(:) . > > How do I parse in perl this text file with this list of genes? > > 'Very carefully?' > > Okay, I'll try to refrain from further sarcasm, but I'm confused, what > does this have to do with BioPerl (*the toolkit*) specifically? That is > what this mailing list is for. > > Just to note, this is a very common perl task. The answer is attainable by > searching for it (not to mention taking the time to learn basic perl). For > instance: > > > http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings > > One of the many links found by simply using Google: > > http://lmgtfy.com/?q=perl+parse+tab+file > > I'll leave the regex munging to you. > > (okay, I failed at refraining from sarcasm, ah well it's friday). > > chris > > > > Thanks in advance. 
> > LM > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From pawan.mani2 at gmail.com Mon Dec 5 17:00:09 2011 From: pawan.mani2 at gmail.com (pawan.mani2 at gmail.com) Date: Tue, 6 Dec 2011 03:30:09 +0530 Subject: [Bioperl-l] bioperl in cygwin Message-ID: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC> Hi I would like to after the givibg following commands in cgwin terminal: perl -MCPAN -e shell then I type o conf prerequisites_policy follow o conf commit install Bundle::CPAN install Module::Build d /bioperl/ then we you get a list of different versions. I selected CJFIELDS/BioPerl-1.6.1.96 install CJFIELDS/BioPerl-1.6.1.96.tar.gz but build.install was not ok. Kindly allow me to know the step of complete bioprl installatin cygwin undre windows 7. thanks in advanced. with best regards, Pawan From cjfields at illinois.edu Sun Dec 11 13:22:01 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sun, 11 Dec 2011 18:22:01 +0000 Subject: [Bioperl-l] bioperl in cygwin In-Reply-To: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC> References: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC> Message-ID: Pawan, Hard to say what the problem is w/o supplying warnings/errors. Prior to doing this, you should try installing BioPerl-1.6.901 (the latest CPAN release). You can try a direct installation of the distribution, but the easiest way to get the latest version is to try installing Bio::Perl. (I'm not sure what BioPerl-1.6.1.96 is, but this seems wrong) chris On Dec 5, 2011, at 4:00 PM, wrote: > Hi > I would like to after the givibg following commands in cgwin terminal: > > > perl -MCPAN -e shell > > then I type > > o conf prerequisites_policy follow > o conf commit > install Bundle::CPAN > install Module::Build > d /bioperl/ > then we you get a list of different versions. 
> I selected CJFIELDS/BioPerl-1.6.1.96
> install CJFIELDS/BioPerl-1.6.1.96.tar.gz
>
>
> but build.install was not ok.
>
> Kindly allow me to know the step of complete bioprl installatin cygwin undre windows 7.
>
> thanks in advanced.
>
> with best regards,
> Pawan
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From b.m.forde at umail.ucc.ie Tue Dec 13 06:03:50 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Tue, 13 Dec 2011 03:03:50 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32965574.post@talk.nabble.com>

Thank you for the replies.

My script (below) reads in a list of locus_tags from a tab-delimited text
file, compares these locus_tags to the locus_tags in a GenBank file and,
where they are equal, should add new features.
The line
$feat->add_tag_value()
needs to be defined. In the BioPerl wiki this variable appears to be defined
by giving it coordinates etc. (creating a new feature). I wish to add
features to the CDS key when the locus_tags are identical. Is this possible?

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        # ADD NEW FEATURES

                    }
                }
            }
        }
    }
}

The script works down as far as the comparison point, where locus_tags in the
GenBank file "Contig100.gb" are compared against a list of locus_tags from a
delimited text file.
regards Brian Jason Stajich-5 wrote: > > $feature->add_tag_value('color','blue'); > > On Dec 9, 2011, at 8:52 AM, BForde wrote: > >> >> Hello all, >> >> I am new to Bioperl so I apologise if this is stupid question. >> >> For CDS features I which to add additional qualifiers e.g. /colour and >> /note >> qualifiers. I have looked at the BioPerl wiki but am still unsure as how >> to >> do this? >> >> regards >> >> Brian >> -- >> View this message in context: >> http://old.nabble.com/Genbank-files-tp32941955p32941955.html >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Jason Stajich > jason.stajich at gmail.com > jason at bioperl.org > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32965574.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From roy.chaudhuri at gmail.com Tue Dec 13 06:52:05 2011 From: roy.chaudhuri at gmail.com (Roy Chaudhuri) Date: Tue, 13 Dec 2011 11:52:05 +0000 Subject: [Bioperl-l] Genbank files In-Reply-To: <32965574.post@talk.nabble.com> References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32965574.post@talk.nabble.com> Message-ID: <4EE73C65.1080101@gmail.com> Hi Brian, Just to check I have understood you, you want to read through a genbank file and add additional tags to features which are listed in a tab-delimited file of locus tags? Your code is on the right lines, but it would be much more efficient to read your tab-delimited locus_tags into a hash, and check using exists, rather than ploughing through the (potentially very long) list of locus tags every time. 
Also, be careful with new lines in your tab file (you can safely get rid of them using "chomp"). You can miss out the "has_tag" check by using "get_tagset_values" instead of "get_tag_values", since the former does not complain if the tag is not present. Once you have modified your sequence object, you need to write it out to a new file (or STDOUT) using Bio::SeqIO.

Also, just a couple of general points, you should always "use warnings" (or even better "use warnings FATAL=>qw(all)") since that can help solve many problems, and your code may be easier to read if you don't include the word "object" in all your variable names (after all you wouldn't say you write on a paper object using a pen object).

use strict;
use warnings FATAL=>qw(all);
use Bio::SeqIO;
open (my $list, 'list') or die $!;
my %V;
while (<$list>){
    chomp;
    $V{(split(/\t/, $_))[0]}=1;
}
my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;
for my $feat_object ($seq_object->remove_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        for my $V3 ($feat_object->get_tagset_values('locus_tag')){
            if (exists $V{$V3}){
                $feat_object->add_tag_value(listed_in_tab_file=>'yes');
                next;
            }
        }
    }
    $seq_object->add_SeqFeature($feat_object);
}
Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);

Hope this helps.
Cheers,
Roy.

On 13/12/2011 11:03, BForde wrote:
>
> Than you for the replies.
>
> My script (below) reads in a list of locus_tags from a tab delimited text
> file. Compares these locus_tags to the locus_tags in a genbank file and
> where they are equal adds new features.
> the line
> $feat->add_tag_value()
> needs to be defined. In the bioperl wiki this variable appears to be defined
> by giving it coordinates etc (creating a new feature). I wish to add
> features to CDS key when the locus_tags are identical. Is this possible?
>
> use strict;
> use Bio::SeqIO;
>
> my @V;
> open (LIST1, 'list') ||die;
> while (<LIST1>){
> push @V, (split(/\t/, $_))[0];
> }
> close(LIST1);
>
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
>
> for my $feat_object ($seq_object->get_SeqFeatures){
> if ($feat_object->primary_tag eq "CDS"){
> if ($feat_object->has_tag('locus_tag')){
> for my $V3 ($feat_object->get_tag_values('locus_tag')){
> for my $V1 (@V) {
> if ($V1 eq $V3){
> ADD NEW FEATURES
>
> }
> }
> }
> }
> }
> }
>
> The script works down as far as the comparison point where locus_tags in the
> genbankfile "Contig100.gb" are compared against a list of locus_tags from a
> delimited txt file.
>
>
> regards
>
> Brian
>
> Jason Stajich-5 wrote:
>>
>> $feature->add_tag_value('color','blue');
>>
>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>
>>>
>>> Hello all,
>>>
>>> I am new to Bioperl so I apologise if this is stupid question.
>>>
>>> For CDS features I which to add additional qualifiers e.g. /colour and
>>> /note
>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>>> to
>>> do this?
>>>
>>> regards
>>>
>>> Brian
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> Jason Stajich >> jason.stajich at gmail.com >> jason at bioperl.org >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > From b.m.forde at umail.ucc.ie Tue Dec 13 09:22:01 2011 From: b.m.forde at umail.ucc.ie (Brian Forde) Date: Tue, 13 Dec 2011 14:22:01 +0000 Subject: [Bioperl-l] Genbank files In-Reply-To: <4EE73C65.1080101@gmail.com> References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32965574.post@talk.nabble.com> <4EE73C65.1080101@gmail.com> Message-ID: Hi Roy, Thank you. That works perfectly. I have to confess that someone else told me to use hashes but I could not get them to work.. Thanks again regards Brian On Tue, Dec 13, 2011 at 11:52 AM, Roy Chaudhuri wrote: > Hi Brian, > > Just to check I have understood you, you want to read through a genbank > file and add additional tags to features which are listed in a > tab-delimited file of locus tags? > > Your code is on the right lines, but it would be much more efficient to > read your tab-delimited locus_tags into a hash, and check using exists, > rather than ploughing through the (potentially very long) list of locus > tags every time. Also, be careful with new lines in your tab file (you can > safely get rid of them using "chomp"). You can miss out the "has_tag" check > by using "get_tagset_values" instead of "get_tag_values", since the former > does not complain if the tag is not present. Once you have modified your > sequence object, you need to write it out to a new file (or STDOUT) using > Bio::SeqIO. 
>
> Also, just a couple of general points, you should always "use warnings"
> (or even better "use warnings FATAL=>qw(all)") since that can help solve
> many problems, and your code may be easier to read if you don't include the
> word "object" in all your variable names (after all you wouldn't say you
> write on a paper object using a pen object).
>
> use strict;
> use warnings FATAL=>qw(all);
> use Bio::SeqIO;
> open (my $list, 'list') or die $!;
> my %V;
> while (<$list>){
> chomp;
> $V{(split(/\t/, $_))[0]}=1;
>
> }
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
> for my $feat_object ($seq_object->remove_SeqFeatures){
>
> if ($feat_object->primary_tag eq "CDS"){
> for my $V3 ($feat_object->get_tagset_values('locus_tag')){
> if (exists $V{$V3}){
> $feat_object->add_tag_value(listed_in_tab_file=>'yes');
> next;
> }
> }
> }
> $seq_object->add_SeqFeature($feat_object);
> }
> Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);
>
> Hope this helps.
> Cheers,
> Roy.
>
>
> On 13/12/2011 11:03, BForde wrote:
>
>>
>> Than you for the replies.
>>
>> My script (below) reads in a list of locus_tags from a tab delimited text
>> file. Compares these locus_tags to the locus_tags in a genbank file and
>> where they are equal adds new features.
>> the line
>> $feat->add_tag_value()
>> needs to be defined. In the bioperl wiki this variable appears to be
>> defined
>> by giving it coordinates etc (creating a new feature). I wish to add
>> features to CDS key when the locus_tags are identical. Is this possible?
>>
>> use strict;
>> use Bio::SeqIO;
>>
>> my @V;
>> open (LIST1, 'list') ||die;
>> while (<LIST1>){
>> push @V, (split(/\t/, $_))[0];
>> }
>> close(LIST1);
>>
>> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
>> my $seq_object = $seqio_object->next_seq;
>>
>> for my $feat_object ($seq_object->get_SeqFeatures){
>> if ($feat_object->primary_tag eq "CDS"){
>> if ($feat_object->has_tag('locus_tag')){
>> for my $V3 ($feat_object->get_tag_values('locus_tag')){
>> for my $V1 (@V) {
>> if ($V1 eq $V3){
>> ADD NEW FEATURES
>>
>> }
>> }
>> }
>> }
>> }
>> }
>>
>> The script works down as far as the comparison point where locus_tags in
>> the
>> genbankfile "Contig100.gb" are compared against a list of locus_tags from
>> a
>> delimited txt file.
>>
>>
>> regards
>>
>> Brian
>>
>> Jason Stajich-5 wrote:
>>
>>>
>>> $feature->add_tag_value('color','blue');
>>>
>>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>>
>>>
>>>> Hello all,
>>>>
>>>> I am new to Bioperl so I apologise if this is stupid question.
>>>>
>>>> For CDS features I which to add additional qualifiers e.g. /colour and
>>>> /note
>>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>>>> to
>>>> do this?
>>>>
>>>> regards
>>>>
>>>> Brian
>>>> --
>>>> View this message in context:
>>>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> Jason Stajich
>>> jason.stajich at gmail.com
>>> jason at bioperl.org
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>
>

--
Brian Forde
Microbiology Dept.
Bioscience Institute.
Room 4.11
University College Cork
Cork
Ireland
tel:+353 21 4901306
email: b.m.forde at umail.ucc.ie

From b.m.forde at umail.ucc.ie Mon Dec 12 12:20:53 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Mon, 12 Dec 2011 09:20:53 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32959999.post@talk.nabble.com>

Thank you for the replies.

I am unsure how to use the line below with my script. My script so far
reads:

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        # ADD NEW FEATURES

                    }
                }
            }
        }
    }
}

The script works down as far as the comparison point, where locus_tags in the
GenBank file "Contig100.gb" are compared against a list of locus_tags from a
delimited text file.
If possible, could you show me how to amend my script so I can add new
features?

regards

Brian

Jason Stajich-5 wrote:
>
> $feature->add_tag_value('color','blue');
>
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>
>>
>> Hello all,
>>
>> I am new to Bioperl so I apologise if this is stupid question.
>>
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>>
>> regards
>>
>> Brian
>> --
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

--
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32959999.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From Russell.Smithies at agresearch.co.nz Tue Dec 13 22:17:02 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Wed, 14 Dec 2011 16:17:02 +1300
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32959999.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32959999.post@talk.nabble.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF27A@exchsth.agresearch.co.nz>

Something like this:

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        #ADD NEW FEATURES
                        $feat_object->add_tag_value('color','blue');
                    }
                }
            }
        }
    }
}

#write the new annotations
my $io = Bio::SeqIO->new(-format => "genbank", -file => ">new.gb" );
$io->write_seq($seq_object);

Take another look at http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Building_Your_Own_Sequences

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of BForde
> Sent: Tuesday, 13 December 2011 6:21
a.m.
> To: Bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Genbank files
>
>
> Than you for the replies.
>
> I am unsure as to how to use the line below with my script. My script so far
> reads
>
> use strict;
> use Bio::SeqIO;
>
> my @V;
> open (LIST1, 'list') ||die;
> while (<LIST1>){
> push @V, (split(/\t/, $_))[0];
> }
> close(LIST1);
>
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
>
> for my $feat_object ($seq_object->get_SeqFeatures){
> if ($feat_object->primary_tag eq "CDS"){
> if ($feat_object->has_tag('locus_tag')){
> for my $V3 ($feat_object->get_tag_values('locus_tag')){
> for my $V1 (@V) {
> if ($V1 eq $V3){
> ADD NEW FEATURES
>
> }
> }
> }
> }
> }
> }
>
> The script works down as far as the comparison point where locus_tags in the
> genbankfile "Contig100.gb" are compared against a list of locus_tags from a
> delimited txt file.
> I possbile could you show me how to amend my script so I can add new
> features
>
> regards
>
> Brian
>
> Jason Stajich-5 wrote:
> >
> > $feature->add_tag_value('color','blue');
> >
> > On Dec 9, 2011, at 8:52 AM, BForde wrote:
> >
> >>
> >> Hello all,
> >>
> >> I am new to Bioperl so I apologise if this is stupid question.
> >>
> >> For CDS features I which to add additional qualifiers e.g. /colour
> >> and /note qualifiers. I have looked at the BioPerl wiki but am still
> >> unsure as how to do this?
> >>
> >> regards
> >>
> >> Brian
> >> --
> >> View this message in context:
> >> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
From l.m.timmermans at students.uu.nl Wed Dec 14 10:43:24 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Wed, 14 Dec 2011 16:43:24 +0100 Subject: [Bioperl-l] Announcing Bio::SFF Message-ID: Hi all, As already mentioned on IRC, I recently wrote an SFF parser and uploaded it to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF entries don't map 1:1 with Bio::Seq objects), but if anyone has the time to write one I'd be most grateful.
Leon From p.j.a.cock at googlemail.com Wed Dec 14 11:03:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 14 Dec 2011 16:03:05 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans wrote: > Hi all, > > As already mentioned on IRC, I recently wrote a SFF parser and uploaded it > to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF > entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time > to write one I'd be most grateful. > > Leon Hi Leon, Have you looked at the index block at all, in order to offer random access by read ID, or to access the Roche XML manifest? Please ask if you need more information about this - or if you can read Python: https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py Is this building on Miguel Pignatelli's work? I don't recall seeing any follow up posts from him after this one: http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html Peter From cjfields at illinois.edu Wed Dec 14 11:12:58 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 14 Dec 2011 16:12:58 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: , Message-ID: <3BBC1132-E768-45D9-A107-ACD51791722D@illinois.edu> Leon, Nice! Definitely a good idea to have the lower-level parser and the BioPerl-bridging code separate, one of my concerns with the various parsers we have right now which hard-wire BioPerl classes in with the parser (makes it hard for optimization). Chris PS- Peter, I don't think the two projects are related, but I suppose Leon is the best to answer that. Sent from my stupid iPad, now my laptop's on the fritz On Dec 14, 2011, at 10:04 AM, "Peter Cock" wrote: > On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans > wrote: >> Hi all, >> >> As already mentioned on IRC, I recently wrote a SFF parser and uploaded it >> to CPAN as Bio::SFF. 
I haven't written a Bio::SeqIO::sff wrapper yet (SFF >> entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time >> to write one I'd be most grateful. >> >> Leon > > Hi Leon, > > Have you looked at the index block at all, in order to offer random > access by read ID, or to access the Roche XML manifest? Please > ask if you need more information about this - or if you can read Python: > https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py > > Is this building on Miguel Pignatelli's work? I don't recall seeing > any follow up posts from him after this one: > http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html > > Peter > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From l.m.timmermans at students.uu.nl Wed Dec 14 11:27:58 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Wed, 14 Dec 2011 17:27:58 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock wrote: > Hi Leon, > > Have you looked at the index block at all, in order to offer random > access by read ID, or to access the Roche XML manifest? Please > ask if you need more information about this - or if you can read Python: > https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py > I have looked at it, but not implemented it yet. There is no standardized index, and the ones that are in common use either seem stupid (the Roche index, which is essentially just a weirdly formatted sequential list, though that should still be faster than a table scan) or undocumented (hash based index). Is this building on Miguel Pignatelli's work? I don't recall seeing > any follow up posts from him after this one: > http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html > It isn't. I like his idea for reusing BioPython's test files though. 
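A note on where the index block comes from: the common header at the start of an SFF file carries an index offset and an index length, and that is how a reader can jump straight to the index. The sketch below unpacks the fixed 31-byte start of that header in plain Perl. The field layout is the one commonly documented for SFF (magic ".sff", big-endian integers) and should be checked against the format description; the function name and the sample values are made up for illustration, and this is not the code used by Bio::SFF.

```perl
use strict;
use warnings;

# Unpack the fixed 31-byte start of an SFF common header.
# Assumed layout (all integers big-endian): magic, version, index offset
# (64-bit), index length, read count, header length, key length,
# flows per read, flowgram format code.
sub parse_sff_header {
    my ($buf) = @_;
    die "header too short\n" if length($buf) < 31;
    my ($magic, $version, $off_hi, $off_lo, $index_length, $n_reads,
        $header_length, $key_length, $n_flows, $flow_format)
        = unpack 'a4 N N N N N n n n C', $buf;
    die "not an SFF file\n" unless $magic eq '.sff';
    return {
        version         => $version,
        # Combine the two 32-bit halves so this also works on 32-bit perls.
        index_offset    => $off_hi * 2**32 + $off_lo,
        index_length    => $index_length,
        number_of_reads => $n_reads,
        header_length   => $header_length,
        key_length      => $key_length,
        flows_per_read  => $n_flows,
        flowgram_format => $flow_format,
    };
}

# Made-up header for demonstration; a real one would be read from a file
# opened in binary mode.
my $demo = pack 'a4 N N N N N n n n C',
    '.sff', 1, 0, 1000, 200, 10, 440, 4, 400, 1;
my $h = parse_sff_header($demo);
printf "%d reads, index at byte %d (%d bytes)\n",
    $h->{number_of_reads}, $h->{index_offset}, $h->{index_length};
```

With the offset and length in hand, a reader would seek to that position and read the index block in one go; what the block contains depends on which index flavour the file uses.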
Leon From p.j.a.cock at googlemail.com Wed Dec 14 11:44:28 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 14 Dec 2011 16:44:28 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 4:27 PM, Leon Timmermans wrote: > On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock > wrote: >> >> Hi Leon, >> >> Have you looked at the index block at all, in order to offer random >> access by read ID, or to access the Roche XML manifest? Please >> ask if you need more information about this - or if you can read Python: >> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py > > I have looked at it, but not implemented it yet. There is no standardized > index, and the ones that are in common use either seem stupid (the Roche > index, which is essentially just a weirdly formatted sequential list, though > that should still be faster than a table scan) or undocumented (hash based > index). There are two widely used indexes, both from Roche (one with and one without an XML manifest, magic bytes .mft and .srt). They are both just a simple table of the reads names and offsets, sorted alphabetically. This works pretty well for rapid lookup for SFF files (because the read count is not so high), and is pretty easy. I don't think anyone used the hash table style indexes (.hsh), which I assume was a proof of principle or trial in the early days of SFF. One thing to check is what Ion Torrent's SFF files use. I would guess they've followed Roche, but I don't know. After all, the index structure is not defined in the SFF specification - it was left extensible on purpose. >> Is this building on Miguel Pignatelli's work? I don't recall seeing >> any follow up posts from him after this one: >> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html > > It isn't. I like his idea for reusing BioPython's test files though. Yes, please do. 
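To illustrate why a name-sorted table is handy, here is a minimal sketch of a binary search over such a table once it has been loaded into memory. The table layout (an array of [name, offset] pairs), the read names and the offsets are invented for the example; this is not the on-disk Roche index format, and it is not taken from Bio::SFF or Biopython.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: [read name, byte offset] pairs,
# sorted by read name in plain string order.
my @index = (
    [ 'READ_A', 840 ],
    [ 'READ_C', 1624 ],
    [ 'READ_F', 2408 ],
    [ 'READ_K', 3192 ],
);

# Return the offset recorded for $name, or undef if it is not in the index.
sub find_offset {
    my ($index, $name) = @_;
    my ($lo, $hi) = (0, $#{$index});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $name cmp $index->[$mid][0];
        return $index->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 }
        else          { $lo = $mid + 1 }
    }
    return;
}

my $offset = find_offset(\@index, 'READ_F');
print defined $offset ? "READ_F at offset $offset\n" : "READ_F not found\n";
```

For the modest read counts discussed here, loading the table into a hash keyed by read name would work just as well; the binary search only matters if the table is kept on disk or memory is tight.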
Peter From gingerplum at gmail.com Wed Dec 14 00:18:55 2011 From: gingerplum at gmail.com (plum ginger) Date: Tue, 13 Dec 2011 21:18:55 -0800 (PST) Subject: [Bioperl-l] a problem about BLAST Message-ID: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com> Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I need to run BLAST on more than one sequence. However, the BLAST outfile only stores the result for the last sequence. How can I make the outfile store all the results? I would appreciate your help. Thanks very much! Best regards From jason.stajich at gmail.com Thu Dec 15 12:02:47 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Thu, 15 Dec 2011 11:02:47 -0600 Subject: [Bioperl-l] a problem about BLAST In-Reply-To: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com> References: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com> Message-ID: <58E5B487-7FF0-4018-B109-D6595DC2E493@gmail.com> You are probably setting the outfile in each iteration, so each run would overwrite the previous result. You need to show your code if you want someone to help you debug the problem. On Dec 13, 2011, at 11:18 PM, plum ginger wrote: > Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I > need to run BLAST on more than one sequence. However, the BLAST outfile > only stores the result for the last sequence. How can I make the outfile > store all the results? > > I would appreciate your help. Thanks very much! > > > Best regards Jason Stajich jason.stajich at gmail.com jason at bioperl.org From pengyu.ut at gmail.com Fri Dec 16 17:10:27 2011 From: pengyu.ut at gmail.com (Peng Yu) Date: Fri, 16 Dec 2011 16:10:27 -0600 Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment?
Message-ID: Hi, Bio::Das::segment can give me the following warnings without stopping the whole program when the position for the query doesn't exist. I could test the return result and quit when it is []. But this would force my program to have a test whenever I call segment. I'm wondering if there is an automatic way to let Bio::Das::segment stop in such cases. --------------------- WARNING --------------------- MSG: Sequence is not dna or rna, but []. Attempting to revcom, but unsure if this is right --------------------------------------------------- -- Regards, Peng From cjfields at illinois.edu Fri Dec 16 21:48:07 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sat, 17 Dec 2011 02:48:07 +0000 Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment? In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0881CF8E@CITESMBX4.ad.uillinois.edu> Setting verbosity to 2 should convert warnings to exceptions. IIRC, you can either set '-verbose => 2' in the Bio::Das constructor, call '$das->verbose(2)' explicitly, or set the env variable BIOPERLDEBUG=2. chris ________________________________________ From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of Peng Yu [pengyu.ut at gmail.com] Sent: Friday, December 16, 2011 4:10 PM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment? Hi, Bio::Das::segment can give me the following warnings without stopping the whole program when the position for the query doesn't exist. I could test the return result and quit when it is []. But this would force my program to have a test whenever I call segment. I'm wondering if there is an automatic way to let Bio::Das::segment stop in such cases. --------------------- WARNING --------------------- MSG: Sequence is not dna or rna, but [].
Attempting to revcom, but unsure if this is right --------------------------------------------------- -- Regards, Peng _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From anna.fr at gmail.com Mon Dec 19 02:09:15 2011 From: anna.fr at gmail.com (Anna Friedlander) Date: Mon, 19 Dec 2011 20:09:15 +1300 Subject: [Bioperl-l] StandAloneBlastPlus blastdbcmd question Message-ID: Hi all I have a question about using blastdbcmd via Bio::Tools::Run::StandAloneBlastPlus I have some Blast+ search results that I am manipulating in a perl programme, and I would like to retrieve some sequence information for some results using subject sequence IDs, and associated subject start and end indices. If I was using blastdbcmd directly, I would do so using the -entry and -range options. My question is, can I use all the blastdbcmd options (or more specifically, just the -entry and -range options) from within the StandAloneBlastPlus module? My apologies if I don't properly understand how this "wrapper" works! Thanks in advance for your help Anna Friedlander From l.m.timmermans at students.uu.nl Mon Dec 19 09:19:14 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 15:19:14 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock wrote: > There are two widely used indexes, both from Roche (one with and > one without an XML manifest, magic bytes .mft and .srt). They are > both just a simple table of the reads names and offsets, sorted > alphabetically. Yeah, that's what I got from the BioPython code. I didn't know it was sorted though (it doesn't make much sense either, unless they wanted to do a binary search or something). This works pretty well for rapid lookup for SFF files > (because the read count is not so high), and is pretty easy. > It's implemented in Bio::SFF 0.003. 
I did restructure my code into two readers though, since doing sequential and random-access in the class didn't make much sense code-wise. I don't think anyone used the hash table style indexes (.hsh), which > I assume was a proof of principle or trial in the early days of SFF. > I see, too bad. > One thing to check is what Ion Torrent's SFF files use. I would > guess they've followed Roche, but I don't know. After all, the > index structure is not defined in the SFF specification - it was > left extensible on purpose. > Yeah, we should check that too. Yes, please do. > It's added to 0.003. The lack of tests was bothering me, but the SFFs I had at hand were not suitable. Leon From p.j.a.cock at googlemail.com Mon Dec 19 09:31:18 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 14:31:18 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 2:19 PM, Leon Timmermans wrote: > On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock wrote: > >> There are two widely used indexes, both from Roche (one with and >> one without an XML manifest, magic bytes .mft and .srt). They are >> both just a simple table of the reads names and offsets, sorted >> alphabetically. > > Yeah, that's what I got from the BioPython code. I didn't know it > was sorted though (it doesn't make much sense either, unless they > wanted to do a binary search or something). I presume that's what Roche uses if they keep the index on disk. The alternative is to load the index into RAM, which is really fast. You just open the SFF, read the header, seek to the index, load the index. Without the index, you have to scan the entire SFF file to find each record and its offset - which is much slower. >> This works pretty well for rapid lookup for SFF files >> (because the read count is not so high), and is pretty easy. > > It's implemented in Bio::SFF 0.003. 
I did restructure my code into two > readers though, since doing sequential and random-access in the class > didn't make much sense code-wise. > >> I don't think anyone used the hash table style indexes (.hsh), which >> I assume was a proof of principle or trial in the early days of SFF. > > I see, too bad. > >> One thing to check is what Ion Torrent's SFF files use. I would >> guess they've followed Roche, but I don't know. After all, the >> index structure is not defined in the SFF specification - it was >> left extensible on purpose. > > Yeah, we should check that too. I don't have any Ion Torrent data first hand, and the public samples I've seen were FASTQ not SFF. But I know a few people with Ion Torrent machines that might be able to help... > It's added to 0.003. The lack of tests was bothering me, but the > SFFs I had at hand were not suitable. Have you looked at the sample SFF data in Biopython? Please use them for the BioPerl unit tests (we're been talking about a cross project collection of test data files like this), the README file should be self-explanatory: https://github.com/biopython/biopython/tree/master/Tests/Roche Peter From p.j.a.cock at googlemail.com Mon Dec 19 10:13:53 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 15:13:53 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> References: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> Message-ID: On Mon, Dec 19, 2011 at 3:03 PM, Adam Witney wrote: >> I don't have any Ion Torrent data first hand, and the public >> samples I've seen were FASTQ not SFF. But I know a few >> people with Ion Torrent machines that might be able to help? > > I can you let you have some Ion Torrent SFF files if it helps > > adam Hi Adam, I've just had a quick look at a file from an IonTorrent 314 chip that a colleague kindly sent me, and that SFF file had no index (but only 50k reads so this isn't so important). 
If you can send me (and Leon?) one or two original SFF files, that would be useful, even if just to confirm that Ion Torrent's SFF files do indeed typically lack an index. If that is the case, I may need to remove the warning message Biopython currently prints when indexing these files: "No SFF index, doing it the slow way". Off list is fine if you'd like to keep the data private; use dropbox or something if you don't have an FTP server. Thanks, Peter From awitney at sgul.ac.uk Mon Dec 19 10:03:16 2011 From: awitney at sgul.ac.uk (Adam Witney) Date: Mon, 19 Dec 2011 15:03:16 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> >>> One thing to check is what Ion Torrent's SFF files use. I would >>> guess they've followed Roche, but I don't know. After all, the >>> index structure is not defined in the SFF specification - it was >>> left extensible on purpose. >> >> Yeah, we should check that too. > > I don't have any Ion Torrent data first hand, and the public > samples I've seen were FASTQ not SFF. But I know a few > people with Ion Torrent machines that might be able to help... I can let you have some Ion Torrent SFF files if it helps adam From l.m.timmermans at students.uu.nl Mon Dec 19 10:48:34 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 16:48:34 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote: > I presume that's what Roche uses if they keep the index on disk. > > The alternative is to load the index into RAM, which is really fast. > You just open the SFF, read the header, seek to the index, load > the index. Without the index, you have to scan the entire SFF file > to find each record and its offset - which is much slower. > That's what I'm doing now. It's much faster, but it still takes a noticeable amount of time on large files.
> Have you looked at the sample SFF data in Biopython? Please > use them for the BioPerl unit tests (we've been talking about a > cross project collection of test data files like this), the README > file should be self-explanatory: > https://github.com/biopython/biopython/tree/master/Tests/Roche > Yeah, I'm using those now (https://github.com/Leont/bio-sff/blob/master/t/reader.t). I must say there were some interesting corner cases in it. Leon From p.j.a.cock at googlemail.com Mon Dec 19 11:15:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 16:15:15 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 3:48 PM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote: > >> Have you looked at the sample SFF data in Biopython? Please >> use them for the BioPerl unit tests (we've been talking about a >> cross project collection of test data files like this), the README >> file should be self-explanatory: >> https://github.com/biopython/biopython/tree/master/Tests/Roche > > Yeah, I'm using those now > (https://github.com/Leont/bio-sff/blob/master/t/reader.t). Could you add a link to your /corpus/README.txt file pointing back to the Biopython original for acknowledgement and future reference? > > I must say there were some interesting corner cases in it. > I'm glad you agree - and if you can think of any more special cases to verify that would be great. Are you doing just SFF parsing for now? Not writing? Now, as to Bio::SeqIO integration, Biopython's SeqIO uses format name "sff" to mean the full read sequence (with mixed case, upper case for the good sequence, lower case for any left/right clipping - as in the Roche tools), and "sff-trim" to mean the trimmed sequences. I would encourage you to do the same, as part of the general aim of having consistent sequence format names between BioPerl, Biopython, and EMBOSS, where possible.
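The difference between the two format names can be sketched in a few lines of Perl. This only illustrates the case convention described above; the helper name, the 1-based inclusive clip positions and the choice to treat 0 as "no clipping" are assumptions for the example, not code from Bio::SFF or Biopython.

```perl
use strict;
use warnings;

# Given the called bases and 1-based, inclusive clip positions, return the
# "sff"-style view (clipped ends in lower case, good region in upper case)
# and the "sff-trim"-style view (good region only).
# Assumption: a clip value of 0 means "no clipping on that side".
sub sff_views {
    my ($bases, $clip_left, $clip_right) = @_;
    $clip_left  ||= 1;
    $clip_right ||= length $bases;
    my $head = substr $bases, 0, $clip_left - 1;
    my $good = substr $bases, $clip_left - 1, $clip_right - $clip_left + 1;
    my $tail = substr $bases, $clip_right;
    return (lc($head) . uc($good) . lc($tail), uc $good);
}

# Made-up read: a 4-base key, 8 good bases, 2 low-quality bases at the end.
my ($full, $trimmed) = sff_views('TCAGACGTACGTNN', 5, 12);
print "sff:      $full\n";
print "sff-trim: $trimmed\n";
```

A Bio::SeqIO wrapper could then pick one view or the other depending on which format name the caller asked for.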
Peter From l.m.timmermans at students.uu.nl Mon Dec 19 11:47:41 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 17:47:41 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock wrote: > Could you a link to your /corpus/README.txt file pointing > back to the Biopython original for acknowledgement and > future reference? > I forgot about that, I will add it to the next release. Are you doing just SFF parsing for now? Not writing? > I haven't written the writer yet (haven't needed it so far). I'd rather release working code early instead of waiting until everything is complete. Now, as to Bio::SeqIO integration, Biopython's SeqIO uses > format name "sff" to mean the full read sequence (with mixed > case, upper case for the good sequence, lower cases for any > left/right clipping - as in the Roche tools), and "sff-trim" to mean > the trimmed sequences. I would encourage you to do the > same, as part of the general aim of having consistent > sequence format names between BioPerl, Biopython, and > EMBOSS, where possible. > I agree, consistency is good. Leon From p.j.a.cock at googlemail.com Mon Dec 19 12:00:03 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 17:00:03 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock > wrote: >> >> Could you a link to your /corpus/README.txt file pointing >> back to the Biopython original for acknowledgement and >> future reference? > > I forgot about that, I will add it to the next release. Thanks. >> Are you doing just SFF parsing for now? Not writing? > > > I haven't written the writer yet (haven't needed it so far). I'd rather > release working code early instead of waiting until everything is complete. 
I understand - but make sure you've designed the data structures in the parser so as to allow the original record to be re-built as SFF. >> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses >> format name "sff" to mean the full read sequence (with mixed >> case, upper case for the good sequence, lower case for any >> left/right clipping - as in the Roche tools), and "sff-trim" to mean >> the trimmed sequences. I would encourage you to do the >> same, as part of the general aim of having consistent >> sequence format names between BioPerl, Biopython, and >> EMBOSS, where possible. > > I agree, consistency is good. Great. I'd guess Bio::SeqIO integration would be more important than SFF output initially. Peter From cjfields at illinois.edu Mon Dec 19 14:44:22 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 19 Dec 2011 19:44:22 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: , Message-ID: Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best. Barring that, a very simple class for storing data. We've found BioPerl objects/classes pretty heavy. (For an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing.) Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used. For instance, if someone wanted a faster parser, use the low level, otherwise use the higher level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and put the bp- and non-bp code in separate dists.
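As a rough sketch of the kind of low-level layer described here, the snippet below parses FASTA into plain hash references and leaves any object construction to the caller. FASTA is used only because it is short to show; the function name and record keys are invented for the example and are not part of BioPerl or readfq.

```perl
use strict;
use warnings;

# Return a closure that yields one { id, desc, seq } hash ref per FASTA
# record, or nothing at end of input. No sequence objects are created here.
sub fasta_reader {
    my ($fh) = @_;
    my $pending;    # header line that was read while finishing the previous record
    return sub {
        my $header = defined $pending ? $pending : <$fh>;
        undef $pending;
        return unless defined $header;
        chomp $header;
        my ($id, $desc) = $header =~ /^>(\S*)\s*(.*)$/
            or die "expected a FASTA header, got: $header\n";
        my $seq = '';
        while (defined(my $line = <$fh>)) {
            chomp $line;
            if ($line =~ /^>/) { $pending = $line; last }
            $seq .= $line;
        }
        return { id => $id, desc => $desc, seq => $seq };
    };
}

# Demonstration on an in-memory file with made-up records.
my $fasta = ">seq1 first example\nACGT\nACGT\n>seq2\nGGCC\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
my $next = fasta_reader($fh);
while (my $rec = $next->()) {
    printf "%s\t%d\n", $rec->{id}, length $rec->{seq};
}
```

A higher-level, BioPerl-specific layer could wrap each hash in a sequence object only when the caller actually needs one, which keeps the fast path free of object construction.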
Chris Sent from my iPad On Dec 19, 2011, at 11:05 AM, "Peter Cock" wrote: > On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans > wrote: >> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock >> wrote: >>> >>> Could you a link to your /corpus/README.txt file pointing >>> back to the Biopython original for acknowledgement and >>> future reference? >> >> I forgot about that, I will add it to the next release. > > Thanks. > >>> Are you doing just SFF parsing for now? Not writing? >> >> >> I haven't written the writer yet (haven't needed it so far). I'd rather >> release working code early instead of waiting until everything is complete. > > I understand - but make sure you've designed the data structures > in the parser so as to allow the original record to be re-built as SFF. > >>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses >>> format name "sff" to mean the full read sequence (with mixed >>> case, upper case for the good sequence, lower cases for any >>> left/right clipping - as in the Roche tools), and "sff-trim" to mean >>> the trimmed sequences. I would encourage you to do the >>> same, as part of the general aim of having consistent >>> sequence format names between BioPerl, Biopython, and >>> EMBOSS, where possible. >> >> I agree, consistency is good. > > Great. I'd guess Bio::SeqIO integration would be more important > that SFF output initially. 
> > Peter > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Mon Dec 19 19:28:25 2011 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 19 Dec 2011 18:28:25 -0600 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: <4EEFD6A9.3010303@illinois.edu> On 12/19/2011 10:47 AM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 5:15 PM, Peter Cockwrote: > >> Could you a link to your /corpus/README.txt file pointing >> back to the Biopython original for acknowledgement and >> future reference? >> > I forgot about that, I will add it to the next release. > > Are you doing just SFF parsing for now? Not writing? > I haven't written the writer yet (haven't needed it so far). I'd rather > release working code early instead of waiting until everything is complete. > > Now, as to Bio::SeqIO integration, Biopython's SeqIO uses >> format name "sff" to mean the full read sequence (with mixed >> case, upper case for the good sequence, lower cases for any >> left/right clipping - as in the Roche tools), and "sff-trim" to mean >> the trimmed sequences. I would encourage you to do the >> same, as part of the general aim of having consistent >> sequence format names between BioPerl, Biopython, and >> EMBOSS, where possible. >> > I agree, consistency is good. > > Leon This is already implemented in Bio::SeqIO I believe. This is the same line of thinking with the FASTQ format, that one can have a 'format-variant' combination that (as one might guess) indicates to the parser any variation of the parser so logic within the parser can deal with it. You can also pass the '-variant => "foo"' parameter as well IIRC. You would just check the variant with the variant() method. 
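The 'format-variant' idea amounts to splitting the requested format name at the first hyphen. The sketch below shows only that splitting step, on the format names mentioned in this thread; the helper is hypothetical and is not Bio::SeqIO's own code, which should be consulted for the actual behaviour.

```perl
use strict;
use warnings;

# Split a "format-variant" name into its two parts; the variant is undef
# when the name has no hyphen.
sub split_format {
    my ($name) = @_;
    my ($format, $variant) = split /-/, lc $name, 2;
    return ($format, $variant);
}

for my $name (qw(sff sff-trim fastq-illumina)) {
    my ($format, $variant) = split_format($name);
    printf "%-15s format=%s variant=%s\n",
        $name, $format, defined $variant ? $variant : '(none)';
}
```

A parser written this way would branch on the variant in one place, for example to decide whether to return the full or the trimmed read.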
chris From l.m.timmermans at students.uu.nl Tue Dec 20 10:25:13 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Tue, 20 Dec 2011 16:25:13 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock wrote: > I understand - but make sure you've designed the data structures > in the parser so as to allow the original record to be re-built as SFF. > I did, though currently it's rather hard to make new entries from scratch. That said, I can hardly imagine anyone wanting to do this. > Great. I'd guess Bio::SeqIO integration would be more important > than SFF output initially. > Probably. It looks like it's quite easy, it's just rather underdocumented. Leon From l.m.timmermans at students.uu.nl Tue Dec 20 10:26:11 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Tue, 20 Dec 2011 16:26:11 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <cjfields at illinois.edu> wrote: > Kinda joining this a little late, but I think if there is a way to have a > low-level parser/writer that generically parses the data into simple > (possibly hash-tagged) data structures, that would be best. Barring that, > a very simple class for storing data. We've found BioPerl objects/classes > pretty heavy. > > (For an example of this, see Heng Li's readfq parser on github, which has > some stats for Fastq/fasta parsing.) > > Any way we can separate the parser from object instantiation would enable > us to optimize the object/class layer and parser/writer layers separately, > with the possible nice side effect of making the parser more broadly used. > > For instance, if someone wanted a faster parser, use the low level, > otherwise use the higher level (possibly BioPerl-specific) API.
Lincoln > does this to a certain degree with Bio-samtools; I would go further and > put the bp- and non-bp code in separate dists. > A good OO system can actually help make things faster. For example, I'm unpacking the flowspace and quality data lazily, which made scanning through an SFF file 2.5-3 times as fast while having marginal extra costs when you do need them. Leon From l.m.timmermans at students.uu.nl Tue Dec 20 10:30:54 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Tue, 20 Dec 2011 16:30:54 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: <4EEFD6A9.3010303@illinois.edu> References: <4EEFD6A9.3010303@illinois.edu> Message-ID: On Tue, Dec 20, 2011 at 1:28 AM, Chris Fields wrote: > This is already implemented in Bio::SeqIO I believe. This is the same > line of thinking with the FASTQ format, that one can have a > 'format-variant' combination that (as one might guess) indicates to the > parser any variation of the parser so logic within the parser can deal with > it. You can also pass the '-variant => "foo"' parameter as well IIRC. You > would just check the variant with the variant() method. > Great. That makes life much easier :-) Leon From p.j.a.cock at googlemail.com Tue Dec 20 10:31:59 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 20 Dec 2011 15:31:59 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Tue, Dec 20, 2011 at 3:25 PM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock > wrote: >> >> I understand - but make sure you've designed the data structures >> in the parser so as to allow the original record to be re-built as SFF. > > I did, though currently it's rather hard to make new entries from scratch. > That said, I can hardly imagine anyone wanting to do this. Typical use cases I've found in using the Biopython SFF code are filtering an SFF file (taking some records only), and modifying the clipping values.
In both cases, the user isn't creating the SFF records from scratch. Peter From cjfields at illinois.edu Tue Dec 20 17:40:31 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 20 Dec 2011 22:40:31 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: , Message-ID: On Dec 20, 2011, at 9:26 AM, "Leon Timmermans" wrote: On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J wrote: Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best. Barring that, a very simple class for storing data. We've found BioPerl objects/classes pretty heavy. (for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing). Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used. For instance, if someone wanted a faster parser, use the low level, otherwise use the higher level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and make the bp- and non-bp code in separate dists. A good OO system can actually help make things faster. For example, I'm unpacking the flowspace and quality data lazily, which made scanning through an SFF file 2.5-3 times as fast while having marginal extra costs when you do need them. Leon Yep, thinking about using the same approach for the Fastq variants.
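The lazy-unpacking idea mentioned above can be illustrated with a minimal sketch in plain Perl. This is not the Bio::SFF implementation: the class name LazyRead, its fields, and the one-byte-per-base quality encoding are all assumptions made for the example. The point is only that the packed bytes are kept as-is and decoded on first access, so scanning records by name costs nothing extra.

```perl
use strict;
use warnings;

# Hypothetical record class (not part of Bio::SFF or BioPerl): it keeps
# the packed quality bytes untouched and unpacks them on first access.
package LazyRead;

sub new {
    my ($class, %args) = @_;
    my $self = {
        name         => $args{name},
        raw_quality  => $args{raw_quality},   # packed bytes, one per base (assumed)
        quality      => undef,                # filled in lazily
        unpack_count => 0,                    # how often unpack actually ran
    };
    return bless $self, $class;
}

sub name { $_[0]{name} }

# Decode the quality values only when somebody asks for them,
# then cache the result for later calls.
sub quality {
    my ($self) = @_;
    if (!defined $self->{quality}) {
        $self->{quality} = [ unpack 'C*', $self->{raw_quality} ];
        $self->{unpack_count}++;
    }
    return $self->{quality};
}

sub unpack_count { $_[0]{unpack_count} }

package main;

my $read = LazyRead->new(name => 'read1', raw_quality => pack('C*', 30, 31, 40));
print $read->name, "\n";                      # no unpacking happens here
print join(' ', @{ $read->quality }), "\n";   # unpacked on first access
print join(' ', @{ $read->quality }), "\n";   # served from the cache
```

The same pattern would apply to any field that is expensive to decode and not needed by every caller.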
Chris Sent from my ancient iPad b/c my laptop's borked From dgacquer at ulb.ac.be Wed Dec 21 08:26:07 2011 From: dgacquer at ulb.ac.be (David Gacquer) Date: Wed, 21 Dec 2011 14:26:07 +0100 Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta Message-ID: <4EF1DE6F.4070508@ulb.ac.be> Dear BioPerl users/developers, I am facing a strange issue with the $seq_out->write_seq function when using large fasta files I have downloaded the hg19 chromosome 1, and applied the following code (basically I wanted to mask some regions in it but the problem also appears when copying the sequence without modifications): sub main{ my $seq_in = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0]); my $seq_out = Bio::SeqIO->new( -format => 'largefasta', -file => '>'.$ARGV[1]); my $seq_obj_in = $seq_in->next_seq(); my $modified_seq = $seq_obj_in->seq(); my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq => $modified_seq, -id => $seq_obj_in->id, -desc => $seq_obj_in->desc); $seq_out->write_seq($seq_obj_out); } when checking the output fasta file, the sequence of chr1 is 1-bp shorter. I have noticed that in the original fasta file, each line contains exactly 50 nucleotides, while the output of the $seq_out->write_seq function contains always 60 characters per line. chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that the very last base was missing, I created the following fasta files: chr121.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAG chr122.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAG They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last character being a G. 
When running the above code: chr121.out.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA chr122.out.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AG The output for the 122 bp chromosome is correct (2 lines of 60 bp and the last line with 2 bp, AG) but for the 121 bp chromosome, the last character is missing (2 lines of 60 bp only, last G is missing). When replacing -format => 'largefasta' by -format => 'fasta' or writing the output without the write_seq function however, the problem is solved. Am I missing something or is there a problem with the write_seq function used with large fasta files? (I am running BioPerl on a Mac under OS X Snow Leopard) Best regards David -- David Gacquer, Ph. D. IRIBHM - Universite Libre de Bruxelles Bldg C, room C.4.117 ULB, Campus Erasme, CP602 808 route de Lennik B-1070 Brussels Belgium Phone: +32-2-555 4187 Fax: +32-2-555 4655 E-mail: dgacquer at ulb.ac.be From koraydogankaya at gmail.com Sat Dec 24 03:44:43 2011 From: koraydogankaya at gmail.com (Koray) Date: Sat, 24 Dec 2011 00:44:43 -0800 (PST) Subject: [Bioperl-l] exons Message-ID: <8454dd72-4fed-41c5-977e-83d9300cd68b@z25g2000vbs.googlegroups.com> I need an explicit code for getting exon sequences of an mrna or gene fetched by get_Seq_by_acc or id. in ensembl it is easy but here it is not easy many ios exists. for example: here how can i get such a $gene object from DBs (GeneBank or EntrezGene) by acc numberor ids? exons code prev next Top Title : exons() Usage : @exons = $gene->exons(); @inital_exons = $gene->exons('Initial'); Function: Get all exon features or all exons of a specified type of this gene structure. Exon type is treated as a case-insensitive regular expression and optional. For consistency, use only the following types: initial, internal, terminal, utr, utr5prime, and utr3prime. 
A special and virtual type is 'coding', which refers to all types except utr. This method basically merges the exons returned by transcripts. Returns : An array of Bio::SeqFeature::Gene::ExonI implementing objects. Args : An optional string specifying the type of exon. From challa_ghanashyam at yahoo.com Sat Dec 24 15:09:09 2011 From: challa_ghanashyam at yahoo.com (GSC) Date: Sat, 24 Dec 2011 12:09:09 -0800 (PST) Subject: [Bioperl-l] re trieve description for a list of gi ids.. Message-ID: <33034438.post@talk.nabble.com> Hi all: I am new to perl. I am working on a script to retrieve the record description (name given for a sequence record in genbank) for a list of gi ids. the script works fine for 1000 ids but my list is about 250,000 ids long and it is not working for me. Any suggestions on this. GS -- View this message in context: http://old.nabble.com/retrieve-description-for-a-list-of-gi-ids..-tp33034438p33034438.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From cjfields at illinois.edu Tue Dec 27 10:03:28 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 27 Dec 2011 15:03:28 +0000 Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta In-Reply-To: <4EF1DE6F.4070508@ulb.ac.be> References: <4EF1DE6F.4070508@ulb.ac.be> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0BB29547@CITESMBX4.ad.uillinois.edu> This is a strange one. Personally I haven't seen this behavior, but that maybe it's OS-dependent? We'll need more information, particularly what version of BioPerl you are using, the OS, version of perl, etc. Also, in general to make sure we don't lose track of this issue it is best to submit a bug report: https://redmine.open-bio.org/projects/bioperl I'm planning on triaging bugs next week, I could take a look then. 
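One way to narrow down a report like this is to confirm the truncation independently of line width. The following is a minimal sketch in plain Perl (no BioPerl calls), assuming ordinary FASTA text; the function name fasta_lengths and the script name in the usage comment are hypothetical. It counts residues per record so the input and output files can be compared directly.

```perl
use strict;
use warnings;

# Count residues per record in FASTA text, ignoring how the lines are wrapped.
# Takes an open filehandle; returns a hash reference: record id => residue count.
sub fasta_lengths {
    my ($fh) = @_;
    my (%length, $id);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;              # strip Unix or Windows line endings
        if ($line =~ /^>(\S+)/) {
            $id = $1;                      # first word of the header is the id
            $length{$id} = 0;
        }
        elsif (defined $id) {
            $line =~ s/\s+//g;             # ignore stray whitespace
            $length{$id} += length $line;
        }
    }
    return \%length;
}

# Usage (hypothetical script name): perl fasta_lengths.pl input.fa output.fa
if (@ARGV == 2) {
    my @lengths;
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        push @lengths, fasta_lengths($fh);
        close $fh;
    }
    # Print id, length in the first file, length in the second file.
    for my $id (sort keys %{ $lengths[0] }) {
        my $in  = $lengths[0]{$id};
        my $out = $lengths[1]{$id} // 'missing';
        print "$id\t$in\t$out\n";
    }
}
```

Running it on a small test input and the corresponding written output would show whether a base was lost, regardless of the 50 versus 60 characters per line.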
chris ________________________________________ From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of David Gacquer [dgacquer at ulb.ac.be] Sent: Wednesday, December 21, 2011 7:26 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta Dear BioPerl users/developers, I am facing a strange issue with the $seq_out->write_seq function when using large fasta files I have downloaded the hg19 chromosome 1, and applied the following code (basically I wanted to mask some regions in it but the problem also appears when copying the sequence without modifications): sub main{ my $seq_in = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0]); my $seq_out = Bio::SeqIO->new( -format => 'largefasta', -file => '>'.$ARGV[1]); my $seq_obj_in = $seq_in->next_seq(); my $modified_seq = $seq_obj_in->seq(); my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq => $modified_seq, -id => $seq_obj_in->id, -desc => $seq_obj_in->desc); $seq_out->write_seq($seq_obj_out); } when checking the output fasta file, the sequence of chr1 is 1-bp shorter. I have noticed that in the original fasta file, each line contains exactly 50 nucleotides, while the output of the $seq_out->write_seq function contains always 60 characters per line. chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that the very last base was missing, I created the following fasta files: chr121.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAG chr122.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAG They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last character being a G. 
When running the above code: chr121.out.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA chr122.out.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AG The output for the 122 bp chromosome is correct (2 lines of 60 bp and the last line with 2 bp, AG) but for the 121 bp chromosome, the last character is missing (2 lines of 60 bp only, last G is missing). When replacing -format => 'largefasta' by -format => 'fasta' or writing the output without the write_seq function however, the problem is solved. Am I missing something or is there a problem with the write_seq function used with large fasta files? (I am running BioPerl on a Mac under OS X Snow Leopard) Best regards David -- David Gacquer, Ph. D. IRIBHM - Universite Libre de Bruxelles Bldg C, room C.4.117 ULB, Campus Erasme, CP602 808 route de Lennik B-1070 Brussels Belgium Phone: +32-2-555 4187 Fax: +32-2-555 4655 E-mail: dgacquer at ulb.ac.be _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From jdeuts01 at students.poly.edu Thu Dec 1 09:09:19 2011 From: jdeuts01 at students.poly.edu (jdeuts01 at students.poly.edu) Date: Thu, 1 Dec 2011 14:09:19 +0000 Subject: [Bioperl-l] question Message-ID: Dear Bioperl, This is my first experience with bioperl and I need help please. 1. The version of bioperl is 1.6.1 and have also installed Bundle-BioPerl 2.1.8 and mGen 1.03. I was unable to install Bribes and trouchelle DB. Will this prevent the BioPerl package from functioning correctly? 2. The operating platform is windows 7 - 64 bit using ActiveState - Perl v5.12.2 3. 
The script is as follows: #!/usr/bin/perl # Write a script using OOP to write protein sequences to the file fasta.txt.use strict;use warnings;use Bio::SeqIO::fasta; # Declare and initialize input and output files.my $protein_fasta = "protein.fa";my $protein_out = ">fasta.txt"; # Setup objects for input and output.my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format => 'Fasta');my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'fasta'); # Establish while loop using "next_seq" method # to read in multiple sequences from "protein.fa" file# one by one until none were remaining.while(my $seq = $seq_in -> next_seq){ $seq_out->write_seq($seq);} The information is successfully written to the file: fasta.txt. 4. Receiving the following error messages: Replacement list is longer than search list at C:/Perl64/site/lib/Bio/Range.pm line 251.Subroutine _initialize redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 92.Subroutine next_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 112.Subroutine write_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 193.Subroutine width redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 272.Subroutine preferred_id_type redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 295. Thanks in advance for your help.John Deutsch From jboddu at illinois.edu Thu Dec 1 11:38:00 2011 From: jboddu at illinois.edu (Boddu, Jayanand) Date: Thu, 1 Dec 2011 16:38:00 +0000 Subject: [Bioperl-l] Chromosome coordinates Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu> Hello I am newbie to Perl scripts. I have a file with short reads mapped to the MAIZE genome The format is a simple BLASTN output. 
READ_ID Chr % Similarity Alignment Mismatches Gaps READ Start READ End Chr Start Chr End E Value Score READ1 chrPt 100 17 0 0 1 17 35021 35037 0.21 34.2 READ1 chr10 100 17 0 0 1 17 128587356 128587372 0.21 34.2 READ1 chr6 100 17 0 0 1 17 160769803 160769787 0.21 34.2 READ1 chr5 100 17 0 0 1 17 172103083 172103067 0.21 34.2 READ1 chr4 100 17 0 0 1 17 213173683 213173699 0.21 34.2 READ1 chr3 100 17 0 0 1 17 23689132 23689116 0.21 34.2 READ2 chr8 100 17 0 0 1 17 161048603 161048587 0.21 34.2 READ2 chr6 100 17 0 0 1 17 155768884 155768868 0.21 34.2 READ2 chr5 100 17 0 0 1 17 32958812 32958828 0.21 34.2 READ2 chr3 100 17 0 0 1 17 212451090 212451074 0.21 34.2 READ2 chr2 100 17 0 0 1 17 2046449 2046465 0.21 34.2 READ2 chr1 100 17 0 0 1 17 223233801 223233785 0.21 34.2 READ2 chr1 100 17 0 0 1 17 277573037 277573021 0.21 34.2 As expected the same read maps to multiple places on the same/different chromosome. I have a GFF file with annotated coordinates. I would like to run a PERL script to find out READS that are within the GENES in the GFF file and that are not. The anticipated script should; 1. Take the READ coordinates on the genome (by chromosome); 2. Go the GFF file; 3. Find the Chromosome; 4. Find the GENE (by coordinates); 5. and report READ-its coordinates-Chromosome-GENE-and its coordinates. It doesn't need to be in the same order. After this, I guess I could use simple Microsoft ACCESS query to pull out READS that are not mapped to the GENEs. I would greatly appreciate if anyone can has a script that more or less similar job. 
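The steps listed above can be sketched in plain Perl. This is only an illustration under stated assumptions: the BLAST table is tab-delimited with the twelve columns shown, the GFF file has the standard nine tab-separated columns with 'gene' in the third, chromosome names match between the two files, and a read counts as within a gene only when it is fully contained. The function names load_genes and genes_for_hit are hypothetical.

```perl
use strict;
use warnings;

# Read GFF lines and keep only 'gene' features, grouped by chromosome.
# Returns: { chromosome => [ [start, end, attributes], ... ] }
sub load_genes {
    my ($fh) = @_;
    my %genes;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;                       # skip comments and directives
        chomp $line;
        my @f = split /\t/, $line;
        next unless @f >= 9 && $f[2] eq 'gene';
        push @{ $genes{ $f[0] } }, [ $f[3], $f[4], $f[8] ];
    }
    return \%genes;
}

# Return every gene on $chr that fully contains the hit.
# Minus-strand hits are reported with start > end, so swap them first.
sub genes_for_hit {
    my ($genes, $chr, $start, $end) = @_;
    ($start, $end) = ($end, $start) if $start > $end;
    return grep { $_->[0] <= $start && $end <= $_->[1] }
           @{ $genes->{$chr} || [] };
}

# Usage (hypothetical script name): perl reads_in_genes.pl hits.tsv genes.gff
if (@ARGV == 2) {
    my ($blast_file, $gff_file) = @ARGV;

    open my $gff, '<', $gff_file or die "Cannot open $gff_file: $!";
    my $genes = load_genes($gff);
    close $gff;

    open my $blast, '<', $blast_file or die "Cannot open $blast_file: $!";
    while (my $line = <$blast>) {
        chomp $line;
        my @f = split /\t/, $line;
        # Columns 9 and 10 are "Chr Start" and "Chr End"; this also skips the header.
        next unless @f >= 10 && $f[8] =~ /^\d+$/ && $f[9] =~ /^\d+$/;
        my ($read, $chr, $start, $end) = @f[0, 1, 8, 9];
        my @hits = genes_for_hit($genes, $chr, $start, $end);
        if (@hits) {
            print join("\t", $read, $chr, $start, $end, @$_), "\n" for @hits;
        }
        else {
            print join("\t", $read, $chr, $start, $end, 'no_gene'), "\n";
        }
    }
    close $blast;
}
```

Each hit is compared against every gene on its chromosome, which is simple but slow for a large annotation; reads without a gene are printed with no_gene, so no second query step is needed to find them.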
Thanks Jay From scott at scottcain.net Thu Dec 1 11:59:56 2011 From: scott at scottcain.net (Scott Cain) Date: Thu, 1 Dec 2011 11:59:56 -0500 Subject: [Bioperl-l] Chromosome coordinates In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu> References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu> Message-ID: Hi Jay, Since the maize GFF file is likely to be fairly large, I would consider putting it in a database, using either Bio::DB::GFF if it is GFF2 or Bio::DB::SeqFeature::Store if it is gff3. Then you can use the methods that come along with either of those modules to search regions for for genes. They both support a get_features_by_location method, so you could get the range for each of the regions you want to look at, and check the database with that method to see if anything is there. Scott On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand wrote: > Hello > I am newbie to Perl scripts. > I have a file with short reads mapped to the MAIZE genome > The format is a simple BLASTN output. 
> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>
> As expected the same read maps to multiple places on the same/different > chromosome. > I have a GFF file with annotated coordinates. > I would like to run a PERL script to find out READS that are within the > GENES in the GFF file and that are not. > The anticipated script should; > > 1. Take the READ coordinates on the genome (by chromosome); > > 2. Go the GFF file; > > 3. Find the Chromosome; > > 4. Find the GENE (by coordinates); > > 5.
and report READ-its coordinates-Chromosome-GENE-and its > coordinates. > > It doesn't need to be in the same order. > After this, I guess I could use simple Microsoft ACCESS query to pull out > READS that are not mapped to the GENEs. > I would greatly appreciate if anyone can has a script that more or less > similar job. > > Thanks > Jay > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- ------------------------------------------------------------------------ Scott Cain, Ph. D. scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research From jason.stajich at gmail.com Thu Dec 1 12:31:29 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Thu, 1 Dec 2011 09:31:29 -0800 Subject: [Bioperl-l] Chromosome coordinates In-Reply-To: References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu> Message-ID: <470A08F7-E238-44E1-B0D1-69FDBC0BA2A3@gmail.com> You might try using BEDtools, intersectBED which can do a lot of what you are doing in simple command line program. Jason On Dec 1, 2011, at 8:59 AM, Scott Cain wrote: > Hi Jay, > > Since the maize GFF file is likely to be fairly large, I would consider > putting it in a database, using either Bio::DB::GFF if it is GFF2 or > Bio::DB::SeqFeature::Store if it is gff3. Then you can use the methods > that come along with either of those modules to search regions for for > genes. They both support a get_features_by_location method, so you could > get the range for each of the regions you want to look at, and check the > database with that method to see if anything is there. > > Scott > > > On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand wrote: > >> Hello >> I am newbie to Perl scripts. >> I have a file with short reads mapped to the MAIZE genome >> The format is a simple BLASTN output. 
>> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
>> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
>> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
>> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
>> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
>> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
>> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
>> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
>> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
>> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
>> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
>> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>>
>> As expected the same read maps to multiple places on the same/different >> chromosome.
>> I have a GFF file with annotated coordinates. >> I would like to run a PERL script to find out READS that are within the >> GENES in the GFF file and that are not. >> The anticipated script should; >> >> 1. Take the READ coordinates on the genome (by chromosome); >> >> 2. Go the GFF file; >> >> 3. Find the Chromosome; >> >> 4. Find the GENE (by coordinates); >> >> 5. and report READ-its coordinates-Chromosome-GENE-and its >> coordinates. >> >> It doesn't need to be in the same order. >> After this, I guess I could use simple Microsoft ACCESS query to pull out >> READS that are not mapped to the GENEs. >> I would greatly appreciate if anyone can has a script that more or less >> similar job. >> >> Thanks >> Jay >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > > -- > ------------------------------------------------------------------------ > Scott Cain, Ph. D. scott at scottcain dot > net > GMOD Coordinator (http://gmod.org/) 216-392-3087 > Ontario Institute for Cancer Research > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jovel_juan at hotmail.com Thu Dec 1 12:36:32 2011 From: jovel_juan at hotmail.com (Juan Jovel) Date: Thu, 1 Dec 2011 17:36:32 +0000 Subject: [Bioperl-l] Error when using SearchIO In-Reply-To: References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>, Message-ID: Hello Everybody! I have been running endless standalone Blasts these days, in two different clusters. The first one runs on Rock Linux, while the other one on Devian. In the Rock Linux one, when I use SeachIO to parse my blastp or blastx reports, I get invariably a message: "Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251" What it does mean? 
Would it have any effect on my parsing results? Thanks, JUAN From cjfields at illinois.edu Thu Dec 1 14:03:45 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 1 Dec 2011 19:03:45 +0000 Subject: [Bioperl-l] Error when using SearchIO In-Reply-To: References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>, Message-ID: <7233AC75-E03B-401A-8D0A-260ED76956A4@illinois.edu> On Dec 1, 2011, at 11:36 AM, Juan Jovel wrote: > Hello Everybody! > I have been running endless standalone Blasts these days, in two different clusters. The first one runs on Rock Linux, while the other one on Devian. In the Rock Linux one, when I use SeachIO to parse my blastp or blastx reports, I get invariably a message: > "Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251" > What it does mean? Would it have any effect on my parsing results? > Thanks, > JUAN This is a bug that was fixed, I think this was in the latest BioPerl release (1.6.901). There was a transliteration error that produced this warning, but otherwise it's harmless. The warning is perl version dependent, I think it pops up in perl 5.12 and up. This user post (about halfway down) runs into the same issue: http://www.sysarchitects.com/bioperl. chris From David.Messina at sbc.su.se Thu Dec 1 17:02:20 2011 From: David.Messina at sbc.su.se (Dave Messina) Date: Thu, 1 Dec 2011 23:02:20 +0100 Subject: [Bioperl-l] re trieving blast multiple alignment in fasta form In-Reply-To: <32886592.post@talk.nabble.com> References: <32886592.post@talk.nabble.com> Message-ID: Hi Eric, Wait, do you want multiple pairwise alignments in your output FASTA file, or a single multiple alignment of your query and all the hits? If the former, get_aln() will give you one pairwise alignment per hsp, but you'll need to move the output file creation statement (my $alnIO = ...) before the loops so it gets created only once. 
Then, when you do the write statement ($alnIO->write_aln($aln);), all of the alignments will go to the same file. If on the other hand you'd like to have a multiple alignment between a query and all of its hits, you'll have to take the IDs of the hits, pull the corresponding sequences out of the database, and then run a multiple alignment algorithm on them. Dave From scuoppo at gmail.com Fri Dec 2 17:50:28 2011 From: scuoppo at gmail.com (Claudio Scuoppo) Date: Fri, 2 Dec 2011 17:50:28 -0500 Subject: [Bioperl-l] List of genes from genomic intervals Message-ID: Hi, I am new to BioPerl. I was wondering what's the best strategy to get the genes contained in a series of human genomic intervals. Basically, I have a table with: Chromosome Start End Which module should I be looking at? Thanks, Claudio From awitney at sgul.ac.uk Mon Dec 5 06:09:39 2011 From: awitney at sgul.ac.uk (Adam Witney) Date: Mon, 5 Dec 2011 11:09:39 +0000 Subject: [Bioperl-l] Bio::Graphics imagemap and padding Message-ID: <44A27378-CC4B-4396-817E-AA31004847C7@sgul.ac.uk> Hi, Image maps seem to be out of position if you use padding in the Panel, like this: my $panel = Bio::Graphics::Panel->new( ... -pad_left => 20, -pad_right => 20 ... ); Without these options, the image map is fine. Is this a known issue? Also on a side note, I noticed that when using Bio::Graphics with Dancer, some of the CGI code was blocking somewhere (I found a reference to a similar problem with CGI and Catalyst), swapping CGI with HTML::Entities fixes it: sub create_web_map { ... eval "require HTML::Entities" unless HTML::Entities->can('encode_entities'); ... my $title = HTML::Entities::encode_entities($self->make_link($tr,$feature,1)); my $target = HTML::Entities::encode_entities($self->make_link($tgr,$feature,1)); ...
} Thanks Adam From momin.amin at gmail.com Mon Dec 5 18:00:23 2011 From: momin.amin at gmail.com (Amin Momin) Date: Mon, 5 Dec 2011 15:00:23 -0800 (PST) Subject: [Bioperl-l] SimpleAlign and consensus_string Message-ID: Hi , I am generating a consensus sequence by aligning two protein homologs using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to understand the criteria consensus_string() method of simpleAlign uses to determine the consensus at position with dissimilar aminoacids/ nucleotide. Also how would the % cutoffs provided to consensus_string() affect the outcome. Thanks, Amin From jason.stajich at gmail.com Mon Dec 5 18:58:59 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Mon, 5 Dec 2011 15:58:59 -0800 Subject: [Bioperl-l] SimpleAlign and consensus_string In-Reply-To: References: Message-ID: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com> There are several methods that do related things. Below is the documentation for the method - it allows a more strict calling besides majority rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string which is useful for DNA consensus generating patterns. If you want a summary as to whether or not there are only conservative amino acid changes in the column use match_line method which generates the similar string that CLUSTALW reports as a summary string for each column. =head2 consensus_string Title : consensus_string Usage : $str = $ali->consensus_string($threshold_percent) Function : Makes a strict consensus Returns : Consensus string Argument : Optional treshold ranging from 0 to 100. The consensus residue has to appear at least threshold % of the sequences at a given location, otherwise a '?' character will be placed at that location. 
(Default value = 0%) =cut On Dec 5, 2011, at 3:00 PM, Amin Momin wrote: > Hi , > > I am generating a consensus sequence by aligning two protein homologs > using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to > understand the criteria consensus_string() method of simpleAlign uses > to determine the consensus at position with dissimilar aminoacids/ > nucleotide. Also how would the % cutoffs provided to > consensus_string() affect the outcome. > > > Thanks, > Amin > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From wenbinmei at gmail.com Tue Dec 6 11:09:35 2011 From: wenbinmei at gmail.com (wenbin mei) Date: Tue, 6 Dec 2011 11:09:35 -0500 Subject: [Bioperl-l] revcom the multiple sequence alignment Message-ID: Hi, I have a question about revcom the multiple sequence alignment. One way I can do convert the format into fasta and revcom individual sequences. I wonder is there a easy way to convert the multiple sequence alignment as a whole. Thank you for help. -best, wenbin From jason.stajich at gmail.com Tue Dec 6 12:40:37 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Tue, 6 Dec 2011 09:40:37 -0800 Subject: [Bioperl-l] revcom the multiple sequence alignment In-Reply-To: References: Message-ID: I think this would work to update it in place though I haven't tried it myself for my $seq ( $aln->each_seq ) { $seq->seq( $seq->revcom->seq ); } $out->write_aln($aln); This may also work - not entirely sure if there is any extra work done on the meta data (start/end) of the Seq object when this is done. You may want to flip start/end for the sequences (the seqs are Bio::LocatableSeq objects) explicitly if not. Or you may not care about those data and can ignore. $seq = $seq->revcom Jason On Dec 6, 2011, at 8:09 AM, wenbin mei wrote: > Hi, > > I have a question about revcom the multiple sequence alignment. 
One way I > can do convert the format into fasta and revcom individual sequences. I > wonder is there a easy way to convert the multiple sequence alignment as a > whole. Thank you for help. > > -best, > wenbin > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From wenbinmei at gmail.com Tue Dec 6 12:51:18 2011 From: wenbinmei at gmail.com (wenbin mei) Date: Tue, 6 Dec 2011 12:51:18 -0500 Subject: [Bioperl-l] revcom the multiple sequence alignment In-Reply-To: References: Message-ID: I think I might not explain clearly my questions. I extract the individual gene alignment from the whole genome alignment. Since some gene are on the reverse strand, I want to revcom the gene alignment. There is part of my scripts. I can read the strand information from another file. my $newstart = $refseq->column_from_residue_number($start); my $newend = $refseq->column_from_residue_number($end); $seq{$genename} = $aln->slice($newstart, $newend); Any suggestion to help me revcom some gene alignment on the minus strand is helpful. Thank you. -best, wenbin On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich wrote: > I think this would work to update it in place though I haven't tried it > myself > > for my $seq ( $aln->each_seq ) { > $seq->seq( $seq->revcom->seq ); > } > $out->write_aln($aln); > > This may also work - not entirely sure if there is any extra work done on > the meta data (start/end) of the Seq object when this is done. You may > want to flip start/end for the sequences (the seqs are Bio::LocatableSeq > objects) explicitly if not. Or you may not care about those data and can > ignore. > > $seq = $seq->revcom > > Jason > On Dec 6, 2011, at 8:09 AM, wenbin mei wrote: > > > Hi, > > > > I have a question about revcom the multiple sequence alignment. One way I > > can do convert the format into fasta and revcom individual sequences. 
I
> > wonder is there a easy way to convert the multiple sequence alignment as
> a
> > whole. Thank you for help.
> >
> > -best,
> > wenbin
>

--
Wenbin Mei
Ph.D. Student
Dr. Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu

From kellert at ohsu.edu Tue Dec 6 13:21:39 2011
From: kellert at ohsu.edu (Tom Keller)
Date: Tue, 6 Dec 2011 10:21:39 -0800
Subject: [Bioperl-l] Bioperl-l Digest, Vol 104, Issue 3
In-Reply-To:
References:
Message-ID:

I'd start by looking for the section "Searching for genes in genomic DNA" in the HOWTO:Beginners - BioPerl website.

Thomas (Tom) Keller, PhD
kellert at ohsu.edu
503.494.2442
6588 R Jones Hall (BSc/CROET)
MMI DNA Services
Member of OHSU Shared Resources

On Dec 3, 2011, at 9:00 AM, wrote:

> Message: 1
> Date: Fri, 2 Dec 2011 17:50:28 -0500
> From: Claudio Scuoppo
> Subject: [Bioperl-l] List of genes from genomic intervals
> To: bioperl-l at lists.open-bio.org
>
> Hi,
>
> I am new to BioPerl. I was wondering what`s the best strategy to get
> the genes contained in a a series of human genomic interval.
> Basically, I have a table with:
>
> Chromosome Start End
>
> Which module should I be looking at?
> Thanks,
> Claudio

From wenbinmei at gmail.com Tue Dec 6 17:54:51 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 17:54:51 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To:
References:
Message-ID:

Figured out! Thanks for help.

-best,
wenbin

--
Wenbin Mei
Ph.D. Student
Dr.
Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu

From momin.amin at gmail.com Tue Dec 6 12:37:16 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Tue, 6 Dec 2011 11:37:16 -0600
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
References: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
Message-ID:

Thanks Jason

On Mon, Dec 5, 2011 at 5:58 PM, Jason Stajich wrote:

> There are several methods that do related things.
>
> Below is the documentation for the method - it allows a more strict calling besides majority rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string which is useful for DNA consensus generating patterns.
>
> If you want a summary as to whether or not there are only conservative amino acid changes in the column use match_line method which generates the similar string that CLUSTALW reports as a summary string for each column.
>
> =head2 consensus_string
>
>  Title     : consensus_string
>  Usage     : $str = $ali->consensus_string($threshold_percent)
>  Function  : Makes a strict consensus
>  Returns   : Consensus string
>  Argument  : Optional treshold ranging from 0 to 100.
>              The consensus residue has to appear at least threshold %
>              of the sequences at a given location, otherwise a '?'
>              character will be placed at that location.
>              (Default value = 0%)
>
> =cut
>
> On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:
>
>> Hi ,
>>
>> I am generating a consensus sequence by aligning two protein homologs
>> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
>> understand the criteria consensus_string() method of simpleAlign uses
>> to determine the consensus at position with dissimilar aminoacids/
>> nucleotide.
Also how would the % cutoffs provided to
>> consensus_string() affect the outcome.
>>
>>
>> Thanks,
>> Amin
>

From sunwukong at potc.net Wed Dec 7 14:05:20 2011
From: sunwukong at potc.net (sunwukong)
Date: Wed, 07 Dec 2011 11:05:20 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
Message-ID: <4EDFB8F0.8080001@potc.net>

I am not a medical professional, but I have two DNA-related questions.

A year or so ago I realized that if the standard building blocks of DNA are the four nucleotide bases G, A, T and C, then they could be represented as a base 4 number system (e.g., 0, 1, 2 and 3). Then any life form could be represented by a number (it would be very long). So I set out on a quest to do this with a small life form. For fun I chose the Spanish Flu, which I believe I found on an NIH site. Then I realized that there was no standard, and I did not know whether the number would be built with the most significant digit on the left or the right.

1. Is there a standard method for representing the A, T, C and G bases as numbers, e.g.
   g = 0
   a = 1
   t = 2
   c = 3

2. Is the sequence read left to right or right to left?

Note: It may be biologically significant if the right values are assigned to the letters GATC; there could be a pattern somewhere that holds significant information. One idea might be to look at DNA sequences in bases other than 4 to see if something jumps out.
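
The base-4 encoding asked about above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the mapping g=0, a=1, t=2, c=3 is the one proposed in the question and is not a standard, the helper names dna_to_base4 and base4_to_int are made up for this example, and the leftmost base is arbitrarily treated as the most significant digit.

```perl
use strict;
use warnings;
use Math::BigInt;    # core module; real genomes overflow native integers

# Assumed mapping, copied from the question; it is not a standard.
my %digit = ( g => 0, a => 1, t => 2, c => 3 );

# Hypothetical helper: turn a DNA string into a string of base-4 digits,
# reading left to right. Dies on anything other than g/a/t/c
# (for example N or other IUPAC ambiguity codes).
sub dna_to_base4 {
    my ($dna) = @_;
    my $out = '';
    for my $base ( split //, lc $dna ) {
        die "unexpected base '$base'\n" unless exists $digit{$base};
        $out .= $digit{$base};
    }
    return $out;
}

# Hypothetical helper: read the digit string as one integer,
# with the leftmost digit as the most significant one (an assumption).
sub base4_to_int {
    my ($digits) = @_;
    my $n = Math::BigInt->new(0);
    for my $d ( split //, $digits ) {
        $n->bmul(4)->badd($d);    # n = n * 4 + d, modified in place
    }
    return $n;
}

my $digits = dna_to_base4('GATTACA');
print "$digits\n";                       # prints 0122131
print base4_to_int($digits), "\n";       # prints 1693
```

Note that with this mapping any leading g becomes a leading zero and disappears from the integer, so the number alone does not convert back to the original sequence unless the sequence length is stored as well.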
http://www.insectscience.org/2.10/ref/fig5a.gif

VR
Pat Kirol
509 442-2214

From Russell.Smithies at agresearch.co.nz Wed Dec 7 16:59:18 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 10:59:18 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <4EDFB8F0.8080001@potc.net>
References: <4EDFB8F0.8080001@potc.net>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>

I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve.
I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions? Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes?

But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery.

But don't let this stop you uncovering the great secret hidden in our genes :-)

On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of sunwukong
> Sent: Thursday, 8 December 2011 8:05 a.m.
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] DNA Sequencing two questions
>
> I am not a medical professional but I have two DNA related questions.
From jason.stajich at gmail.com Wed Dec 7 17:53:10 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 7 Dec 2011 14:53:10 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
References: <4EDFB8F0.8080001@potc.net> <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
Message-ID: <9BDA2BFA-264F-45B2-9234-A6BC9402FBA8@gmail.com>

For other fun picture games -- You can look at patterns of motifs/words in a chaos game representation of genomes.

http://mbe.oxfordjournals.org/content/16/10/1391.long
http://mbe.oxfordjournals.org/content/20/6/901.long

On Dec 7, 2011, at 1:59 PM, Smithies, Russell wrote:

> I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve.
> I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions? Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
> Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes?
>
> But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery.
From Russell.Smithies at agresearch.co.nz Wed Dec 7 19:29:47 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 13:29:47 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
References: <4EDFB8F0.8080001@potc.net> <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF245@exchsth.agresearch.co.nz>

I tried again and came up with this:
http://www.bioperl.org/w/images/7/7a/Autostereogram.png

If you look carefully, you can see the answer to life, the universe, and everything!!

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> Sent: Thursday, 8 December 2011 10:59 a.m.
From lumos.lumos.lumos at gmail.com Fri Dec 9 11:47:36 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Fri, 9 Dec 2011 08:47:36 -0800
Subject: [Bioperl-l] Mouse->Human homologues ?
Message-ID:

Hello,

Is there a way to get the human homologues for a list of mouse genes, with all the human gene symbols as text output?

Thank you
LM

From cjfields at illinois.edu Fri Dec 9 12:17:20 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 9 Dec 2011 17:17:20 +0000
Subject: [Bioperl-l] Mouse->Human homologues ?
In-Reply-To:
References:
Message-ID:

There are lots of databases that have this capability (Ensembl, OrthoDB, HomoloGene, OMA, to name only a few). Have you tried a simple search for this, or did you want expert opinion on the matter?

chris

PS - Just to note, there is a lot of controversy swirling about re: the ortholog conjecture, and some recently published papers call it into question using human-mouse data. Worth a look if you're going down this path, to know the current situation. If you have access to F1000, see the following (the paper itself is open :)

Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. F1000.com/12462957

On Dec 9, 2011, at 10:47 AM, lumos lumos wrote:

> Hello,
>
> Is there a way to get human homologues for a mouse gene list where I get
> all human genes(symbols) as text output ?
>
> Thank you
> LM

From lumos.lumos.lumos at gmail.com Fri Dec 9 12:29:24 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Fri, 9 Dec 2011 09:29:24 -0800
Subject: [Bioperl-l] Mouse->Human homologues ?
In-Reply-To:
References:
Message-ID:

Hi Chris,

Thanks for your reply. I wanted to know if there is any way to do it automatically with a Perl script, for a list of mouse genes whose human homologues I require.

LM

On Fri, Dec 9, 2011 at 9:17 AM, Fields, Christopher J wrote:

> There are lots of databases that have this capability (ensembl, orthodb,
> homologene, oma, to name only a few). Have you tried a simple search for
> this, or did you want expert opinion on the matter?
>
> chris
>
> PS - Just to note, there is a lot of controversy swirling about re: the
> ortholog conjecture and some recently published papers calling it into
> question using human-mouse data, worth a look if you're trotting this path
> to know the current situation. If you have access to F1000, see the
> following (paper itself is open :)
>
> Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al.
> Testing the ortholog conjecture with comparative functional genomic data
> from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi:
> 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011.
> F1000.com/12462957
>
> On Dec 9, 2011, at 10:47 AM, lumos lumos wrote:
>
> > Hello,
> >
> > Is there a way to get human homologues for a mouse gene list where I get
> > all human genes(symbols) as text output ?
> >
> > Thank you
> > LM

From lumos.lumos.lumos at gmail.com Wed Dec 7 23:47:19 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Wed, 7 Dec 2011 20:47:19 -0800
Subject: [Bioperl-l] Perl parsing
Message-ID:

Hello,

I have a text file (tab-delimited) with some gene names as shown below.

BRCA1: breast cancer 1, early onset

TNF: tumor necrosis factor

OMG: oligodendrocyte myelin glycoprotein

I would like to get the list of gene names (BRCA1, TNF, OMG), i.e. the part before the colon (:).
How do I parse this text file in Perl to get this list of genes?

Thanks in advance.
LM

From b.m.forde at umail.ucc.ie Fri Dec 9 11:52:56 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Fri, 9 Dec 2011 08:52:56 -0800 (PST)
Subject: [Bioperl-l] Genbank files
Message-ID: <32941955.post@talk.nabble.com>

Hello all,

I am new to BioPerl, so I apologise if this is a stupid question.

For CDS features I wish to add additional qualifiers, e.g. /colour and /note qualifiers. I have looked at the BioPerl wiki but am still unsure how to do this.

regards

Brian
--
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From jboddu at illinois.edu Fri Dec 9 14:59:39 2011
From: jboddu at illinois.edu (Boddu, Jayanand)
Date: Fri, 9 Dec 2011 19:59:39 +0000
Subject: [Bioperl-l] Batch processing of Data
Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>

Hi Anyone:
Please let me know if the following is practical with Perl.
My data output can be described as follows.

1. Hundreds of samples are run.

2. A batch output sends data from each sample to its own "folder". Output is in the form of a few text files, spreadsheets and PDF files.

3. One of the spreadsheets has the data of most interest.

4.
This means I end up having hundreds of folders.

5. The spreadsheet with the data has multiple worksheets, of which a couple have the interesting data to be processed (please find attached a spreadsheet output showing how the data is organized; the worksheets of interest are named "Compound" and "Peak", and the yellow highlighted columns in each worksheet hold the data to be processed).

OK. That's a long description.
NOW. Is it practical to write a Perl (or any other) script to:

1. Enter each folder.

2. Look for the spreadsheet of interest.

3. Look for worksheets named "Compound" and "Peak".

4. Look for the specific columns of interest.

5. Copy the columns of interest into a new spreadsheet/text file, with the data from each folder next to each other.

This final spreadsheet will pass through a bunch of other calculations.

I apologize for this long and painful description.
However, it would be great if this can be done.
Thanks
Jay
-------------- next part --------------
A non-text attachment was scrubbed...
Name: REPORT01.xls
Type: application/vnd.ms-excel
Size: 93696 bytes
Desc: REPORT01.xls
URL:

From cjfields at illinois.edu Fri Dec 9 15:37:48 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 9 Dec 2011 20:37:48 +0000
Subject: [Bioperl-l] Perl parsing
In-Reply-To:
References:
Message-ID:

On Dec 7, 2011, at 10:47 PM, lumos lumos wrote:

> Hello,
>
> I have a text file(tab-delim) with some gene names as shown below.
>
> *BRCA1: breast cancer 1, early onset
>
> TNF: tumor necrosis factor
>
> OMG: oligodendrocyte myelin glycoprotein*
>
> I would like to get the list of gene name BRCA1,TNF,OMG that is before the
> colon(:) .
> How do I parse in perl this text file with this list of genes?

'Very carefully?'

Okay, I'll try to refrain from further sarcasm, but I'm confused, what does this have to do with BioPerl (*the toolkit*) specifically? That is what this mailing list is for.

Just to note, this is a very common perl task.
The answer is attainable by searching for it (not to mention taking the time to learn basic perl). For instance:

http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings

One of the many links found by simply using Google:

http://lmgtfy.com/?q=perl+parse+tab+file

I'll leave the regex munging to you.

(okay, I failed at refraining from sarcasm, ah well it's friday).

chris

> Thanks in advance.
> LM

From jason.stajich at gmail.com Fri Dec 9 16:18:38 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Fri, 9 Dec 2011 13:18:38 -0800
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32941955.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
Message-ID: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>

    $feature->add_tag_value('color','blue');

On Dec 9, 2011, at 8:52 AM, BForde wrote:

>
> Hello all,
>
> I am new to Bioperl so I apologise if this is stupid question.
>
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
>
> regards
>
> Brian

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org

From bosborne11 at verizon.net Fri Dec 9 15:31:15 2011
From: bosborne11 at verizon.net (Brian Osborne)
Date: Fri, 09 Dec 2011 15:31:15 -0500
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32941955.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
Message-ID: <3AB0716B-EC07-4BD6-9FC8-0C47A29FC0BA@verizon.net>

Brian,

Reasonable question.
Start here:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

If you've never used Bioperl then:

http://www.bioperl.org/wiki/HOWTO:Beginners

Brian

On Dec 9, 2011, at 11:52 AM, BForde wrote:

>
> Hello all,
>
> I am new to Bioperl so I apologise if this is stupid question.
>
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
>
> regards
>
> Brian

From asjo at koldfront.dk Fri Dec 9 17:25:00 2011
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Fri, 09 Dec 2011 23:25:00 +0100
Subject: [Bioperl-l] Batch processing of Data
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <871usdpemb.fsf@topper.koldfront.dk>

On Fri, 9 Dec 2011 19:59:39 +0000, Boddu, wrote:

> Please let me know if the following is practical with PERL.

It might very well be, yes.

Modules you might be interested in include Spreadsheet::ParseExcel, Spreadsheet::XLSX, Spreadsheet::WriteExcel and Excel::Writer::XLSX¹.

A big help in finding interesting CPAN modules is the search engine on https://metacpan.org/

Depending on your platform and preference using find(1) might also be helpful to traverse the folders, rather than doing so in Perl.

Note that none of this has anything to do with BioPerl as such, though, and you'll need to do some actual programming to get the job done.

Best regards,

Adam

¹ http://blogs.perl.org/users/john_mcnamara/2011/10/spreadsheetwriteexcel-is-dead-long-live-excelwriterxlsx.html

--
"Angels can fly because they take themselves lightly."
Adam Sjøgren
asjo at koldfront.dk

From David.Messina at sbc.su.se Fri Dec 9 17:30:23 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Fri, 9 Dec 2011 23:30:23 +0100
Subject: [Bioperl-l] Batch processing of Data
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID:

Yes, it can be done. However, it has nothing to do with this mailing list.

Steps 1 and 2 are basic Perl. For steps 3 through 5, try googling "perl parse excel".

Dave

On Fri, Dec 9, 2011 at 20:59, Boddu, Jayanand wrote:

> Hi Anyone:
> Please let me know if the following is practical with PERL.
> My data output can be described as following.
>
> 1. Hundreds of samples are run.
>
> 2. A batch output sends data from each sample to its own "folder".
> Output is in the form of few text files, spreadsheets and PDF files.
>
> 3. One of the spreadsheet has the data of most interest.
>
> 4. This means I end up having hundreds of folders.
>
> 5. The spreadsheet with the data has multiple worksheets out of
> which a couple have the interesting data to be processed (Please find
> attached a spreadsheet output in which the data is organized and the
> worksheets of my interest are named as "Compound" and "Peak". Yellow
> high-lighted columns in each worksheet has the data to be processed).
> OK. That's long description.
> NOW. Is it practical to write a PERL/or any script to;
>
> 1. Enter each folder.
>
> 2. Look for the spreadsheet of interest.
>
> 3. Look for worksheets named "Compound" and "Peak".
>
> 4. Look for the specific columns of interest.
>
> 5. Copy paste the columns of interest into a new spreadsheet/text
> file with data from each folder next to each other.
>
> This final spreadsheet will pass through a bunch of other calculations.
>
> I apologize for this long and painful description.
> However, it would be great if this can be done.
> Thanks > Jay > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From lsbrath at gmail.com Sat Dec 10 16:39:44 2011 From: lsbrath at gmail.com (Mgavi Brathwaite) Date: Sat, 10 Dec 2011 16:39:44 -0500 Subject: [Bioperl-l] Perl parsing In-Reply-To: References: Message-ID: Yes grasshopper you have to suffer a little bit. Learn Perl first, then step up to BioPerl. Chris I feel you concerning the power of Regex, and the sarcasm. Lom On Fri, Dec 9, 2011 at 3:37 PM, Fields, Christopher J wrote: > On Dec 7, 2011, at 10:47 PM, lumos lumos wrote: > > > Hello, > > > > I have a text file(tab-delim) with some gene names as shown below. > > > > *BRCA1: breast cancer 1, early onset > > > > TNF: tumor necrosis factor > > > > OMG: oligodendrocyte myelin glycoprotein* > > > > I would like to get the list of gene name BRCA1,TNF,OMG that is before > the > > colon(:) . > > How do I parse in perl this text file with this list of genes? > > 'Very carefully?' > > Okay, I'll try to refrain from further sarcasm, but I'm confused, what > does this have to do with BioPerl (*the toolkit*) specifically? That is > what this mailing list is for. > > Just to note, this is a very common perl task. The answer is attainable by > searching for it (not to mention taking the time to learn basic perl). For > instance: > > > http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings > > One of the many links found by simply using Google: > > http://lmgtfy.com/?q=perl+parse+tab+file > > I'll leave the regex munging to you. > > (okay, I failed at refraining from sarcasm, ah well it's friday). > > chris > > > > Thanks in advance. 
> > LM > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From pawan.mani2 at gmail.com Mon Dec 5 17:00:09 2011 From: pawan.mani2 at gmail.com (pawan.mani2 at gmail.com) Date: Tue, 6 Dec 2011 03:30:09 +0530 Subject: [Bioperl-l] bioperl in cygwin Message-ID: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC> Hi I would like to after the givibg following commands in cgwin terminal: perl -MCPAN -e shell then I type o conf prerequisites_policy follow o conf commit install Bundle::CPAN install Module::Build d /bioperl/ then we you get a list of different versions. I selected CJFIELDS/BioPerl-1.6.1.96 install CJFIELDS/BioPerl-1.6.1.96.tar.gz but build.install was not ok. Kindly allow me to know the step of complete bioprl installatin cygwin undre windows 7. thanks in advanced. with best regards, Pawan From cjfields at illinois.edu Sun Dec 11 13:22:01 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sun, 11 Dec 2011 18:22:01 +0000 Subject: [Bioperl-l] bioperl in cygwin In-Reply-To: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC> References: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC> Message-ID: Pawan, Hard to say what the problem is w/o supplying warnings/errors. Prior to doing this, you should try installing BioPerl-1.6.901 (the latest CPAN release). You can try a direct installation of the distribution, but the easiest way to get the latest version is to try installing Bio::Perl. (I'm not sure what BioPerl-1.6.1.96 is, but this seems wrong) chris On Dec 5, 2011, at 4:00 PM, wrote: > Hi > I would like to after the givibg following commands in cgwin terminal: > > > perl -MCPAN -e shell > > then I type > > o conf prerequisites_policy follow > o conf commit > install Bundle::CPAN > install Module::Build > d /bioperl/ > then we you get a list of different versions. 
> I selected CJFIELDS/BioPerl-1.6.1.96
> install CJFIELDS/BioPerl-1.6.1.96.tar.gz
>
>
> but build.install was not ok.
>
> Kindly allow me to know the step of complete bioprl installatin cygwin undre windows 7.
>
> thanks in advanced.
>
> with best regards,
> Pawan
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From b.m.forde at umail.ucc.ie Tue Dec 13 06:03:50 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Tue, 13 Dec 2011 03:03:50 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32965574.post@talk.nabble.com>

Thank you for the replies.

My script (below) reads in a list of locus_tags from a tab-delimited text
file, compares these locus_tags to the locus_tags in a GenBank file and,
where they are equal, adds new features.
The line
$feat->add_tag_value()
needs to be defined. In the BioPerl wiki this variable appears to be defined
by giving it coordinates etc. (creating a new feature). I wish to add
features to the CDS key when the locus_tags are identical. Is this possible?

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        ADD NEW FEATURES
                    }
                }
            }
        }
    }
}

The script works down as far as the comparison point where locus_tags in the
GenBank file "Contig100.gb" are compared against a list of locus_tags from a
delimited txt file.
regards Brian Jason Stajich-5 wrote: > > $feature->add_tag_value('color','blue'); > > On Dec 9, 2011, at 8:52 AM, BForde wrote: > >> >> Hello all, >> >> I am new to Bioperl so I apologise if this is stupid question. >> >> For CDS features I which to add additional qualifiers e.g. /colour and >> /note >> qualifiers. I have looked at the BioPerl wiki but am still unsure as how >> to >> do this? >> >> regards >> >> Brian >> -- >> View this message in context: >> http://old.nabble.com/Genbank-files-tp32941955p32941955.html >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Jason Stajich > jason.stajich at gmail.com > jason at bioperl.org > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32965574.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From roy.chaudhuri at gmail.com Tue Dec 13 06:52:05 2011 From: roy.chaudhuri at gmail.com (Roy Chaudhuri) Date: Tue, 13 Dec 2011 11:52:05 +0000 Subject: [Bioperl-l] Genbank files In-Reply-To: <32965574.post@talk.nabble.com> References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32965574.post@talk.nabble.com> Message-ID: <4EE73C65.1080101@gmail.com> Hi Brian, Just to check I have understood you, you want to read through a genbank file and add additional tags to features which are listed in a tab-delimited file of locus tags? Your code is on the right lines, but it would be much more efficient to read your tab-delimited locus_tags into a hash, and check using exists, rather than ploughing through the (potentially very long) list of locus tags every time. 
Also, be careful with new lines in your tab file (you can safely get rid of
them using "chomp"). You can miss out the "has_tag" check by using
"get_tagset_values" instead of "get_tag_values", since the former does not
complain if the tag is not present. Once you have modified your sequence
object, you need to write it out to a new file (or STDOUT) using Bio::SeqIO.

Also, just a couple of general points, you should always "use warnings"
(or even better "use warnings FATAL=>qw(all)") since that can help solve
many problems, and your code may be easier to read if you don't include the
word "object" in all your variable names (after all you wouldn't say you
write on a paper object using a pen object).

use strict;
use warnings FATAL=>qw(all);
use Bio::SeqIO;
open (my $list, 'list') or die $!;
my %V;
while (<$list>){
    chomp;
    $V{(split(/\t/, $_))[0]}=1;
}
my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;
for my $feat_object ($seq_object->remove_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        for my $V3 ($feat_object->get_tagset_values('locus_tag')){
            if (exists $V{$V3}){
                $feat_object->add_tag_value(listed_in_tab_file=>'yes');
                next;
            }
        }
    }
    $seq_object->add_SeqFeature($feat_object);
}
Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);

Hope this helps.
Cheers,
Roy.

On 13/12/2011 11:03, BForde wrote:
>
> Than you for the replies.
>
> My script (below) reads in a list of locus_tags from a tab delimited text
> file. Compares these locus_tags to the locus_tags in a genbank file and
> where they are equal adds new features.
> the line
> $feat->add_tag_value()
> needs to be defined. In the bioperl wiki this variable appears to be defined
> by giving it coordinates etc (creating a new feature). I wish to add
> features to CDS key when the locus_tags are identical. Is this possible?
> > use strict; > use Bio::SeqIO; > > my @V; > open (LIST1, 'list') ||die; > while (){ > push @V, (split(/\t/, $_))[0]; > } > close(LIST1); > > my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb"); > my $seq_object = $seqio_object->next_seq; > > for my $feat_object ($seq_object->get_SeqFeatures){ > if ($feat_object->primary_tag eq "CDS"){ > if ($feat_object->has_tag('locus_tag')){ > for my $V3 ($feat_object->get_tag_values('locus_tag')){ > for my $V1 (@V) { > if ($V1 eq $V3){ > ADD NEW FEATURES > > } > } > } > } > } > } > > The script works down as far as the comparison point where locus_tags in the > genbankfile "Contig100.gb" are compared against a list of locus_tags from a > delimited txt file. > > > regards > > Brian > > Jason Stajich-5 wrote: >> >> $feature->add_tag_value('color','blue'); >> >> On Dec 9, 2011, at 8:52 AM, BForde wrote: >> >>> >>> Hello all, >>> >>> I am new to Bioperl so I apologise if this is stupid question. >>> >>> For CDS features I which to add additional qualifiers e.g. /colour and >>> /note >>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how >>> to >>> do this? >>> >>> regards >>> >>> Brian >>> -- >>> View this message in context: >>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html >>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
>>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> Jason Stajich >> jason.stajich at gmail.com >> jason at bioperl.org >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > From b.m.forde at umail.ucc.ie Tue Dec 13 09:22:01 2011 From: b.m.forde at umail.ucc.ie (Brian Forde) Date: Tue, 13 Dec 2011 14:22:01 +0000 Subject: [Bioperl-l] Genbank files In-Reply-To: <4EE73C65.1080101@gmail.com> References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32965574.post@talk.nabble.com> <4EE73C65.1080101@gmail.com> Message-ID: Hi Roy, Thank you. That works perfectly. I have to confess that someone else told me to use hashes but I could not get them to work.. Thanks again regards Brian On Tue, Dec 13, 2011 at 11:52 AM, Roy Chaudhuri wrote: > Hi Brian, > > Just to check I have understood you, you want to read through a genbank > file and add additional tags to features which are listed in a > tab-delimited file of locus tags? > > Your code is on the right lines, but it would be much more efficient to > read your tab-delimited locus_tags into a hash, and check using exists, > rather than ploughing through the (potentially very long) list of locus > tags every time. Also, be careful with new lines in your tab file (you can > safely get rid of them using "chomp"). You can miss out the "has_tag" check > by using "get_tagset_values" instead of "get_tag_values", since the former > does not complain if the tag is not present. Once you have modified your > sequence object, you need to write it out to a new file (or STDOUT) using > Bio::SeqIO. 
>
> Also, just a couple of general points, you should always "use warnings"
> (or even better "use warnings FATAL=>qw(all)") since that can help solve
> many problems, and your code may be easier to read if you don't include the
> word "object" in all your variable names (after all you wouldn't say you
> write on a paper object using a pen object).
>
> use strict;
> use warnings FATAL=>qw(all);
> use Bio::SeqIO;
> open (my $list, 'list') or die $!;
> my %V;
> while (<$list>){
>     chomp;
>     $V{(split(/\t/, $_))[0]}=1;
> }
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
> for my $feat_object ($seq_object->remove_SeqFeatures){
>     if ($feat_object->primary_tag eq "CDS"){
>         for my $V3 ($feat_object->get_tagset_values('locus_tag')){
>             if (exists $V{$V3}){
>                 $feat_object->add_tag_value(listed_in_tab_file=>'yes');
>                 next;
>             }
>         }
>     }
>     $seq_object->add_SeqFeature($feat_object);
> }
> Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);
>
> Hope this helps.
> Cheers,
> Roy.
>
>
> On 13/12/2011 11:03, BForde wrote:
>
>>
>> Than you for the replies.
>>
>> My script (below) reads in a list of locus_tags from a tab delimited text
>> file. Compares these locus_tags to the locus_tags in a genbank file and
>> where they are equal adds new features.
>> the line
>> $feat->add_tag_value()
>> needs to be defined. In the bioperl wiki this variable appears to be
>> defined
>> by giving it coordinates etc (creating a new feature). I wish to add
>> features to CDS key when the locus_tags are identical. Is this possible?
>>
>> use strict;
>> use Bio::SeqIO;
>>
>> my @V;
>> open (LIST1, 'list') ||die;
>> while (<LIST1>){
>>     push @V, (split(/\t/, $_))[0];
>> }
>> close(LIST1);
>>
>> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
>> my $seq_object = $seqio_object->next_seq;
>>
>> for my $feat_object ($seq_object->get_SeqFeatures){
>>     if ($feat_object->primary_tag eq "CDS"){
>>         if ($feat_object->has_tag('locus_tag')){
>>             for my $V3 ($feat_object->get_tag_values('locus_tag')){
>>                 for my $V1 (@V) {
>>                     if ($V1 eq $V3){
>>                         ADD NEW FEATURES
>>                     }
>>                 }
>>             }
>>         }
>>     }
>> }
>>
>> The script works down as far as the comparison point where locus_tags in
>> the genbankfile "Contig100.gb" are compared against a list of locus_tags
>> from a delimited txt file.
>>
>>
>> regards
>>
>> Brian
>>
>> Jason Stajich-5 wrote:
>>
>>>
>>> $feature->add_tag_value('color','blue');
>>>
>>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>>
>>>
>>>> Hello all,
>>>>
>>>> I am new to Bioperl so I apologise if this is stupid question.
>>>>
>>>> For CDS features I which to add additional qualifiers e.g. /colour and
>>>> /note qualifiers. I have looked at the BioPerl wiki but am still unsure
>>>> as how to do this?
>>>>
>>>> regards
>>>>
>>>> Brian
>>>> --
>>>> View this message in context:
>>>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> Jason Stajich
>>> jason.stajich at gmail.com
>>> jason at bioperl.org
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>
>

--
Brian Forde
Microbiology Dept.
Bioscience Institute.
Room 4.11
University College Cork
Cork
Ireland
tel:+353 21 4901306
email: b.m.forde at umail.ucc.ie

From b.m.forde at umail.ucc.ie Mon Dec 12 12:20:53 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Mon, 12 Dec 2011 09:20:53 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32959999.post@talk.nabble.com>

Thank you for the replies.

I am unsure as to how to use the line below with my script. My script so far
reads:

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        ADD NEW FEATURES
                    }
                }
            }
        }
    }
}

The script works down as far as the comparison point where locus_tags in the
GenBank file "Contig100.gb" are compared against a list of locus_tags from a
delimited txt file.
If possible, could you show me how to amend my script so I can add new
features?

regards

Brian

Jason Stajich-5 wrote:
>
> $feature->add_tag_value('color','blue');
>
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>
>>
>> Hello all,
>>
>> I am new to Bioperl so I apologise if this is stupid question.
>>
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>>
>> regards
>>
>> Brian
>> --
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

--
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32959999.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From Russell.Smithies at agresearch.co.nz Tue Dec 13 22:17:02 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Wed, 14 Dec 2011 16:17:02 +1300
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32959999.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32959999.post@talk.nabble.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF27A@exchsth.agresearch.co.nz>

Something like this:

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        #ADD NEW FEATURES
                        $feat_object->add_tag_value('color','blue');
                    }
                }
            }
        }
    }
}

#write the new annotations
my $io = Bio::SeqIO->new(-format => "genbank", -file => ">new.gb" );
$io->write_seq($seq_object);

Take another look at http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Building_Your_Own_Sequences

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of BForde
> Sent: Tuesday, 13 December 2011 6:21
a.m. > To: Bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Genbank files > > > Than you for the replies. > > I am unsure as to how to use the line below with my script. My script so far > reads > > use strict; > use Bio::SeqIO; > > my @V; > open (LIST1, 'list') ||die; > while (){ > push @V, (split(/\t/, $_))[0]; > } > close(LIST1); > > my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb"); > my $seq_object = $seqio_object->next_seq; > > for my $feat_object ($seq_object->get_SeqFeatures){ > if ($feat_object->primary_tag eq "CDS"){ > if ($feat_object->has_tag('locus_tag')){ > for my $V3 ($feat_object->get_tag_values('locus_tag')){ > for my $V1 (@V) { > if ($V1 eq $V3){ > ADD NEW FEATURES > > } > } > } > } > } > } > > The script works down as far as the comparison point where locus_tags in the > genbankfile "Contig100.gb" are compared against a list of locus_tags from a > delimited txt file. > I possbile could you show me how to amend my script so I can add new > features > > regards > > Brian > > Jason Stajich-5 wrote: > > > > $feature->add_tag_value('color','blue'); > > > > On Dec 9, 2011, at 8:52 AM, BForde wrote: > > > >> > >> Hello all, > >> > >> I am new to Bioperl so I apologise if this is stupid question. > >> > >> For CDS features I which to add additional qualifiers e.g. /colour > >> and /note qualifiers. I have looked at the BioPerl wiki but am still > >> unsure as how to do this? > >> > >> regards > >> > >> Brian > >> -- > >> View this message in context: > >> http://old.nabble.com/Genbank-files-tp32941955p32941955.html > >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
> >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > Jason Stajich > > jason.stajich at gmail.com > > jason at bioperl.org > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > -- > View this message in context: http://old.nabble.com/Genbank-files- > tp32941955p32959999.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From l.m.timmermans at students.uu.nl Wed Dec 14 10:43:24 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Wed, 14 Dec 2011 16:43:24 +0100 Subject: [Bioperl-l] Announcing Bio::SFF Message-ID: Hi all, As already mentioned on IRC, I recently wrote a SFF parser and uploaded it to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time to write one I'd be most grateful. 
Leon From p.j.a.cock at googlemail.com Wed Dec 14 11:03:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 14 Dec 2011 16:03:05 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans wrote: > Hi all, > > As already mentioned on IRC, I recently wrote a SFF parser and uploaded it > to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF > entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time > to write one I'd be most grateful. > > Leon Hi Leon, Have you looked at the index block at all, in order to offer random access by read ID, or to access the Roche XML manifest? Please ask if you need more information about this - or if you can read Python: https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py Is this building on Miguel Pignatelli's work? I don't recall seeing any follow up posts from him after this one: http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html Peter From cjfields at illinois.edu Wed Dec 14 11:12:58 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 14 Dec 2011 16:12:58 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: , Message-ID: <3BBC1132-E768-45D9-A107-ACD51791722D@illinois.edu> Leon, Nice! Definitely a good idea to have the lower-level parser and the BioPerl-bridging code separate, one of my concerns with the various parsers we have right now which hard-wire BioPerl classes in with the parser (makes it hard for optimization). Chris PS- Peter, I don't think the two projects are related, but I suppose Leon is the best to answer that. Sent from my stupid iPad, now my laptop's on the fritz On Dec 14, 2011, at 10:04 AM, "Peter Cock" wrote: > On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans > wrote: >> Hi all, >> >> As already mentioned on IRC, I recently wrote a SFF parser and uploaded it >> to CPAN as Bio::SFF. 
I haven't written a Bio::SeqIO::sff wrapper yet (SFF >> entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time >> to write one I'd be most grateful. >> >> Leon > > Hi Leon, > > Have you looked at the index block at all, in order to offer random > access by read ID, or to access the Roche XML manifest? Please > ask if you need more information about this - or if you can read Python: > https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py > > Is this building on Miguel Pignatelli's work? I don't recall seeing > any follow up posts from him after this one: > http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html > > Peter > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From l.m.timmermans at students.uu.nl Wed Dec 14 11:27:58 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Wed, 14 Dec 2011 17:27:58 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock wrote: > Hi Leon, > > Have you looked at the index block at all, in order to offer random > access by read ID, or to access the Roche XML manifest? Please > ask if you need more information about this - or if you can read Python: > https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py > I have looked at it, but not implemented it yet. There is no standardized index, and the ones that are in common use either seem stupid (the Roche index, which is essentially just a weirdly formatted sequential list, though that should still be faster than a table scan) or undocumented (hash based index). Is this building on Miguel Pignatelli's work? I don't recall seeing > any follow up posts from him after this one: > http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html > It isn't. I like his idea for reusing BioPython's test files though. 
Leon

From p.j.a.cock at googlemail.com Wed Dec 14 11:44:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 14 Dec 2011 16:44:28 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Wed, Dec 14, 2011 at 4:27 PM, Leon Timmermans wrote:
> On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock
> wrote:
>>
>> Hi Leon,
>>
>> Have you looked at the index block at all, in order to offer random
>> access by read ID, or to access the Roche XML manifest? Please
>> ask if you need more information about this - or if you can read Python:
>> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>
> I have looked at it, but not implemented it yet. There is no standardized
> index, and the ones that are in common use either seem stupid (the Roche
> index, which is essentially just a weirdly formatted sequential list, though
> that should still be faster than a table scan) or undocumented (hash based
> index).

There are two widely used indexes, both from Roche (one with and
one without an XML manifest, magic bytes .mft and .srt). They are
both just a simple table of the reads names and offsets, sorted
alphabetically. This works pretty well for rapid lookup for SFF files
(because the read count is not so high), and is pretty easy.

I don't think anyone used the hash table style indexes (.hsh), which
I assume was a proof of principle or trial in the early days of SFF.

One thing to check is what Ion Torrent's SFF files use. I would
guess they've followed Roche, but I don't know. After all, the
index structure is not defined in the SFF specification - it was
left extensible on purpose.

>> Is this building on Miguel Pignatelli's work? I don't recall seeing
>> any follow up posts from him after this one:
>> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>
> It isn't. I like his idea for reusing BioPython's test files though.

Yes, please do.
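
[Editorial sketch, not part of the original message.] The lookup idea described above, a table of read names and offsets sorted alphabetically, can be illustrated with a binary search over an in-memory table. This is only a sketch under stated assumptions: the read names and byte offsets are invented, find_offset is a hypothetical helper, and nothing here parses the actual binary layout of the Roche .mft/.srt index blocks or uses the Bio::SFF API.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: [read name, byte offset] pairs, sorted by
# read name. The names and offsets are invented for illustration only.
my @index = (
    [ 'READ_A',  840 ],
    [ 'READ_C', 2312 ],
    [ 'READ_F', 3960 ],
    [ 'READ_K', 5104 ],
);

# Binary search over the sorted table. Returns the byte offset for the
# given read name, or undef if the name is not in the index.
sub find_offset {
    my ($index, $name) = @_;
    my ($lo, $hi) = (0, $#{$index});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $name cmp $index->[$mid][0];
        return $index->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return;
}

for my $name (qw(READ_F READ_Z)) {
    my $offset = find_offset(\@index, $name);
    print defined $offset
        ? "$name starts at byte $offset\n"
        : "$name is not in the index\n";
}
```

A real reader would then seek to the returned offset in the SFF file and parse one read record from there. If the whole index fits in RAM, loading it into a plain hash keyed by read name is likely simpler still.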
Peter

From gingerplum at gmail.com Wed Dec 14 00:18:55 2011
From: gingerplum at gmail.com (plum ginger)
Date: Tue, 13 Dec 2011 21:18:55 -0800 (PST)
Subject: [Bioperl-l] a problem about BLAST
Message-ID: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>

Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I
need run BLAST on more than one sequences. However the blast outfile
only store the result of last sequence. How to make the outfile store
all results?

Wish your help. Thanks very much!


Best regards

From jason.stajich at gmail.com Thu Dec 15 12:02:47 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 15 Dec 2011 11:02:47 -0600
Subject: [Bioperl-l] a problem about BLAST
In-Reply-To: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>
References: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>
Message-ID: <58E5B487-7FF0-4018-B109-D6595DC2E493@gmail.com>

you are probably setting the outfile in each parsing iteration -- you need to show your code if you want someone to help you debug the problem.

On Dec 13, 2011, at 11:18 PM, plum ginger wrote:

> Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I
> need run BLAST on more than one sequences. However the blast outfile
> only store the result of last sequence. How to make the outfile store
> all results?
>
> Wish your help. Thanks very much!
>
>
> Best regards
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org

From pengyu.ut at gmail.com Fri Dec 16 17:10:27 2011
From: pengyu.ut at gmail.com (Peng Yu)
Date: Fri, 16 Dec 2011 16:10:27 -0600
Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment?
Message-ID:

Hi,

Bio::Das::segment can give me the following warnings without stopping
the whole program when the position for the query doesn't exist. I
could test the return result and quit when it is []. But this would
mean my program needs a test whenever I call segment. I'm wondering if
there is an automatic way to let Bio::Das::segment stop in such cases.

--------------------- WARNING ---------------------
MSG: Sequence is not dna or rna, but []. Attempting to revcom, but unsure if this is right
---------------------------------------------------

--
Regards,
Peng

From cjfields at illinois.edu Fri Dec 16 21:48:07 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 17 Dec 2011 02:48:07 +0000
Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment?
In-Reply-To:
References:
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0881CF8E@CITESMBX4.ad.uillinois.edu>

Setting verbosity to 2 should convert warnings to exceptions. IIRC, set '-verbose => 2' in the Bio::Das constructor, set '$das->verbose(2)' explicitly, or set the env variable BIOPERLDEBUG=2.

chris
________________________________________
From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of Peng Yu [pengyu.ut at gmail.com]
Sent: Friday, December 16, 2011 4:10 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment?

Hi,

Bio::Das::segment can give me the following warnings without stopping
the whole program when the position for the query doesn't exist. I
could test the return result and quit when it is []. But this would
cause my program have an test whenever I call segment. I'm wondering if
there is an automatic way to let Bio::Das::segment stop in such cases.

--------------------- WARNING ---------------------
MSG: Sequence is not dna or rna, but [].
Attempting to revcom, but unsure if this is right --------------------------------------------------- -- Regards, Peng _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From anna.fr at gmail.com Mon Dec 19 02:09:15 2011 From: anna.fr at gmail.com (Anna Friedlander) Date: Mon, 19 Dec 2011 20:09:15 +1300 Subject: [Bioperl-l] StandAloneBlastPlus blastdbcmd question Message-ID: Hi all I have a question about using blastdbcmd via Bio::Tools::Run::StandAloneBlastPlus I have some Blast+ search results that I am manipulating in a perl programme, and I would like to retrieve some sequence information for some results using subject sequence IDs, and associated subject start and end indices. If I was using blastdbcmd directly, I would do so using the -entry and -range options. My question is, can I use all the blastdbcmd options (or more specifically, just the -entry and -range options) from within the StandAloneBlastPlus module? My apologies if I don't properly understand how this "wrapper" works! Thanks in advance for your help Anna Friedlander From l.m.timmermans at students.uu.nl Mon Dec 19 09:19:14 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 15:19:14 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock wrote: > There are two widely used indexes, both from Roche (one with and > one without an XML manifest, magic bytes .mft and .srt). They are > both just a simple table of the reads names and offsets, sorted > alphabetically. Yeah, that's what I got from the BioPython code. I didn't know it was sorted though (it doesn't make much sense either, unless they wanted to do a binary search or something). This works pretty well for rapid lookup for SFF files > (because the read count is not so high), and is pretty easy. > It's implemented in Bio::SFF 0.003. 
I did restructure my code into two readers though, since doing sequential and random-access in the class didn't make much sense code-wise. I don't think anyone used the hash table style indexes (.hsh), which > I assume was a proof of principle or trial in the early days of SFF. > I see, too bad. > One thing to check is what Ion Torrent's SFF files use. I would > guess they've followed Roche, but I don't know. After all, the > index structure is not defined in the SFF specification - it was > left extensible on purpose. > Yeah, we should check that too. Yes, please do. > It's added to 0.003. The lack of tests was bothering me, but the SFFs I had at hand were not suitable. Leon From p.j.a.cock at googlemail.com Mon Dec 19 09:31:18 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 14:31:18 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 2:19 PM, Leon Timmermans wrote: > On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock wrote: > >> There are two widely used indexes, both from Roche (one with and >> one without an XML manifest, magic bytes .mft and .srt). They are >> both just a simple table of the reads names and offsets, sorted >> alphabetically. > > Yeah, that's what I got from the BioPython code. I didn't know it > was sorted though (it doesn't make much sense either, unless they > wanted to do a binary search or something). I presume that's what Roche uses if they keep the index on disk. The alternative is to load the index into RAM, which is really fast. You just open the SFF, read the header, seek to the index, load the index. Without the index, you have to scan the entire SFF file to find each record and its offset - which is much slower. >> This works pretty well for rapid lookup for SFF files >> (because the read count is not so high), and is pretty easy. > > It's implemented in Bio::SFF 0.003. 
I did restructure my code into two > readers though, since doing sequential and random-access in the class > didn't make much sense code-wise. > >> I don't think anyone used the hash table style indexes (.hsh), which >> I assume was a proof of principle or trial in the early days of SFF. > > I see, too bad. > >> One thing to check is what Ion Torrent's SFF files use. I would >> guess they've followed Roche, but I don't know. After all, the >> index structure is not defined in the SFF specification - it was >> left extensible on purpose. > > Yeah, we should check that too. I don't have any Ion Torrent data first hand, and the public samples I've seen were FASTQ not SFF. But I know a few people with Ion Torrent machines that might be able to help... > It's added to 0.003. The lack of tests was bothering me, but the > SFFs I had at hand were not suitable. Have you looked at the sample SFF data in Biopython? Please use them for the BioPerl unit tests (we're been talking about a cross project collection of test data files like this), the README file should be self-explanatory: https://github.com/biopython/biopython/tree/master/Tests/Roche Peter From p.j.a.cock at googlemail.com Mon Dec 19 10:13:53 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 15:13:53 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> References: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> Message-ID: On Mon, Dec 19, 2011 at 3:03 PM, Adam Witney wrote: >> I don't have any Ion Torrent data first hand, and the public >> samples I've seen were FASTQ not SFF. But I know a few >> people with Ion Torrent machines that might be able to help? > > I can you let you have some Ion Torrent SFF files if it helps > > adam Hi Adam, I've just had a quick look at a file from an IonTorrent 314 chip that a colleague kindly sent me, and that SFF file had no index (but only 50k reads so this isn't so important). 
If you can send me (and Leon?) one of two original SFF files that would be useful, even if just to confirm that Ion Torrent's SFF files do indeed typically lack an index. If that is the case, I may need to remove the warning message Biopython currently prints when indexing these files: No SFF index, doing it the slow way Off list is fine if you'd like to keep the data private, use dropbox or something if you don't have an FTP server. Thanks, Peter From awitney at sgul.ac.uk Mon Dec 19 10:03:16 2011 From: awitney at sgul.ac.uk (Adam Witney) Date: Mon, 19 Dec 2011 15:03:16 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> >>> One thing to check is what Ion Torrent's SFF files use. I would >>> guess they've followed Roche, but I don't know. After all, the >>> index structure is not defined in the SFF specification - it was >>> left extensible on purpose. >> >> Yeah, we should check that too. > > I don't have any Ion Torrent data first hand, and the public > samples I've seen were FASTQ not SFF. But I know a few > people with Ion Torrent machines that might be able to help? I can you let you have some Ion Torrent SFF files if it helps adam From l.m.timmermans at students.uu.nl Mon Dec 19 10:48:34 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 16:48:34 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote: > I presume that's what Roche uses if they keep the index on disk. > > The alternative is to load the index into RAM, which is really fast. > You just open the SFF, read the header, seek to the index, load > the index. Without the index, you have to scan the entire SFF file > to find each record and its offset - which is much slower. > That's what I'm doing now. It's much faster, but it still takes a noticeable amount of time on large files. 
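
As an aside on the index discussion above: once a name/offset table is held in RAM, a lookup can be a plain binary search over the sorted names. The following is only a minimal sketch of that idea, assuming the table has already been loaded; the read names, offsets and the `find_offset` helper are made up for illustration and are not taken from Bio::SFF or from the Roche index layout.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: [read name, byte offset] pairs, sorted
# by name. The names and offsets are invented for illustration only.
my @index = sort { $a->[0] cmp $b->[0] } (
    [ 'READ_C', 2048 ],
    [ 'READ_A', 840 ],
    [ 'READ_B', 1444 ],
);

# Binary search over the sorted table.
# Returns the byte offset for $name, or undef if the name is absent.
sub find_offset {
    my ($index, $name) = @_;
    my ($lo, $hi) = (0, $#{$index});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $name cmp $index->[$mid][0];
        return $index->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return;
}

my $offset = find_offset(\@index, 'READ_B');
print defined $offset ? "READ_B at offset $offset\n" : "READ_B not found\n";
```

A plain hash keyed on read name would do the same job with less code; the sorted table only matters if the index is searched without loading all of it.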
Have you looked at the sample SFF data in Biopython? Please > use them for the BioPerl unit tests (we're been talking about a > cross project collection of test data files like this), the README > file should be self-explanatory: > https://github.com/biopython/biopython/tree/master/Tests/Roche > Yeah, I'm using those now ( https://github.com/Leont/bio-sff/blob/master/t/reader.t). I must say there were some interesting corner cases in it. Leon From p.j.a.cock at googlemail.com Mon Dec 19 11:15:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 16:15:15 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 3:48 PM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote: > >> Have you looked at the sample SFF data in Biopython? Please >> use them for the BioPerl unit tests (we're been talking about a >> cross project collection of test data files like this), the README >> file should be self-explanatory: >> https://github.com/biopython/biopython/tree/master/Tests/Roche > > Yeah, I'm using those now > (https://github.com/Leont/bio-sff/blob/master/t/reader.t). Could you a link to your /corpus/README.txt file pointing back to the Biopython original for acknowledgement and future reference? > > I must say there were some interesting corner cases in it. > I'm glad you agree - and if you can think of any more special cases to verify that would be great. Are you doing just SFF parsing for now? Not writing? Now, as to Bio::SeqIO integration, Biopython's SeqIO uses format name "sff" to mean the full read sequence (with mixed case, upper case for the good sequence, lower cases for any left/right clipping - as in the Roche tools), and "sff-trim" to mean the trimmed sequences. I would encourage you to do the same, as part of the general aim of having consistent sequence format names between BioPerl, Biopython, and EMBOSS, where possible. 
Peter From l.m.timmermans at students.uu.nl Mon Dec 19 11:47:41 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 17:47:41 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock wrote: > Could you a link to your /corpus/README.txt file pointing > back to the Biopython original for acknowledgement and > future reference? > I forgot about that, I will add it to the next release. Are you doing just SFF parsing for now? Not writing? > I haven't written the writer yet (haven't needed it so far). I'd rather release working code early instead of waiting until everything is complete. Now, as to Bio::SeqIO integration, Biopython's SeqIO uses > format name "sff" to mean the full read sequence (with mixed > case, upper case for the good sequence, lower cases for any > left/right clipping - as in the Roche tools), and "sff-trim" to mean > the trimmed sequences. I would encourage you to do the > same, as part of the general aim of having consistent > sequence format names between BioPerl, Biopython, and > EMBOSS, where possible. > I agree, consistency is good. Leon From p.j.a.cock at googlemail.com Mon Dec 19 12:00:03 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 17:00:03 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock > wrote: >> >> Could you a link to your /corpus/README.txt file pointing >> back to the Biopython original for acknowledgement and >> future reference? > > I forgot about that, I will add it to the next release. Thanks. >> Are you doing just SFF parsing for now? Not writing? > > > I haven't written the writer yet (haven't needed it so far). I'd rather > release working code early instead of waiting until everything is complete. 
I understand - but make sure you've designed the data structures
in the parser so as to allow the original record to be re-built as SFF.

>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower cases for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>
> I agree, consistency is good.

Great. I'd guess Bio::SeqIO integration would be more important
than SFF output initially.

Peter

From cjfields at illinois.edu Mon Dec 19 14:44:22 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 19 Dec 2011 19:44:22 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

Kinda joining this a little late, but I think if there is a way to have
a low-level parser/writer that generically parses the data into simple
(possibly hash-tagged) data structures, that would be best. Barring
that, a very simple class for storing data. We've found BioPerl
objects/classes pretty heavy.

(for an example of this, see Heng Li's readfq parser on github, which
has some stats for Fastq/fasta parsing).

Any way we can separate the parser from object instantiation would
enable us to optimize the object/class layer and parser/writer layers
separately, with the possible nice side effect of making the parser
more broadly used.

For instance, if someone wanted a faster parser, use the low level,
otherwise use the higher level (possibly BioPerl-specific) API. Lincoln
does this to a certain degree with Bio-samtools; I would go further and
make the bp- and non-bp code in separate dists.
Chris Sent from my iPad On Dec 19, 2011, at 11:05 AM, "Peter Cock" wrote: > On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans > wrote: >> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock >> wrote: >>> >>> Could you a link to your /corpus/README.txt file pointing >>> back to the Biopython original for acknowledgement and >>> future reference? >> >> I forgot about that, I will add it to the next release. > > Thanks. > >>> Are you doing just SFF parsing for now? Not writing? >> >> >> I haven't written the writer yet (haven't needed it so far). I'd rather >> release working code early instead of waiting until everything is complete. > > I understand - but make sure you've designed the data structures > in the parser so as to allow the original record to be re-built as SFF. > >>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses >>> format name "sff" to mean the full read sequence (with mixed >>> case, upper case for the good sequence, lower cases for any >>> left/right clipping - as in the Roche tools), and "sff-trim" to mean >>> the trimmed sequences. I would encourage you to do the >>> same, as part of the general aim of having consistent >>> sequence format names between BioPerl, Biopython, and >>> EMBOSS, where possible. >> >> I agree, consistency is good. > > Great. I'd guess Bio::SeqIO integration would be more important > that SFF output initially. 
> > Peter > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Mon Dec 19 19:28:25 2011 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 19 Dec 2011 18:28:25 -0600 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: <4EEFD6A9.3010303@illinois.edu> On 12/19/2011 10:47 AM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 5:15 PM, Peter Cockwrote: > >> Could you a link to your /corpus/README.txt file pointing >> back to the Biopython original for acknowledgement and >> future reference? >> > I forgot about that, I will add it to the next release. > > Are you doing just SFF parsing for now? Not writing? > I haven't written the writer yet (haven't needed it so far). I'd rather > release working code early instead of waiting until everything is complete. > > Now, as to Bio::SeqIO integration, Biopython's SeqIO uses >> format name "sff" to mean the full read sequence (with mixed >> case, upper case for the good sequence, lower cases for any >> left/right clipping - as in the Roche tools), and "sff-trim" to mean >> the trimmed sequences. I would encourage you to do the >> same, as part of the general aim of having consistent >> sequence format names between BioPerl, Biopython, and >> EMBOSS, where possible. >> > I agree, consistency is good. > > Leon This is already implemented in Bio::SeqIO I believe. This is the same line of thinking with the FASTQ format, that one can have a 'format-variant' combination that (as one might guess) indicates to the parser any variation of the parser so logic within the parser can deal with it. You can also pass the '-variant => "foo"' parameter as well IIRC. You would just check the variant with the variant() method. 
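
To make the 'format-variant' naming concrete, here is a small sketch of how a combined name such as "sff-trim" could be split into a format and a variant. This is only an illustration of the naming convention discussed in this thread; `split_format` is a hypothetical helper and is not claimed to be how Bio::SeqIO does it internally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split a combined
# 'format-variant' name such as "sff-trim" into its two parts.
# A name without a hyphen yields an undefined variant.
sub split_format {
    my ($name) = @_;
    my ($format, $variant) = split /-/, lc $name, 2;
    return ($format, $variant);
}

for my $name ('sff', 'sff-trim') {
    my ($format, $variant) = split_format($name);
    printf "%-8s => format '%s', variant '%s'\n",
        $name, $format, defined $variant ? $variant : '(none)';
}
```

A parser written this way would branch on the variant only where the two flavours actually differ, for example when deciding whether to apply the clipping values.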
chris

From l.m.timmermans at students.uu.nl Tue Dec 20 10:25:13 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:25:13 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock wrote:

> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
>

I did, though currently it's rather hard to make new entries from
scratch. That said, I can hardly imagine anyone wanting to do this.

> Great. I'd guess Bio::SeqIO integration would be more important
> than SFF output initially.
>

Probably. It looks like it's quite easy, it's just rather
underdocumented.

Leon

From l.m.timmermans at students.uu.nl Tue Dec 20 10:26:11 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:26:11 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <
cjfields at illinois.edu> wrote:

> Kinda joining this a little late, but I think if there is a way to have a
> low-level parser/writer that generically parses the data into simple
> (possibly hash-tagged) data structures, that would be best. Barring that,
> a very simple class for storing data. We've found BioPerl objects/classes
> pretty heavy.
>
> (for an example of this, see Heng Li's readfq parser on github, which has
> some stats for Fastq/fasta parsing).
>
> Any way we can separate the parser from object instantiation would enable
> us to optimize the object/class layer and parser/writer layers separately,
> with the possible nice side effect of making the parser more broadly used.
>
> For instance, if someone wanted a faster parser, use the low level,
> otherwise use the higher level (possibly BioPerl-specific) API.
> Lincoln does this to a certain degree with Bio-samtools; I would go further and
> make the bp- and non-bp code in separate dists.
>

A good OO system can actually help make things faster. For example, I'm
unpacking the flowspace and quality data lazily, which made scanning
through an SFF file 2.5-3 times as fast while having marginal extra
costs when you do need them.

Leon

From l.m.timmermans at students.uu.nl Tue Dec 20 10:30:54 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:30:54 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <4EEFD6A9.3010303@illinois.edu>
References: <4EEFD6A9.3010303@illinois.edu>
Message-ID:

On Tue, Dec 20, 2011 at 1:28 AM, Chris Fields wrote:

> This is already implemented in Bio::SeqIO I believe. This is the same
> line of thinking with the FASTQ format, that one can have a
> 'format-variant' combination that (as one might guess) indicates to the
> parser any variation of the parser so logic within the parser can deal with
> it. You can also pass the '-variant => "foo"' parameter as well IIRC. You
> would just check the variant with the variant() method.
>

Great. That makes life much easier :-)

Leon

From p.j.a.cock at googlemail.com Tue Dec 20 10:31:59 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 20 Dec 2011 15:31:59 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Tue, Dec 20, 2011 at 3:25 PM, Leon Timmermans wrote:
> On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock
> wrote:
>>
>> I understand - but make sure you've designed the data structures
>> in the parser so as to allow the original record to be re-built as SFF.
>
> I did, though currently it's rather hard to make new entries from scratch.
> That said, I can hardly imagine anyone wanting to do this.

Typical use cases I've found in using the Biopython SFF code are
filtering an SFF file (taking some records only), and modifying the
clipping values.
In both cases, the user isn't creating the SFF records from scratch.

Peter

From cjfields at illinois.edu Tue Dec 20 17:40:31 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Dec 2011 22:40:31 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Dec 20, 2011, at 9:26 AM, "Leon Timmermans" wrote:

> On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J wrote:
>
>> Kinda joining this a little late, but I think if there is a way to have
>> a low-level parser/writer that generically parses the data into simple
>> (possibly hash-tagged) data structures, that would be best. Barring
>> that, a very simple class for storing data. We've found BioPerl
>> objects/classes pretty heavy.
>>
>> (for an example of this, see Heng Li's readfq parser on github, which
>> has some stats for Fastq/fasta parsing).
>>
>> Any way we can separate the parser from object instantiation would
>> enable us to optimize the object/class layer and parser/writer layers
>> separately, with the possible nice side effect of making the parser
>> more broadly used.
>>
>> For instance, if someone wanted a faster parser, use the low level,
>> otherwise use the higher level (possibly BioPerl-specific) API. Lincoln
>> does this to a certain degree with Bio-samtools; I would go further and
>> make the bp- and non-bp code in separate dists.
>
> A good OO system can actually help make things faster. For example, I'm
> unpacking the flowspace and quality data lazily, which made scanning
> through an SFF file 2.5-3 times as fast while having marginal extra
> costs when you do need them.
>
> Leon

Yep, thinking about using the same approach for the Fastq variants.
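
The lazy unpacking mentioned above can be sketched roughly as follows. This is not the Bio::SFF implementation; `LazyRead` and its fields are hypothetical, and the quality data is assumed to be one unsigned byte per score purely for illustration. The point is only that the packed bytes are kept as-is and decoded on first access, so a scan that never asks for them pays almost nothing.

```perl
use strict;
use warnings;

package LazyRead;    # hypothetical class, not the Bio::SFF API

sub new {
    my ($class, %args) = @_;
    # Keep the packed bytes untouched; nothing is decoded here.
    return bless {
        name        => $args{name},
        raw_quality => $args{raw_quality},
    }, $class;
}

sub name { $_[0]{name} }

# Decode the quality scores on first use and cache the result.
sub quality {
    my ($self) = @_;
    $self->{quality} //= [ unpack 'C*', $self->{raw_quality} ];
    return $self->{quality};
}

package main;

my $read = LazyRead->new(
    name        => 'READ_A',                   # made-up read name
    raw_quality => pack('C*', 30, 31, 12),     # made-up quality bytes
);

print $read->name, "\n";                       # no unpacking happens here
print join(' ', @{ $read->quality }), "\n";    # unpacked on first access
```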
Chris

Sent from my ancient iPad b/c my laptop's borked

From dgacquer at ulb.ac.be Wed Dec 21 08:26:07 2011
From: dgacquer at ulb.ac.be (David Gacquer)
Date: Wed, 21 Dec 2011 14:26:07 +0100
Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta
Message-ID: <4EF1DE6F.4070508@ulb.ac.be>

Dear BioPerl users/developers,

I am facing a strange issue with the $seq_out->write_seq function when
using large fasta files.

I have downloaded the hg19 chromosome 1, and applied the following code
(basically I wanted to mask some regions in it but the problem also
appears when copying the sequence without modifications):

    sub main{
        my $seq_in = Bio::SeqIO->new(
            -format => 'largefasta',
            -file   => $ARGV[0]);
        my $seq_out = Bio::SeqIO->new(
            -format => 'largefasta',
            -file   => '>'.$ARGV[1]);
        my $seq_obj_in = $seq_in->next_seq();
        my $modified_seq = $seq_obj_in->seq();
        my $seq_obj_out = Bio::Seq::LargePrimarySeq->new(
            -seq  => $modified_seq,
            -id   => $seq_obj_in->id,
            -desc => $seq_obj_in->desc);
        $seq_out->write_seq($seq_obj_out);
    }

When checking the output fasta file, the sequence of chr1 is 1-bp
shorter. I have noticed that in the original fasta file, each line
contains exactly 50 nucleotides, while the output of the
$seq_out->write_seq function contains always 60 characters per line.
chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that
the very last base was missing, I created the following fasta files:

chr121.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG

chr122.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG

They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last
character being a G.
When running the above code:

chr121.out.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

chr122.out.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG

The output for the 122 bp chromosome is correct (2 lines of 60 bp and
the last line with 2 bp, AG) but for the 121 bp chromosome, the last
character is missing (2 lines of 60 bp only, last G is missing).

When replacing -format => 'largefasta' by -format => 'fasta' or writing
the output without the write_seq function however, the problem is
solved.

Am I missing something or is there a problem with the write_seq
function used with large fasta files? (I am running BioPerl on a Mac
under OS X Snow Leopard)

Best regards
David

--
David Gacquer, Ph. D.
IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium
Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be

From koraydogankaya at gmail.com Sat Dec 24 03:44:43 2011
From: koraydogankaya at gmail.com (Koray)
Date: Sat, 24 Dec 2011 00:44:43 -0800 (PST)
Subject: [Bioperl-l] exons
Message-ID: <8454dd72-4fed-41c5-977e-83d9300cd68b@z25g2000vbs.googlegroups.com>

I need explicit code for getting the exon sequences of an mRNA or gene
fetched by get_Seq_by_acc or get_Seq_by_id. In Ensembl it is easy, but
here it is not; many IOs exist. For example, how can I get such a $gene
object from the DBs (GenBank or EntrezGene) by accession number or ID?

exons

 Title   : exons()
 Usage   : @exons = $gene->exons();
           @inital_exons = $gene->exons('Initial');
 Function: Get all exon features or all exons of a specified type of
           this gene structure. Exon type is treated as a
           case-insensitive regular expression and optional. For
           consistency, use only the following types: initial,
           internal, terminal, utr, utr5prime, and utr3prime.
A special and virtual type is 'coding', which refers to all types except utr. This method basically merges the exons returned by transcripts. Returns : An array of Bio::SeqFeature::Gene::ExonI implementing objects. Args : An optional string specifying the type of exon. From challa_ghanashyam at yahoo.com Sat Dec 24 15:09:09 2011 From: challa_ghanashyam at yahoo.com (GSC) Date: Sat, 24 Dec 2011 12:09:09 -0800 (PST) Subject: [Bioperl-l] re trieve description for a list of gi ids.. Message-ID: <33034438.post@talk.nabble.com> Hi all: I am new to perl. I am working on a script to retrieve the record description (name given for a sequence record in genbank) for a list of gi ids. the script works fine for 1000 ids but my list is about 250,000 ids long and it is not working for me. Any suggestions on this. GS -- View this message in context: http://old.nabble.com/retrieve-description-for-a-list-of-gi-ids..-tp33034438p33034438.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From cjfields at illinois.edu Tue Dec 27 10:03:28 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 27 Dec 2011 15:03:28 +0000 Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta In-Reply-To: <4EF1DE6F.4070508@ulb.ac.be> References: <4EF1DE6F.4070508@ulb.ac.be> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0BB29547@CITESMBX4.ad.uillinois.edu> This is a strange one. Personally I haven't seen this behavior, but that maybe it's OS-dependent? We'll need more information, particularly what version of BioPerl you are using, the OS, version of perl, etc. Also, in general to make sure we don't lose track of this issue it is best to submit a bug report: https://redmine.open-bio.org/projects/bioperl I'm planning on triaging bugs next week, I could take a look then. 
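
To help narrow down the report, the 121/122 bp case can be reproduced without any BioPerl writer in the loop. The sketch below is plain Perl and makes no claim about what the largefasta writer does internally; `wrap_sequence` is a hypothetical helper that simply wraps a sequence at 60 columns, so its output can be compared against the files shown in the original report.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): wrap a sequence into
# fixed-width lines, 60 columns by default as in the reported output.
sub wrap_sequence {
    my ($seq, $width) = @_;
    $width ||= 60;
    my @lines;
    for (my $pos = 0; $pos < length $seq; $pos += $width) {
        push @lines, substr($seq, $pos, $width);
    }
    return @lines;
}

# Same test sequences as in the report: all A with a final G.
for my $len (121, 122) {
    my $seq   = ('A' x ($len - 1)) . 'G';
    my @lines = wrap_sequence($seq);
    my $total = 0;
    $total += length $_ for @lines;
    printf "%d bp in, %d bp out, %d lines, last line '%s'\n",
        $len, $total, scalar @lines, $lines[-1];
}
```

If the wrapped length always matches the input here but not in the largefasta output, that points at the writer's handling of a final line that holds a single base.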
chris ________________________________________ From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of David Gacquer [dgacquer at ulb.ac.be] Sent: Wednesday, December 21, 2011 7:26 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta Dear BioPerl users/developers, I am facing a strange issue with the $seq_out->write_seq function when using large fasta files I have downloaded the hg19 chromosome 1, and applied the following code (basically I wanted to mask some regions in it but the problem also appears when copying the sequence without modifications): sub main{ my $seq_in = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0]); my $seq_out = Bio::SeqIO->new( -format => 'largefasta', -file => '>'.$ARGV[1]); my $seq_obj_in = $seq_in->next_seq(); my $modified_seq = $seq_obj_in->seq(); my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq => $modified_seq, -id => $seq_obj_in->id, -desc => $seq_obj_in->desc); $seq_out->write_seq($seq_obj_out); } when checking the output fasta file, the sequence of chr1 is 1-bp shorter. I have noticed that in the original fasta file, each line contains exactly 50 nucleotides, while the output of the $seq_out->write_seq function contains always 60 characters per line. chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that the very last base was missing, I created the following fasta files: chr121.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAG chr122.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAG They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last character being a G. 
When running the above code: chr121.out.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA chr122.out.fa >chrA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AG The output for the 122 bp chromosome is correct (2 lines of 60 bp and the last line with 2 bp, AG) but for the 121 bp chromosome, the last character is missing (2 lines of 60 bp only, last G is missing). When replacing -format => 'largefasta' by -format => 'fasta' or writing the output without the write_seq function however, the problem is solved. Am I missing something or is there a problem with the write_seq function used with large fasta files? (I am running BioPerl on a Mac under OS X Snow Leopard) Best regards David -- David Gacquer, Ph. D. IRIBHM - Universite Libre de Bruxelles Bldg C, room C.4.117 ULB, Campus Erasme, CP602 808 route de Lennik B-1070 Brussels Belgium Phone: +32-2-555 4187 Fax: +32-2-555 4655 E-mail: dgacquer at ulb.ac.be _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From jdeuts01 at students.poly.edu Thu Dec 1 14:09:19 2011 From: jdeuts01 at students.poly.edu (jdeuts01 at students.poly.edu) Date: Thu, 1 Dec 2011 14:09:19 +0000 Subject: [Bioperl-l] question Message-ID: Dear Bioperl, This is my first experience with bioperl and I need help please. 1. The version of bioperl is 1.6.1 and have also installed Bundle-BioPerl 2.1.8 and mGen 1.03. I was unable to install Bribes and trouchelle DB. Will this prevent the BioPerl package from functioning correctly? 2. The operating platform is windows 7 - 64 bit using ActiveState - Perl v5.12.2 3. 
The script is as follows: #!/usr/bin/perl # Write a script using OOP to write protein sequences to the file fasta.txt.use strict;use warnings;use Bio::SeqIO::fasta; # Declare and initialize input and output files.my $protein_fasta = "protein.fa";my $protein_out = ">fasta.txt"; # Setup objects for input and output.my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format => 'Fasta');my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'fasta'); # Establish while loop using "next_seq" method # to read in multiple sequences from "protein.fa" file# one by one until none were remaining.while(my $seq = $seq_in -> next_seq){ $seq_out->write_seq($seq);} The information is successfully written to the file: fasta.txt. 4. Receiving the following error messages: Replacement list is longer than search list at C:/Perl64/site/lib/Bio/Range.pm line 251.Subroutine _initialize redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 92.Subroutine next_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 112.Subroutine write_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 193.Subroutine width redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 272.Subroutine preferred_id_type redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 295. Thanks in advance for your help.John Deutsch From jboddu at illinois.edu Thu Dec 1 16:38:00 2011 From: jboddu at illinois.edu (Boddu, Jayanand) Date: Thu, 1 Dec 2011 16:38:00 +0000 Subject: [Bioperl-l] Chromosome coordinates Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu> Hello I am newbie to Perl scripts. I have a file with short reads mapped to the MAIZE genome The format is a simple BLASTN output. 

READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2

As expected, the same read maps to multiple places on the same or different chromosomes.
I have a GFF file with annotated coordinates.
I would like to run a Perl script to find out which READS fall within the GENES in the GFF file and which do not.
The anticipated script should:

1. Take the READ coordinates on the genome (by chromosome);
2. Go to the GFF file;
3. Find the Chromosome;
4. Find the GENE (by coordinates);
5. and report READ, its coordinates, Chromosome, GENE, and its coordinates.

It doesn't need to be in the same order.
After this, I guess I could use a simple Microsoft ACCESS query to pull out READS that are not mapped to the GENEs.
I would greatly appreciate it if anyone has a script that does more or less a similar job.
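
The five steps above can be sketched in plain Perl roughly as follows. This is only a minimal illustration under stated assumptions, not a BioPerl API: the helper names load_genes and annotate_hits are hypothetical, the gene IDs geneA/geneB and their coordinates are invented for the example, the hit table is assumed to be whitespace-separated in the column order shown above, and the GFF lines are assumed to be tab-separated with an ID= attribute. Hits on the minus strand (Chr Start greater than Chr End, as in several rows above) are normalized before the overlap test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse GFF text into { chromosome => [ [start, end, gene_id], ... ] }.
# Only rows whose type column (3rd) is "gene" are kept.
sub load_genes {
    my ($gff_text) = @_;
    my %genes;
    for my $line (split /\n/, $gff_text) {
        next if $line =~ /^#/ or $line !~ /\S/;
        my @col = split /\t/, $line;
        next unless @col >= 9 and $col[2] eq 'gene';
        my $id = $col[8] =~ /ID=([^;]+)/ ? $1 : 'unknown';
        push @{ $genes{ $col[0] } }, [ @col[3, 4], $id ];
    }
    return \%genes;
}

# Return one tab-separated report line per hit/gene overlap,
# or a single "no_gene" line when a hit overlaps nothing.
sub annotate_hits {
    my ($blast_text, $genes) = @_;
    my @report;
    for my $line (split /\n/, $blast_text) {
        next if $line !~ /\S/ or $line =~ /^READ_ID/;    # skip blanks and header
        my ($read, $chr, $start, $end) = (split ' ', $line)[0, 1, 8, 9];
        ($start, $end) = ($end, $start) if $start > $end;    # minus-strand hit
        my @overlap = grep { $_->[0] <= $end and $_->[1] >= $start }
                      @{ $genes->{$chr} || [] };
        if (@overlap) {
            push @report, join("\t", $read, $chr, $start, $end, @{$_}[2, 0, 1])
                for @overlap;
        }
        else {
            push @report, join("\t", $read, $chr, $start, $end, 'no_gene');
        }
    }
    return @report;
}

# Made-up annotation: geneA and geneB are hypothetical example genes.
my $gff = join("\n",
    join("\t", 'chrPt', 'demo', 'gene', 35000, 36000, '.', '+', '.', 'ID=geneA'),
    join("\t", 'chr6',  'demo', 'gene', 1,     1000,  '.', '-', '.', 'ID=geneB'),
);

# Two rows taken from the table above.
my $hits = join("\n",
    'READ1 chrPt 100 17 0 0 1 17 35021 35037 0.21 34.2',
    'READ1 chr6 100 17 0 0 1 17 160769803 160769787 0.21 34.2',
);

print "$_\n" for annotate_hits($hits, load_genes($gff));
```

With the invented annotation, the first hit is reported inside geneA and the second as no_gene. A linear scan over every gene per hit is fine for small inputs; for a genome-sized annotation an indexed store would likely be preferable.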

Thanks
Jay

From scott at scottcain.net Thu Dec 1 16:59:56 2011
From: scott at scottcain.net (Scott Cain)
Date: Thu, 1 Dec 2011 11:59:56 -0500
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
Message-ID:

Hi Jay,

Since the maize GFF file is likely to be fairly large, I would consider putting it in a database, using either Bio::DB::GFF if it is GFF2 or Bio::DB::SeqFeature::Store if it is GFF3. Then you can use the methods that come along with either of those modules to search regions for genes. They both support a get_features_by_location method, so you could get the range for each of the regions you want to look at, and check the database with that method to see if anything is there.

Scott

On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand wrote:

> Hello
> I am newbie to Perl scripts.
> I have a file with short reads mapped to the MAIZE genome
> The format is a simple BLASTN output.
>
> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>
> As expected the same read maps to multiple places on the same/different
> chromosome.
> I have a GFF file with annotated coordinates.
> I would like to run a PERL script to find out READS that are within the
> GENES in the GFF file and that are not.
> The anticipated script should;
>
> 1. Take the READ coordinates on the genome (by chromosome);
>
> 2. Go the GFF file;
>
> 3. Find the Chromosome;
>
> 4. Find the GENE (by coordinates);
>
> 5.
and report READ-its coordinates-Chromosome-GENE-and its > coordinates. > > It doesn't need to be in the same order. > After this, I guess I could use simple Microsoft ACCESS query to pull out > READS that are not mapped to the GENEs. > I would greatly appreciate if anyone can has a script that more or less > similar job. > > Thanks > Jay > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- ------------------------------------------------------------------------ Scott Cain, Ph. D. scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research From jason.stajich at gmail.com Thu Dec 1 17:31:29 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Thu, 1 Dec 2011 09:31:29 -0800 Subject: [Bioperl-l] Chromosome coordinates In-Reply-To: References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu> Message-ID: <470A08F7-E238-44E1-B0D1-69FDBC0BA2A3@gmail.com> You might try using BEDtools, intersectBED which can do a lot of what you are doing in simple command line program. Jason On Dec 1, 2011, at 8:59 AM, Scott Cain wrote: > Hi Jay, > > Since the maize GFF file is likely to be fairly large, I would consider > putting it in a database, using either Bio::DB::GFF if it is GFF2 or > Bio::DB::SeqFeature::Store if it is gff3. Then you can use the methods > that come along with either of those modules to search regions for for > genes. They both support a get_features_by_location method, so you could > get the range for each of the regions you want to look at, and check the > database with that method to see if anything is there. > > Scott > > > On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand wrote: > >> Hello >> I am newbie to Perl scripts. >> I have a file with short reads mapped to the MAIZE genome >> The format is a simple BLASTN output. 
>>
>> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
>> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
>> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
>> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
>> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
>> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
>> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
>> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
>> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
>> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
>> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
>> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>>
>> As expected the same read maps to multiple places on the same/different
>> chromosome.
>> I have a GFF file with annotated coordinates. >> I would like to run a PERL script to find out READS that are within the >> GENES in the GFF file and that are not. >> The anticipated script should; >> >> 1. Take the READ coordinates on the genome (by chromosome); >> >> 2. Go the GFF file; >> >> 3. Find the Chromosome; >> >> 4. Find the GENE (by coordinates); >> >> 5. and report READ-its coordinates-Chromosome-GENE-and its >> coordinates. >> >> It doesn't need to be in the same order. >> After this, I guess I could use simple Microsoft ACCESS query to pull out >> READS that are not mapped to the GENEs. >> I would greatly appreciate if anyone can has a script that more or less >> similar job. >> >> Thanks >> Jay >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > > -- > ------------------------------------------------------------------------ > Scott Cain, Ph. D. scott at scottcain dot > net > GMOD Coordinator (http://gmod.org/) 216-392-3087 > Ontario Institute for Cancer Research > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jovel_juan at hotmail.com Thu Dec 1 17:36:32 2011 From: jovel_juan at hotmail.com (Juan Jovel) Date: Thu, 1 Dec 2011 17:36:32 +0000 Subject: [Bioperl-l] Error when using SearchIO In-Reply-To: References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>, Message-ID: Hello Everybody! I have been running endless standalone Blasts these days, in two different clusters. The first one runs on Rock Linux, while the other one on Devian. In the Rock Linux one, when I use SeachIO to parse my blastp or blastx reports, I get invariably a message: "Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251" What it does mean? 
Would it have any effect on my parsing results?
Thanks,
JUAN

From cjfields at illinois.edu Thu Dec 1 19:03:45 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Dec 2011 19:03:45 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To:
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
Message-ID: <7233AC75-E03B-401A-8D0A-260ED76956A4@illinois.edu>

On Dec 1, 2011, at 11:36 AM, Juan Jovel wrote:

> Hello Everybody!
> I have been running endless standalone Blasts these days, in two different clusters. The first one runs on Rock Linux, while the other one on Devian. In the Rock Linux one, when I use SeachIO to parse my blastp or blastx reports, I get invariably a message:
> "Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
> What it does mean? Would it have any effect on my parsing results?
> Thanks,
> JUAN

This is a bug that was fixed, I think this was in the latest BioPerl release (1.6.901). There was a transliteration error that produced this warning, but otherwise it's harmless. The warning is perl version dependent, I think it pops up in perl 5.12 and up.

This user post (about halfway down) runs into the same issue:

http://www.sysarchitects.com/bioperl.

chris

From David.Messina at sbc.su.se Thu Dec 1 22:02:20 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Thu, 1 Dec 2011 23:02:20 +0100
Subject: [Bioperl-l] re trieving blast multiple alignment in fasta form
In-Reply-To: <32886592.post@talk.nabble.com>
References: <32886592.post@talk.nabble.com>
Message-ID:

Hi Eric,

Wait, do you want multiple pairwise alignments in your output FASTA file, or a single multiple alignment of your query and all the hits?

If the former, get_aln() will give you one pairwise alignment per hsp, but you'll need to move the output file creation statement (my $alnIO = ...) before the loops so it gets created only once.
Then, when you do the write statement ($alnIO->write_aln($aln);), all of the alignments will go to the same file.

If on the other hand you'd like to have a multiple alignment between a query and all of its hits, you'll have to take the IDs of the hits, pull the corresponding sequences out of the database, and then run a multiple alignment algorithm on them.

Dave

From scuoppo at gmail.com Fri Dec 2 22:50:28 2011
From: scuoppo at gmail.com (Claudio Scuoppo)
Date: Fri, 2 Dec 2011 17:50:28 -0500
Subject: [Bioperl-l] List of genes from genomic intervals
Message-ID:

Hi,

I am new to BioPerl. I was wondering what's the best strategy to get the genes contained in a series of human genomic intervals.
Basically, I have a table with:

Chromosome Start End

Which module should I be looking at?
Thanks,
Claudio

From awitney at sgul.ac.uk Mon Dec 5 11:09:39 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 5 Dec 2011 11:09:39 +0000
Subject: [Bioperl-l] Bio::Graphics imagemap and padding
Message-ID: <44A27378-CC4B-4396-817E-AA31004847C7@sgul.ac.uk>

Hi,

Image maps seem to be out of position if you use padding in the Panel, like this:

my $panel = Bio::Graphics::Panel->new(
    ...
    -pad_left => 20,
    -pad_right => 20
    ...
);

Without these options, the image map is fine. Is this a known issue?

Also on a side note, I noticed that when using Bio::Graphics with Dancer, some of the CGI code was blocking somewhere (I found a reference to a similar problem with CGI and Catalyst), swapping CGI with HTML::Entities fixes it:

sub create_web_map {
    ...
    eval "require HTML::Entities" unless HTML::Entities->can('encode_entities');
    ...
    my $title = HTML::Entities::encode_entities($self->make_link($tr,$feature,1));
    my $target = HTML::Entities::encode_entities($self->make_link($tgr,$feature,1));
    ...
}

Thanks

Adam

From momin.amin at gmail.com Mon Dec 5 23:00:23 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Mon, 5 Dec 2011 15:00:23 -0800 (PST)
Subject: [Bioperl-l] SimpleAlign and consensus_string
Message-ID:

Hi ,

I am generating a consensus sequence by aligning two protein homologs using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to understand the criteria consensus_string() method of simpleAlign uses to determine the consensus at position with dissimilar aminoacids/ nucleotide. Also how would the % cutoffs provided to consensus_string() affect the outcome.

Thanks,
Amin

From jason.stajich at gmail.com Mon Dec 5 23:58:59 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Mon, 5 Dec 2011 15:58:59 -0800
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To:
References:
Message-ID: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>

There are several methods that do related things.

Below is the documentation for the method - it allows a more strict calling besides majority rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string which is useful for DNA consensus generating patterns.

If you want a summary as to whether or not there are only conservative amino acid changes in the column use match_line method which generates the similar string that CLUSTALW reports as a summary string for each column.

=head2 consensus_string

 Title     : consensus_string
 Usage     : $str = $ali->consensus_string($threshold_percent)
 Function  : Makes a strict consensus
 Returns   : Consensus string
 Argument  : Optional threshold ranging from 0 to 100.
             The consensus residue has to appear at least threshold %
             of the sequences at a given location, otherwise a '?'
             character will be placed at that location.
             (Default value = 0%)

=cut

On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:

> Hi ,
>
> I am generating a consensus sequence by aligning two protein homologs
> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
> understand the criteria consensus_string() method of simpleAlign uses
> to determine the consensus at position with dissimilar aminoacids/
> nucleotide. Also how would the % cutoffs provided to
> consensus_string() affect the outcome.
>
>
> Thanks,
> Amin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From wenbinmei at gmail.com Tue Dec 6 16:09:35 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 11:09:35 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
Message-ID:

Hi,

I have a question about revcom the multiple sequence alignment. One way I can do convert the format into fasta and revcom individual sequences. I wonder is there a easy way to convert the multiple sequence alignment as a whole. Thank you for help.

-best,
wenbin

From jason.stajich at gmail.com Tue Dec 6 17:40:37 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Tue, 6 Dec 2011 09:40:37 -0800
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To:
References:
Message-ID:

I think this would work to update it in place though I haven't tried it myself

for my $seq ( $aln->each_seq ) {
    $seq->seq( $seq->revcom->seq );
}
$out->write_aln($aln);

This may also work - not entirely sure if there is any extra work done on the meta data (start/end) of the Seq object when this is done. You may want to flip start/end for the sequences (the seqs are Bio::LocatableSeq objects) explicitly if not. Or you may not care about those data and can ignore.

$seq = $seq->revcom

Jason

On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:

> Hi,
>
> I have a question about revcom the multiple sequence alignment. One way I
> can do convert the format into fasta and revcom individual sequences. I
> wonder is there a easy way to convert the multiple sequence alignment as a
> whole. Thank you for help.
>
> -best,
> wenbin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From wenbinmei at gmail.com Tue Dec 6 17:51:18 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 12:51:18 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To:
References:
Message-ID:

I think I might not explain clearly my questions. I extract the individual gene alignment from the whole genome alignment. Since some gene are on the reverse strand, I want to revcom the gene alignment. There is part of my scripts. I can read the strand information from another file.

my $newstart = $refseq->column_from_residue_number($start);
my $newend = $refseq->column_from_residue_number($end);
$seq{$genename} = $aln->slice($newstart, $newend);

Any suggestion to help me revcom some gene alignment on the minus strand is helpful. Thank you.

-best,
wenbin

On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich wrote:

> I think this would work to update it in place though I haven't tried it
> myself
>
> for my $seq ( $aln->each_seq ) {
>     $seq->seq( $seq->revcom->seq );
> }
> $out->write_aln($aln);
>
> This may also work - not entirely sure if there is any extra work done on
> the meta data (start/end) of the Seq object when this is done. You may
> want to flip start/end for the sequences (the seqs are Bio::LocatableSeq
> objects) explicitly if not. Or you may not care about those data and can
> ignore.
>
> $seq = $seq->revcom
>
> Jason
> On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:
>
> > Hi,
> >
> > I have a question about revcom the multiple sequence alignment. One way I
> > can do convert the format into fasta and revcom individual sequences.
I > > wonder is there a easy way to convert the multiple sequence alignment as > a > > whole. Thank you for help. > > > > -best, > > wenbin > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Wenbin Mei Ph.D. Student Dr. Brad Barbazuk's Lab Department of Biology University of Florida 509-899-3067 wmei at ufl.edu From kellert at ohsu.edu Tue Dec 6 18:21:39 2011 From: kellert at ohsu.edu (Tom Keller) Date: Tue, 6 Dec 2011 10:21:39 -0800 Subject: [Bioperl-l] Bioperl-l Digest, Vol 104, Issue 3 In-Reply-To: References: Message-ID: I'd start by looking for the section "Searching for genes in genomic DNA" in the HOWTO:Beginners - BioPerl website. Thomas (Tom) Keller, PhD kellert at ohsu.edu 503.494.2442 6588 R Jones Hall (BSc/CROET) MMI DNA Services Member of OHSU Shared Resources On Dec 3, 2011, at 9:00 AM, wrote: > Send Bioperl-l mailing list submissions to > bioperl-l at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.open-bio.org/mailman/listinfo/bioperl-l > or, via email, send a message with subject or body 'help' to > bioperl-l-request at lists.open-bio.org > > You can reach the person managing the list at > bioperl-l-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Bioperl-l digest..." > > > Today's Topics: > > 1. List of genes from genomic intervals (Claudio Scuoppo) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 2 Dec 2011 17:50:28 -0500 > From: Claudio Scuoppo > Subject: [Bioperl-l] List of genes from genomic intervals > To: bioperl-l at lists.open-bio.org > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Hi, > > I am new to BioPerl. I was wondering what`s the best strategy to get > the genes contained in a a series of human genomic interval. 
> Basically, I have a table with: > > Chromosome Start End > > Which module should I be looking at? > Thanks, > Claudio > > > ------------------------------ > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > End of Bioperl-l Digest, Vol 104, Issue 3 > ***************************************** From wenbinmei at gmail.com Tue Dec 6 22:54:51 2011 From: wenbinmei at gmail.com (wenbin mei) Date: Tue, 6 Dec 2011 17:54:51 -0500 Subject: [Bioperl-l] revcom the multiple sequence alignment In-Reply-To: References: Message-ID: Figured out! Thanks for help. -best, wenbin On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich wrote: > I think this would work to update it in place though I haven't tried it > myself > > for my $seq ( $aln->each_seq ) { > $seq->seq( $seq->revcom->seq ); > } > $out->write_aln($aln); > > This may also work - not entirely sure if there is any extra work done on > the meta data (start/end) of the Seq object when this is done. You may > want to flip start/end for the sequences (the seqs are Bio::LocatableSeq > objects) explicitly if not. Or you may not care about those data and can > ignore. > > $seq = $seq->revcom > > Jason > On Dec 6, 2011, at 8:09 AM, wenbin mei wrote: > > > Hi, > > > > I have a question about revcom the multiple sequence alignment. One way I > > can do convert the format into fasta and revcom individual sequences. I > > wonder is there a easy way to convert the multiple sequence alignment as > a > > whole. Thank you for help. > > > > -best, > > wenbin > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Wenbin Mei Ph.D. Student Dr. 
Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu

From momin.amin at gmail.com Tue Dec 6 17:37:16 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Tue, 6 Dec 2011 11:37:16 -0600
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
References: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
Message-ID:

Thanks Jason

On Mon, Dec 5, 2011 at 5:58 PM, Jason Stajich wrote:

> There are several methods that do related things.
>
> Below is the documentation for the method - it allows a more strict calling besides majority rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string which is useful for DNA consensus generating patterns.
>
> If you want a summary as to whether or not there are only conservative amino acid changes in the column use match_line method which generates the similar string that CLUSTALW reports as a summary string for each column.
>
> =head2 consensus_string
>
>  Title     : consensus_string
>  Usage     : $str = $ali->consensus_string($threshold_percent)
>  Function  : Makes a strict consensus
>  Returns   : Consensus string
>  Argument  : Optional threshold ranging from 0 to 100.
>              The consensus residue has to appear at least threshold %
>              of the sequences at a given location, otherwise a '?'
>              character will be placed at that location.
>              (Default value = 0%)
>
> =cut
>
> On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:
>
>> Hi ,
>>
>> I am generating a consensus sequence by aligning two protein homologs
>> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
>> understand the criteria consensus_string() method of simpleAlign uses
>> to determine the consensus at position with dissimilar aminoacids/
>> nucleotide. Also how would the % cutoffs provided to
>> consensus_string() affect the outcome.
>>
>>
>> Thanks,
>> Amin
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From sunwukong at potc.net Wed Dec 7 19:05:20 2011
From: sunwukong at potc.net (sunwukong)
Date: Wed, 07 Dec 2011 11:05:20 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
Message-ID: <4EDFB8F0.8080001@potc.net>

I am not a medical professional but I have two DNA related questions.

A year or so ago I realized that if the standard building blocks of life were the amino acids GATC then they could be represented as a base 4 number system (e.g., 0,1,2 and 3). Then any life form could be represented by a number (it would be very long). So I set out on a quest to do this with a small life form. For fun I chose the Spanish Flu which I believe I found on an NIH site. Then I set out and realized that there was no standard. And I did not know if the number would be built with the most significant digit on the left or right.

1. Is there a standard method for representing the ATCD molecules as numbers g = 0 a = 1 t = 2 c = 3

2. is the sequence read left to right or right to left?

note: It may be biologically significant if the right values are assigned to the letters GATC, there could be a pattern somewhere that holds significant information. One idea might be to look at DNA sequences in bases other than 4 to see if something jumps out.
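
The base-4 idea above can be tried out with a few lines of Perl. This is only a hedged sketch: the digit assignment (g = 0, a = 1, t = 2, c = 3) is the one proposed in this message and not an established standard, the function names dna_to_base4 and base4_to_integer are made up for illustration, and treating the leftmost base as the most significant digit is an assumption (sequences are conventionally written left to right in the 5' to 3' direction). Math::BigInt, a core Perl module, is used because real sequences quickly exceed native integer size.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Math::BigInt;

# Digit assignment proposed in the message above (an arbitrary choice).
my %digit = (G => 0, A => 1, T => 2, C => 3);

# Turn a DNA string into a string of base-4 digits, left to right.
sub dna_to_base4 {
    my ($dna) = @_;
    my $seq = uc $dna;
    die "unexpected character in sequence\n" if $seq =~ /[^GATC]/;
    return join '', map { $digit{$_} } split //, $seq;
}

# Interpret a base-4 digit string as one big integer,
# with the leftmost digit as the most significant one.
sub base4_to_integer {
    my ($digits) = @_;
    my $n = Math::BigInt->new(0);
    $n = $n * 4 + $_ for split //, $digits;
    return $n;
}

my $digits = dna_to_base4('GATTACA');
my $value  = base4_to_integer($digits);
print "$digits\n";    # prints 0122131
print "$value\n";     # prints 1693
```

One caveat of the single-number view: leading G bases map to leading zeros, so "GA" and "A" would yield the same integer. Keeping the digit string (or the sequence length) alongside the number avoids losing that information.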
http://www.insectscience.org/2.10/ref/fig5a.gif VR Pat Kirol 509 442-2214 From Russell.Smithies at agresearch.co.nz Wed Dec 7 21:59:18 2011 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Thu, 8 Dec 2011 10:59:18 +1300 Subject: [Bioperl-l] DNA Sequencing two questions In-Reply-To: <4EDFB8F0.8080001@potc.net> References: <4EDFB8F0.8080001@potc.net> Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve. I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions? Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes? But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery. But don't let this stop you uncovering the great secret hidden in our genes :-) On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html --Russell > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of sunwukong > Sent: Thursday, 8 December 2011 8:05 a.m. > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] DNA Sequencing two questions > > I am not a medical professional but I have two DNA related questions. 
> > A year or so ago I realized that if the standard building blocks of life were the > amino acids GATC then they could be represented as a base 4 number > system (e.g., 0,1,2 and 3). Then any life form could be represented by a > number (it would be very long). So I set out on a quest to do this with a small > life form. For fun I chose the Spanish Flu which I believe I found on an NIH > site. Then I set out and realized that there was no standard. And I did not > know if the number would be built with the most significant digit on the left > or right. > > 1. Is there a standard method for representing the ATCD molecules as > numbers g = 0 a = 1 t = 2 c = 3 > > 2. is the sequence read left to right or right to left? > > note: It may be biologically significant if the right values are assigned to the > letters GATC, there could be a pattern somewhere that holds significant > information. One idea might be to look at DNA sequences in bases other > than 4 to see if something jumps out. > > http://www.insectscience.org/2.10/ref/fig5a.gif > > VR > Pat Kirol > 509 442-2214 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. 
======================================================================= From jason.stajich at gmail.com Wed Dec 7 22:53:10 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 7 Dec 2011 14:53:10 -0800 Subject: [Bioperl-l] DNA Sequencing two questions In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> References: <4EDFB8F0.8080001@potc.net> <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> Message-ID: <9BDA2BFA-264F-45B2-9234-A6BC9402FBA8@gmail.com> For other fun picture games -- You can look at patterns of motifs/words in a chaos game representation of genomes. http://mbe.oxfordjournals.org/content/16/10/1391.long http://mbe.oxfordjournals.org/content/20/6/901.long On Dec 7, 2011, at 1:59 PM, Smithies, Russell wrote: > I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve. > I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions? Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png > Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes? > > But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery. 
> > But don't let this stop you uncovering the great secret hidden in our genes :-) > > On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html > > --Russell > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of sunwukong >> Sent: Thursday, 8 December 2011 8:05 a.m. >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] DNA Sequencing two questions >> >> I am not a medical professional but I have two DNA related questions. >> >> A year or so ago I realized that if the standard building blocks of life were the >> amino acids GATC then they could be represented as a base 4 number >> system (e.g., 0,1,2 and 3). Then any life form could be represented by a >> number (it would be very long). So I set out on a quest to do this with a small >> life form. For fun I chose the Spanish Flu which I believe I found on an NIH >> site. Then I set out and realized that there was no standard. And I did not >> know if the number would be built with the most significant digit on the left >> or right. >> >> 1. Is there a standard method for representing the ATCD molecules as >> numbers g = 0 a = 1 t = 2 c = 3 >> >> 2. is the sequence read left to right or right to left? >> >> note: It may be biologically significant if the right values are assigned to the >> letters GATC, there could be a pattern somewhere that holds significant >> information. One idea might be to look at DNA sequences in bases other >> than 4 to see if something jumps out. 
>> >> http://www.insectscience.org/2.10/ref/fig5a.gif >> >> VR >> Pat Kirol >> 509 442-2214 >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. > ======================================================================= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Russell.Smithies at agresearch.co.nz Thu Dec 8 00:29:47 2011 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Thu, 8 Dec 2011 13:29:47 +1300 Subject: [Bioperl-l] DNA Sequencing two questions In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> References: <4EDFB8F0.8080001@potc.net> <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz> Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF245@exchsth.agresearch.co.nz> I tried again and came up with this: http://www.bioperl.org/w/images/7/7a/Autostereogram.png If you look carefully, you can see the answer to life, the universe, and everything!! --Russell > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Smithies, Russell > Sent: Thursday, 8 December 2011 10:59 a.m. 
> To: 'sunwukong'; bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] DNA Sequencing two questions > > I did something similar a few years ago (after watching the movie "Contact" I > think) and encoded codons as RGB values and drew an image of a genome. > Looked much like random noise but I might try it again and draw as a space > filling curve. > I guess if you're looking for "hidden messages", why restrict yourself to 2 > dimensions? Perhaps something pops out as a single-image stereogram eg. > http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Ra > ndom_Dot_Shark.png > Perhaps it's a 3D "object" represented by slices drawn in a series of 2D > planes? > > But you need a bit of biological background as there will be patterns simply > because of the way genes "work" and are laid out in chromosomes. You > need to remember that DNA is effectively a 2D representation of a 3D > protein structure and there is already much hidden information we know we > don't understand - a "simple" task like how proteins fold is barely understood > and why some become prions is still a mystery. > > But don't let this stop you uncovering the great secret hidden in our genes :-) > > On a similar note, have a look at http://medgadget.com/2011/10/send-your- > secret-message-hidden-in-bacteria.html > > --Russell > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of sunwukong > > Sent: Thursday, 8 December 2011 8:05 a.m. > > To: bioperl-l at bioperl.org > > Subject: [Bioperl-l] DNA Sequencing two questions > > > > I am not a medical professional but I have two DNA related questions. > > > > A year or so ago I realized that if the standard building blocks of > > life were the amino acids GATC then they could be represented as a > > base 4 number system (e.g., 0,1,2 and 3). Then any life form could be > > represented by a number (it would be very long). 
So I set out on a > > quest to do this with a small life form. For fun I chose the Spanish > > Flu which I believe I found on an NIH site. Then I set out and > > realized that there was no standard. And I did not know if the number > > would be built with the most significant digit on the left or right. > > > > 1. Is there a standard method for representing the ATCD molecules as > > numbers g = 0 a = 1 t = 2 c = 3 > > > > 2. is the sequence read left to right or right to left? > > > > note: It may be biologically significant if the right values are > > assigned to the letters GATC, there could be a pattern somewhere that > > holds significant information. One idea might be to look at DNA > > sequences in bases other than 4 to see if something jumps out. > > > > http://www.insectscience.org/2.10/ref/fig5a.gif > > > > VR > > Pat Kirol > > 509 442-2214 > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > ========================================================== > ============= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities to which > it is addressed and may contain confidential and/or privileged material. Any > review, retransmission, dissemination or other use of, or taking of any action > in reliance upon, this information by persons or entities other than the > intended recipients is prohibited by AgResearch Limited. If you have received > this message in error, please notify the sender immediately. 
> ========================================================== > ============= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From lumos.lumos.lumos at gmail.com Fri Dec 9 16:47:36 2011 From: lumos.lumos.lumos at gmail.com (lumos lumos) Date: Fri, 9 Dec 2011 08:47:36 -0800 Subject: [Bioperl-l] Mouse->Human homologues ? Message-ID: Hello, Is there a way to get human homologues for a mouse gene list where I get all human genes(symbols) as text output ? Thank you LM From cjfields at illinois.edu Fri Dec 9 17:17:20 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 9 Dec 2011 17:17:20 +0000 Subject: [Bioperl-l] Mouse->Human homologues ? In-Reply-To: References: Message-ID: There are lots of databases that have this capability (ensembl, orthodb, homologene, oma, to name only a few). Have you tried a simple search for this, or did you want expert opinion on the matter? chris PS - Just to note, there is a lot of controversy swirling about re: the ortholog conjecture and some recently published papers calling it into question using human-mouse data, worth a look if you're trotting this path to know the current situation. If you have access to F1000, see the following (paper itself is open :) Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. F1000.com/12462957 On Dec 9, 2011, at 10:47 AM, lumos lumos wrote: > Hello, > > Is there a way to get human homologues for a mouse gene list where I get > all human genes(symbols) as text output ? 
> > Thank you > LM > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From lumos.lumos.lumos at gmail.com Fri Dec 9 17:29:24 2011 From: lumos.lumos.lumos at gmail.com (lumos lumos) Date: Fri, 9 Dec 2011 09:29:24 -0800 Subject: [Bioperl-l] Mouse->Human homologues ? In-Reply-To: References: Message-ID: Hi Chris, Thanks for your reply. I wanted to know if there is anyway you can do it via script/automatically in perl for a list of mouse genes whose human homologues I require. LM On Fri, Dec 9, 2011 at 9:17 AM, Fields, Christopher J wrote: > There are lots of databases that have this capability (ensembl, orthodb, > homologene, oma, to name only a few). Have you tried a simple search for > this, or did you want expert opinion on the matter? > > chris > > PS - Just to note, there is a lot of controversy swirling about re: the > ortholog conjecture and some recently published papers calling it into > question using human-mouse data, worth a look if you're trotting this path > to know the current situation. If you have access to F1000, see the > following (paper itself is open :) > > Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. > Testing the ortholog conjecture with comparative functional genomic data > from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: > 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. > F1000.com/12462957 > > On Dec 9, 2011, at 10:47 AM, lumos lumos wrote: > > > Hello, > > > > Is there a way to get human homologues for a mouse gene list where I get > > all human genes(symbols) as text output ? 
> > > > Thank you > > LM > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From lumos.lumos.lumos at gmail.com Thu Dec 8 04:47:19 2011 From: lumos.lumos.lumos at gmail.com (lumos lumos) Date: Wed, 7 Dec 2011 20:47:19 -0800 Subject: [Bioperl-l] Perl parsing Message-ID: Hello, I have a text file(tab-delim) with some gene names as shown below. *BRCA1: breast cancer 1, early onset TNF: tumor necrosis factor OMG: oligodendrocyte myelin glycoprotein* I would like to get the list of gene name BRCA1,TNF,OMG that is before the colon(:) . How do I parse in perl this text file with this list of genes? Thanks in advance. LM From b.m.forde at umail.ucc.ie Fri Dec 9 16:52:56 2011 From: b.m.forde at umail.ucc.ie (BForde) Date: Fri, 9 Dec 2011 08:52:56 -0800 (PST) Subject: [Bioperl-l] Genbank files Message-ID: <32941955.post@talk.nabble.com> Hello all, I am new to Bioperl so I apologise if this is stupid question. For CDS features I which to add additional qualifiers e.g. /colour and /note qualifiers. I have looked at the BioPerl wiki but am still unsure as how to do this? regards Brian -- View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From jboddu at illinois.edu Fri Dec 9 19:59:39 2011 From: jboddu at illinois.edu (Boddu, Jayanand) Date: Fri, 9 Dec 2011 19:59:39 +0000 Subject: [Bioperl-l] Batch processing of Data Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu> Hi Anyone: Please let me know if the following is practical with PERL. My data output can be described as following. 1. Hundreds of samples are run. 2. A batch output sends data from each sample to its own "folder". Output is in the form of few text files, spreadsheets and PDF files. 3. One of the spreadsheet has the data of most interest. 4. 
This means I end up having hundreds of folders. 5. The spreadsheet with the data has multiple worksheets out of which a couple have the interesting data to be processed (Please find attached a spreadsheet output in which the data is organized and the worksheets of my interest are named as "Compound" and "Peak". Yellow high-lighted columns in each worksheet has the data to be processed). OK. That's long description. NOW. Is it practical to write a PERL/or any script to; 1. Enter each folder. 2. Look for the spreadsheet of interest. 3. Look for worksheets named "Compound" and "Peak". 4. Look for the specific columns of interest. 5. Copy paste the columns of interest into a new spreadsheet/text file with data from each folder next to each other. This final spreadsheet will pass through a bunch of other calculations. I apologize for this long and painful description. However, it would be great if this can be done. Thanks Jay -------------- next part -------------- A non-text attachment was scrubbed... Name: REPORT01.xls Type: application/vnd.ms-excel Size: 93696 bytes Desc: REPORT01.xls URL: From cjfields at illinois.edu Fri Dec 9 20:37:48 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 9 Dec 2011 20:37:48 +0000 Subject: [Bioperl-l] Perl parsing In-Reply-To: References: Message-ID: On Dec 7, 2011, at 10:47 PM, lumos lumos wrote: > Hello, > > I have a text file(tab-delim) with some gene names as shown below. > > *BRCA1: breast cancer 1, early onset > > TNF: tumor necrosis factor > > OMG: oligodendrocyte myelin glycoprotein* > > I would like to get the list of gene name BRCA1,TNF,OMG that is before the > colon(:) . > How do I parse in perl this text file with this list of genes? 'Very carefully?' Okay, I'll try to refrain from further sarcasm, but I'm confused, what does this have to do with BioPerl (*the toolkit*) specifically? That is what this mailing list is for. Just to note, this is a very common perl task. 
The answer is attainable by searching for it (not to mention taking the time to learn basic perl). For instance: http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings One of the many links found by simply using Google: http://lmgtfy.com/?q=perl+parse+tab+file I'll leave the regex munging to you. (okay, I failed at refraining from sarcasm, ah well it's friday). chris > Thanks in advance. > LM From jason.stajich at gmail.com Fri Dec 9 21:18:38 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Fri, 9 Dec 2011 13:18:38 -0800 Subject: [Bioperl-l] Genbank files In-Reply-To: <32941955.post@talk.nabble.com> References: <32941955.post@talk.nabble.com> Message-ID: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> $feature->add_tag_value('color','blue'); On Dec 9, 2011, at 8:52 AM, BForde wrote: > > Hello all, > > I am new to Bioperl so I apologise if this is stupid question. > > For CDS features I which to add additional qualifiers e.g. /colour and /note > qualifiers. I have looked at the BioPerl wiki but am still unsure as how to > do this? > > regards > > Brian > -- > View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From bosborne11 at verizon.net Fri Dec 9 20:31:15 2011 From: bosborne11 at verizon.net (Brian Osborne) Date: Fri, 09 Dec 2011 15:31:15 -0500 Subject: [Bioperl-l] Genbank files In-Reply-To: <32941955.post@talk.nabble.com> References: <32941955.post@talk.nabble.com> Message-ID: <3AB0716B-EC07-4BD6-9FC8-0C47A29FC0BA@verizon.net> Brian, Reasonable question. 
Start here:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

If you've never used Bioperl then:

http://www.bioperl.org/wiki/HOWTO:Beginners

Brian

On Dec 9, 2011, at 11:52 AM, BForde wrote:

>
> Hello all,
>
> I am new to Bioperl so I apologise if this is stupid question.
>
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
>
> regards
>
> Brian
> --
> View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From asjo at koldfront.dk Fri Dec 9 22:25:00 2011
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Fri, 09 Dec 2011 23:25:00 +0100
Subject: [Bioperl-l] Batch processing of Data
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <871usdpemb.fsf@topper.koldfront.dk>

On Fri, 9 Dec 2011 19:59:39 +0000, Boddu, wrote:

> Please let me know if the following is practical with PERL.

It might very well be, yes.

Modules you might be interested in include Spreadsheet::ParseExcel,
Spreadsheet::XLSX, Spreadsheet::WriteExcel and Excel::Writer::XLSX[1].

A big help in finding interesting CPAN modules is the search engine on
https://metacpan.org/

Depending on your platform and preference using find(1) might also be
helpful to traverse the folders, rather than doing so in Perl.

Note that none of this has anything to do with BioPerl as such, though,
and you'll need to do some actual programming to get the job done.

Best regards,

Adam

[1] http://blogs.perl.org/users/john_mcnamara/2011/10/spreadsheetwriteexcel-is-dead-long-live-excelwriterxlsx.html

--
"Angels can fly because they take themselves lightly."
Adam Sjøgren
asjo at koldfront.dk

From David.Messina at sbc.su.se Fri Dec 9 22:30:23 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Fri, 9 Dec 2011 23:30:23 +0100
Subject: [Bioperl-l] Batch processing of Data
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID:

Yes, it can be done. However, it has nothing to do with this mailing list.

Steps 1 and 2 are basic Perl. For steps 3 through 5, try googling "perl
parse excel".

Dave

On Fri, Dec 9, 2011 at 20:59, Boddu, Jayanand wrote:

> Hi Anyone:
> Please let me know if the following is practical with PERL.
> My data output can be described as following.
>
> 1. Hundreds of samples are run.
>
> 2. A batch output sends data from each sample to its own "folder".
> Output is in the form of few text files, spreadsheets and PDF files.
>
> 3. One of the spreadsheet has the data of most interest.
>
> 4. This means I end up having hundreds of folders.
>
> 5. The spreadsheet with the data has multiple worksheets out of
> which a couple have the interesting data to be processed (Please find
> attached a spreadsheet output in which the data is organized and the
> worksheets of my interest are named as "Compound" and "Peak". Yellow
> high-lighted columns in each worksheet has the data to be processed).
> OK. That's long description.
> NOW. Is it practical to write a PERL/or any script to;
>
> 1. Enter each folder.
>
> 2. Look for the spreadsheet of interest.
>
> 3. Look for worksheets named "Compound" and "Peak".
>
> 4. Look for the specific columns of interest.
>
> 5. Copy paste the columns of interest into a new spreadsheet/text
> file with data from each folder next to each other.
>
> This final spreadsheet will pass through a bunch of other calculations.
>
> I apologize for this long and painful description.
> However, it would be great if this can be done.
> Thanks > Jay > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From lsbrath at gmail.com Sat Dec 10 21:39:44 2011 From: lsbrath at gmail.com (Mgavi Brathwaite) Date: Sat, 10 Dec 2011 16:39:44 -0500 Subject: [Bioperl-l] Perl parsing In-Reply-To: References: Message-ID: Yes grasshopper you have to suffer a little bit. Learn Perl first, then step up to BioPerl. Chris I feel you concerning the power of Regex, and the sarcasm. Lom On Fri, Dec 9, 2011 at 3:37 PM, Fields, Christopher J wrote: > On Dec 7, 2011, at 10:47 PM, lumos lumos wrote: > > > Hello, > > > > I have a text file(tab-delim) with some gene names as shown below. > > > > *BRCA1: breast cancer 1, early onset > > > > TNF: tumor necrosis factor > > > > OMG: oligodendrocyte myelin glycoprotein* > > > > I would like to get the list of gene name BRCA1,TNF,OMG that is before > the > > colon(:) . > > How do I parse in perl this text file with this list of genes? > > 'Very carefully?' > > Okay, I'll try to refrain from further sarcasm, but I'm confused, what > does this have to do with BioPerl (*the toolkit*) specifically? That is > what this mailing list is for. > > Just to note, this is a very common perl task. The answer is attainable by > searching for it (not to mention taking the time to learn basic perl). For > instance: > > > http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings > > One of the many links found by simply using Google: > > http://lmgtfy.com/?q=perl+parse+tab+file > > I'll leave the regex munging to you. > > (okay, I failed at refraining from sarcasm, ah well it's friday). > > chris > > > > Thanks in advance. 
> > LM
> >
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From pawan.mani2 at gmail.com Mon Dec 5 22:00:09 2011
From: pawan.mani2 at gmail.com (pawan.mani2 at gmail.com)
Date: Tue, 6 Dec 2011 03:30:09 +0530
Subject: [Bioperl-l] bioperl in cygwin
Message-ID: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>

Hi
I am trying to install BioPerl. I gave the following commands in the Cygwin terminal:


perl -MCPAN -e shell

then I type

o conf prerequisites_policy follow
o conf commit
install Bundle::CPAN
install Module::Build
d /bioperl/
then you get a list of different versions.
I selected CJFIELDS/BioPerl-1.6.1.96
install CJFIELDS/BioPerl-1.6.1.96.tar.gz


but build.install was not ok.

Kindly let me know the steps for a complete BioPerl installation in Cygwin under Windows 7.

Thanks in advance.

with best regards,
Pawan

From cjfields at illinois.edu Sun Dec 11 18:22:01 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 11 Dec 2011 18:22:01 +0000
Subject: [Bioperl-l] bioperl in cygwin
In-Reply-To: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
References: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
Message-ID:

Pawan,

Hard to say what the problem is w/o supplying warnings/errors. Prior to doing this, you should try installing BioPerl-1.6.901 (the latest CPAN release). You can try a direct installation of the distribution, but the easiest way to get the latest version is to try installing Bio::Perl.

(I'm not sure what BioPerl-1.6.1.96 is, but this seems wrong)

chris

On Dec 5, 2011, at 4:00 PM, wrote:

> Hi
> I am trying to install BioPerl. I gave the following commands in the Cygwin terminal:
>
>
> perl -MCPAN -e shell
>
> then I type
>
> o conf prerequisites_policy follow
> o conf commit
> install Bundle::CPAN
> install Module::Build
> d /bioperl/
> then you get a list of different versions.
> I selected CJFIELDS/BioPerl-1.6.1.96
> install CJFIELDS/BioPerl-1.6.1.96.tar.gz
>
>
> but build.install was not ok.
>
> Kindly let me know the steps for a complete BioPerl installation in Cygwin under Windows 7.
>
> Thanks in advance.
>
> with best regards,
> Pawan
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From b.m.forde at umail.ucc.ie Tue Dec 13 11:03:50 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Tue, 13 Dec 2011 03:03:50 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32965574.post@talk.nabble.com>


Thank you for the replies.

My script (below) reads in a list of locus_tags from a tab delimited text
file. Compares these locus_tags to the locus_tags in a genbank file and
where they are equal adds new features.
the line
$feat->add_tag_value()
needs to be defined. In the bioperl wiki this variable appears to be defined
by giving it coordinates etc (creating a new feature). I wish to add
features to CDS key when the locus_tags are identical. Is this possible?

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        ADD NEW FEATURES

                    }
                }
            }
        }
    }
}

The script works down as far as the comparison point where locus_tags in the
genbankfile "Contig100.gb" are compared against a list of locus_tags from a
delimited txt file.


regards

Brian

Jason Stajich-5 wrote:
>
> $feature->add_tag_value('color','blue');
>
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>
>>
>> Hello all,
>>
>> I am new to Bioperl so I apologise if this is stupid question.
>>
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>>
>> regards
>>
>> Brian
>> --
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

--
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32965574.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From roy.chaudhuri at gmail.com Tue Dec 13 11:52:05 2011
From: roy.chaudhuri at gmail.com (Roy Chaudhuri)
Date: Tue, 13 Dec 2011 11:52:05 +0000
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32965574.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32965574.post@talk.nabble.com>
Message-ID: <4EE73C65.1080101@gmail.com>

Hi Brian,

Just to check I have understood you, you want to read through a genbank
file and add additional tags to features which are listed in a
tab-delimited file of locus tags?

Your code is on the right lines, but it would be much more efficient to
read your tab-delimited locus_tags into a hash, and check using exists,
rather than ploughing through the (potentially very long) list of locus
tags every time.
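
To make the hash idea concrete, here is a minimal sketch in core Perl only (no BioPerl needed). It assumes a made-up list of locus tags: the LSL_ names and the load_tags helper are hypothetical and are not taken from Brian's files or from any library.

```perl
use strict;
use warnings;

# Build a lookup hash from tab-delimited lines, keyed on the first column.
# chomp first, so a tag on a line with no further columns does not keep
# its trailing newline. (load_tags is a hypothetical helper name.)
sub load_tags {
    my @lines = @_;
    my %wanted;
    for my $line (@lines) {
        chomp $line;
        my ($tag) = split /\t/, $line;
        $wanted{$tag} = 1 if defined $tag && length $tag;
    }
    return \%wanted;
}

# Hypothetical input; in practice these lines would be read from the file.
my $wanted = load_tags("LSL_0001\tfoo\n", "LSL_0042\tbar\n", "LSL_0100\n");

# Each test is a single hash lookup, however long the list is.
for my $tag (qw(LSL_0042 LSL_0100 LSL_9999)) {
    printf "%s: %s\n", $tag, exists $wanted->{$tag} ? 'listed' : 'not listed';
}
```

The hash is built once, so the cost of each check no longer grows with the number of locus tags in the list, which is the point of preferring exists over a nested loop.
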
Also, be careful with new lines in your tab file (you can
safely get rid of them using "chomp"). You can miss out the "has_tag"
check by using "get_tagset_values" instead of "get_tag_values", since
the former does not complain if the tag is not present. Once you have
modified your sequence object, you need to write it out to a new file
(or STDOUT) using Bio::SeqIO.

Also, just a couple of general points, you should always "use warnings"
(or even better "use warnings FATAL=>qw(all)") since that can help solve
many problems, and your code may be easier to read if you don't include
the word "object" in all your variable names (after all you wouldn't say
you write on a paper object using a pen object).

use strict;
use warnings FATAL=>qw(all);
use Bio::SeqIO;
open (my $list, 'list') or die $!;
my %V;
while (<$list>){
    chomp;
    $V{(split(/\t/, $_))[0]}=1;
}
my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;
for my $feat_object ($seq_object->remove_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        for my $V3 ($feat_object->get_tagset_values('locus_tag')){
            if (exists $V{$V3}){
                $feat_object->add_tag_value(listed_in_tab_file=>'yes');
                next;
            }
        }
    }
    $seq_object->add_SeqFeature($feat_object);
}
Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);

Hope this helps.
Cheers,
Roy.

On 13/12/2011 11:03, BForde wrote:
>
> Thank you for the replies.
>
> My script (below) reads in a list of locus_tags from a tab delimited text
> file. Compares these locus_tags to the locus_tags in a genbank file and
> where they are equal adds new features.
> the line
> $feat->add_tag_value()
> needs to be defined. In the bioperl wiki this variable appears to be defined
> by giving it coordinates etc (creating a new feature). I wish to add
> features to CDS key when the locus_tags are identical. Is this possible?
>
> use strict;
> use Bio::SeqIO;
>
> my @V;
> open (LIST1, 'list') ||die;
> while (<LIST1>){
>     push @V, (split(/\t/, $_))[0];
> }
> close(LIST1);
>
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
>
> for my $feat_object ($seq_object->get_SeqFeatures){
>     if ($feat_object->primary_tag eq "CDS"){
>         if ($feat_object->has_tag('locus_tag')){
>             for my $V3 ($feat_object->get_tag_values('locus_tag')){
>                 for my $V1 (@V) {
>                     if ($V1 eq $V3){
>                         ADD NEW FEATURES
>
>                     }
>                 }
>             }
>         }
>     }
> }
>
> The script works down as far as the comparison point where locus_tags in the
> genbankfile "Contig100.gb" are compared against a list of locus_tags from a
> delimited txt file.
>
>
> regards
>
> Brian
>
> Jason Stajich-5 wrote:
>>
>> $feature->add_tag_value('color','blue');
>>
>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>
>>>
>>> Hello all,
>>>
>>> I am new to Bioperl so I apologise if this is stupid question.
>>>
>>> For CDS features I which to add additional qualifiers e.g. /colour and
>>> /note
>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>>> to
>>> do this?
>>>
>>> regards
>>>
>>> Brian
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> Jason Stajich >> jason.stajich at gmail.com >> jason at bioperl.org >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > From b.m.forde at umail.ucc.ie Tue Dec 13 14:22:01 2011 From: b.m.forde at umail.ucc.ie (Brian Forde) Date: Tue, 13 Dec 2011 14:22:01 +0000 Subject: [Bioperl-l] Genbank files In-Reply-To: <4EE73C65.1080101@gmail.com> References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32965574.post@talk.nabble.com> <4EE73C65.1080101@gmail.com> Message-ID: Hi Roy, Thank you. That works perfectly. I have to confess that someone else told me to use hashes but I could not get them to work.. Thanks again regards Brian On Tue, Dec 13, 2011 at 11:52 AM, Roy Chaudhuri wrote: > Hi Brian, > > Just to check I have understood you, you want to read through a genbank > file and add additional tags to features which are listed in a > tab-delimited file of locus tags? > > Your code is on the right lines, but it would be much more efficient to > read your tab-delimited locus_tags into a hash, and check using exists, > rather than ploughing through the (potentially very long) list of locus > tags every time. Also, be careful with new lines in your tab file (you can > safely get rid of them using "chomp"). You can miss out the "has_tag" check > by using "get_tagset_values" instead of "get_tag_values", since the former > does not complain if the tag is not present. Once you have modified your > sequence object, you need to write it out to a new file (or STDOUT) using > Bio::SeqIO. 
>
> Also, just a couple of general points, you should always "use warnings"
> (or even better "use warnings FATAL=>qw(all)") since that can help solve
> many problems, and your code may be easier to read if you don't include the
> word "object" in all your variable names (after all you wouldn't say you
> write on a paper object using a pen object).
>
> use strict;
> use warnings FATAL=>qw(all);
> use Bio::SeqIO;
> open (my $list, 'list') or die $!;
> my %V;
> while (<$list>){
>     chomp;
>     $V{(split(/\t/, $_))[0]}=1;
> }
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
> for my $feat_object ($seq_object->remove_SeqFeatures){
>     if ($feat_object->primary_tag eq "CDS"){
>         for my $V3 ($feat_object->get_tagset_values('locus_tag')){
>             if (exists $V{$V3}){
>                 $feat_object->add_tag_value(listed_in_tab_file=>'yes');
>                 next;
>             }
>         }
>     }
>     $seq_object->add_SeqFeature($feat_object);
> }
> Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);
>
> Hope this helps.
> Cheers,
> Roy.
>
>
> On 13/12/2011 11:03, BForde wrote:
>
>>
>> Thank you for the replies.
>>
>> My script (below) reads in a list of locus_tags from a tab delimited text
>> file. Compares these locus_tags to the locus_tags in a genbank file and
>> where they are equal adds new features.
>> the line
>> $feat->add_tag_value()
>> needs to be defined. In the bioperl wiki this variable appears to be
>> defined
>> by giving it coordinates etc (creating a new feature). I wish to add
>> features to CDS key when the locus_tags are identical. Is this possible?
>> >> use strict; >> use Bio::SeqIO; >> >> my @V; >> open (LIST1, 'list') ||die; >> while (){ >> push @V, (split(/\t/, $_))[0]; >> } >> close(LIST1); >> >> my $seqio_object = Bio::SeqIO->new(-file=>"**Contig100.gb"); >> my $seq_object = $seqio_object->next_seq; >> >> for my $feat_object ($seq_object->get_SeqFeatures)**{ >> if ($feat_object->primary_tag eq "CDS"){ >> if ($feat_object->has_tag('locus_**tag')){ >> for my $V3 ($feat_object->get_tag_values(**'locus_tag')){ >> for my $V1 (@V) { >> if ($V1 eq $V3){ >> ADD NEW FEATURES >> >> } >> } >> } >> } >> } >> } >> >> The script works down as far as the comparison point where locus_tags in >> the >> genbankfile "Contig100.gb" are compared against a list of locus_tags from >> a >> delimited txt file. >> >> >> regards >> >> Brian >> >> Jason Stajich-5 wrote: >> >>> >>> $feature->add_tag_value('**color','blue'); >>> >>> On Dec 9, 2011, at 8:52 AM, BForde wrote: >>> >>> >>>> Hello all, >>>> >>>> I am new to Bioperl so I apologise if this is stupid question. >>>> >>>> For CDS features I which to add additional qualifiers e.g. /colour and >>>> /note >>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how >>>> to >>>> do this? >>>> >>>> regards >>>> >>>> Brian >>>> -- >>>> View this message in context: >>>> http://old.nabble.com/Genbank-**files-tp32941955p32941955.html >>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >>>> >>>> ______________________________**_________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/**mailman/listinfo/bioperl-l >>>> >>> >>> Jason Stajich >>> jason.stajich at gmail.com >>> jason at bioperl.org >>> >>> >>> ______________________________**_________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/**mailman/listinfo/bioperl-l >>> >>> >>> >> > -- Brian Forde Microbiology Dept. Bioscience Institute. 
Room 4.11
University College Cork
Cork
Ireland
tel:+353 21 4901306
email: b.m.forde at umail.ucc.ie

From b.m.forde at umail.ucc.ie Mon Dec 12 17:20:53 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Mon, 12 Dec 2011 09:20:53 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32959999.post@talk.nabble.com>

Thank you for the replies.

I am unsure as to how to use the line below with my script. My script so far reads

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        # ADD NEW FEATURES
                    }
                }
            }
        }
    }
}

The script works down as far as the comparison point where locus_tags in the
genbank file "Contig100.gb" are compared against a list of locus_tags from a
delimited txt file.
If possible, could you show me how to amend my script so I can add new features?

regards

Brian

Jason Stajich-5 wrote:
>
> $feature->add_tag_value('color','blue');
>
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>
>>
>> Hello all,
>>
>> I am new to Bioperl so I apologise if this is stupid question.
>>
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>>
>> regards
>>
>> Brian
>> --
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

--
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32959999.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From Russell.Smithies at agresearch.co.nz Wed Dec 14 03:17:02 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Wed, 14 Dec 2011 16:17:02 +1300
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32959999.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com> <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com> <32959999.post@talk.nabble.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF27A@exchsth.agresearch.co.nz>

Something like this:

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        #ADD NEW FEATURES
                        $feat_object->add_tag_value('color','blue');
                    }
                }
            }
        }
    }
}

#write the new annotations
my $io = Bio::SeqIO->new(-format => "genbank", -file => ">new.gb" );
$io->write_seq($seq_object);

Take another look at http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Building_Your_Own_Sequences

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of BForde
> Sent: Tuesday, 13 December 2011 6:21
a.m. > To: Bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Genbank files > > > Than you for the replies. > > I am unsure as to how to use the line below with my script. My script so far > reads > > use strict; > use Bio::SeqIO; > > my @V; > open (LIST1, 'list') ||die; > while (){ > push @V, (split(/\t/, $_))[0]; > } > close(LIST1); > > my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb"); > my $seq_object = $seqio_object->next_seq; > > for my $feat_object ($seq_object->get_SeqFeatures){ > if ($feat_object->primary_tag eq "CDS"){ > if ($feat_object->has_tag('locus_tag')){ > for my $V3 ($feat_object->get_tag_values('locus_tag')){ > for my $V1 (@V) { > if ($V1 eq $V3){ > ADD NEW FEATURES > > } > } > } > } > } > } > > The script works down as far as the comparison point where locus_tags in the > genbankfile "Contig100.gb" are compared against a list of locus_tags from a > delimited txt file. > I possbile could you show me how to amend my script so I can add new > features > > regards > > Brian > > Jason Stajich-5 wrote: > > > > $feature->add_tag_value('color','blue'); > > > > On Dec 9, 2011, at 8:52 AM, BForde wrote: > > > >> > >> Hello all, > >> > >> I am new to Bioperl so I apologise if this is stupid question. > >> > >> For CDS features I which to add additional qualifiers e.g. /colour > >> and /note qualifiers. I have looked at the BioPerl wiki but am still > >> unsure as how to do this? > >> > >> regards > >> > >> Brian > >> -- > >> View this message in context: > >> http://old.nabble.com/Genbank-files-tp32941955p32941955.html > >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
> >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > Jason Stajich > > jason.stajich at gmail.com > > jason at bioperl.org > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > -- > View this message in context: http://old.nabble.com/Genbank-files- > tp32941955p32959999.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From l.m.timmermans at students.uu.nl Wed Dec 14 15:43:24 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Wed, 14 Dec 2011 16:43:24 +0100 Subject: [Bioperl-l] Announcing Bio::SFF Message-ID: Hi all, As already mentioned on IRC, I recently wrote a SFF parser and uploaded it to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time to write one I'd be most grateful. 
Leon From p.j.a.cock at googlemail.com Wed Dec 14 16:03:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 14 Dec 2011 16:03:05 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans wrote: > Hi all, > > As already mentioned on IRC, I recently wrote a SFF parser and uploaded it > to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF > entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time > to write one I'd be most grateful. > > Leon Hi Leon, Have you looked at the index block at all, in order to offer random access by read ID, or to access the Roche XML manifest? Please ask if you need more information about this - or if you can read Python: https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py Is this building on Miguel Pignatelli's work? I don't recall seeing any follow up posts from him after this one: http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html Peter From cjfields at illinois.edu Wed Dec 14 16:12:58 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 14 Dec 2011 16:12:58 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: , Message-ID: <3BBC1132-E768-45D9-A107-ACD51791722D@illinois.edu> Leon, Nice! Definitely a good idea to have the lower-level parser and the BioPerl-bridging code separate, one of my concerns with the various parsers we have right now which hard-wire BioPerl classes in with the parser (makes it hard for optimization). Chris PS- Peter, I don't think the two projects are related, but I suppose Leon is the best to answer that. Sent from my stupid iPad, now my laptop's on the fritz On Dec 14, 2011, at 10:04 AM, "Peter Cock" wrote: > On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans > wrote: >> Hi all, >> >> As already mentioned on IRC, I recently wrote a SFF parser and uploaded it >> to CPAN as Bio::SFF. 
I haven't written a Bio::SeqIO::sff wrapper yet (SFF >> entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time >> to write one I'd be most grateful. >> >> Leon > > Hi Leon, > > Have you looked at the index block at all, in order to offer random > access by read ID, or to access the Roche XML manifest? Please > ask if you need more information about this - or if you can read Python: > https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py > > Is this building on Miguel Pignatelli's work? I don't recall seeing > any follow up posts from him after this one: > http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html > > Peter > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From l.m.timmermans at students.uu.nl Wed Dec 14 16:27:58 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Wed, 14 Dec 2011 17:27:58 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock wrote: > Hi Leon, > > Have you looked at the index block at all, in order to offer random > access by read ID, or to access the Roche XML manifest? Please > ask if you need more information about this - or if you can read Python: > https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py > I have looked at it, but not implemented it yet. There is no standardized index, and the ones that are in common use either seem stupid (the Roche index, which is essentially just a weirdly formatted sequential list, though that should still be faster than a table scan) or undocumented (hash based index). Is this building on Miguel Pignatelli's work? I don't recall seeing > any follow up posts from him after this one: > http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html > It isn't. I like his idea for reusing BioPython's test files though. 
Leon From p.j.a.cock at googlemail.com Wed Dec 14 16:44:28 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 14 Dec 2011 16:44:28 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 4:27 PM, Leon Timmermans wrote: > On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock > wrote: >> >> Hi Leon, >> >> Have you looked at the index block at all, in order to offer random >> access by read ID, or to access the Roche XML manifest? Please >> ask if you need more information about this - or if you can read Python: >> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py > > I have looked at it, but not implemented it yet. There is no standardized > index, and the ones that are in common use either seem stupid (the Roche > index, which is essentially just a weirdly formatted sequential list, though > that should still be faster than a table scan) or undocumented (hash based > index). There are two widely used indexes, both from Roche (one with and one without an XML manifest, magic bytes .mft and .srt). They are both just a simple table of the reads names and offsets, sorted alphabetically. This works pretty well for rapid lookup for SFF files (because the read count is not so high), and is pretty easy. I don't think anyone used the hash table style indexes (.hsh), which I assume was a proof of principle or trial in the early days of SFF. One thing to check is what Ion Torrent's SFF files use. I would guess they've followed Roche, but I don't know. After all, the index structure is not defined in the SFF specification - it was left extensible on purpose. >> Is this building on Miguel Pignatelli's work? I don't recall seeing >> any follow up posts from him after this one: >> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html > > It isn't. I like his idea for reusing BioPython's test files though. Yes, please do. 
Peter From gingerplum at gmail.com Wed Dec 14 05:18:55 2011 From: gingerplum at gmail.com (plum ginger) Date: Tue, 13 Dec 2011 21:18:55 -0800 (PST) Subject: [Bioperl-l] a problem about BLAST Message-ID: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com> Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I need run BLAST on more than one sequences. However the blast outfile only store the result of last sequence. How to make the outfile store all results? Wish your help. Thanks very much! Best regards From jason.stajich at gmail.com Thu Dec 15 17:02:47 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Thu, 15 Dec 2011 11:02:47 -0600 Subject: [Bioperl-l] a problem about BLAST In-Reply-To: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com> References: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com> Message-ID: <58E5B487-7FF0-4018-B109-D6595DC2E493@gmail.com> you are probably setting the outfile in each parsing iteration -- you need to show your code if you want someone to help you debug the problem. On Dec 13, 2011, at 11:18 PM, plum ginger wrote: > Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I > need run BLAST on more than one sequences. However the blast outfile > only store the result of last sequence. How to make the outfile store > all results? > > Wish your help. Thanks very much! > > > Best regards > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason.stajich at gmail.com jason at bioperl.org From pengyu.ut at gmail.com Fri Dec 16 22:10:27 2011 From: pengyu.ut at gmail.com (Peng Yu) Date: Fri, 16 Dec 2011 16:10:27 -0600 Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment? 
Message-ID: Hi, Bio::Das::segment can give me the following warnings without stopping the whole program when the position for the query doesn't exist. I could test the return result and quit when it is []. But this would cause my program have an test whenever I call segment. I'm wondering if there is an automatic way to let Bio::Das::segment stop in such cases. --------------------- WARNING --------------------- MSG: Sequence is not dna or rna, but []. Attempting to revcom, but unsure if this is right --------------------------------------------------- -- Regards, Peng From cjfields at illinois.edu Sat Dec 17 02:48:07 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sat, 17 Dec 2011 02:48:07 +0000 Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment? In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0881CF8E@CITESMBX4.ad.uillinois.edu> Setting verbosity to 2 should convert warnings to exceptions. IIRC, set '-verbose => 2' in the Bio::Das constructor, set '$das->verbose(2)' explicitly, or set the env variable BIOPERLDEBUG=2. chris ________________________________________ From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of Peng Yu [pengyu.ut at gmail.com] Sent: Friday, December 16, 2011 4:10 PM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment? Hi, Bio::Das::segment can give me the following warnings without stopping the whole program when the position for the query doesn't exist. I could test the return result and quit when it is []. But this would cause my program have an test whenever I call segment. I'm wondering if there is an automatic way to let Bio::Das::segment stop in such cases. --------------------- WARNING --------------------- MSG: Sequence is not dna or rna, but []. 
Attempting to revcom, but unsure if this is right --------------------------------------------------- -- Regards, Peng _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From anna.fr at gmail.com Mon Dec 19 07:09:15 2011 From: anna.fr at gmail.com (Anna Friedlander) Date: Mon, 19 Dec 2011 20:09:15 +1300 Subject: [Bioperl-l] StandAloneBlastPlus blastdbcmd question Message-ID: Hi all I have a question about using blastdbcmd via Bio::Tools::Run::StandAloneBlastPlus I have some Blast+ search results that I am manipulating in a perl programme, and I would like to retrieve some sequence information for some results using subject sequence IDs, and associated subject start and end indices. If I was using blastdbcmd directly, I would do so using the -entry and -range options. My question is, can I use all the blastdbcmd options (or more specifically, just the -entry and -range options) from within the StandAloneBlastPlus module? My apologies if I don't properly understand how this "wrapper" works! Thanks in advance for your help Anna Friedlander From l.m.timmermans at students.uu.nl Mon Dec 19 14:19:14 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 15:19:14 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock wrote: > There are two widely used indexes, both from Roche (one with and > one without an XML manifest, magic bytes .mft and .srt). They are > both just a simple table of the reads names and offsets, sorted > alphabetically. Yeah, that's what I got from the BioPython code. I didn't know it was sorted though (it doesn't make much sense either, unless they wanted to do a binary search or something). This works pretty well for rapid lookup for SFF files > (because the read count is not so high), and is pretty easy. > It's implemented in Bio::SFF 0.003. 
I did restructure my code into two readers though, since doing sequential and random-access in the class didn't make much sense code-wise.

> I don't think anyone used the hash table style indexes (.hsh), which
> I assume was a proof of principle or trial in the early days of SFF.

I see, too bad.

> One thing to check is what Ion Torrent's SFF files use. I would
> guess they've followed Roche, but I don't know. After all, the
> index structure is not defined in the SFF specification - it was
> left extensible on purpose.

Yeah, we should check that too.

> Yes, please do.

It's added to 0.003. The lack of tests was bothering me, but the SFFs I had at hand were not suitable.

Leon

From p.j.a.cock at googlemail.com Mon Dec 19 14:31:18 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 14:31:18 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Mon, Dec 19, 2011 at 2:19 PM, Leon Timmermans wrote:
> On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock wrote:
>
>> There are two widely used indexes, both from Roche (one with and
>> one without an XML manifest, magic bytes .mft and .srt). They are
>> both just a simple table of the reads names and offsets, sorted
>> alphabetically.
>
> Yeah, that's what I got from the BioPython code. I didn't know it
> was sorted though (it doesn't make much sense either, unless they
> wanted to do a binary search or something).

I presume that's what Roche uses if they keep the index on disk.

The alternative is to load the index into RAM, which is really fast.
You just open the SFF, read the header, seek to the index, load
the index. Without the index, you have to scan the entire SFF file
to find each record and its offset - which is much slower.

>> This works pretty well for rapid lookup for SFF files
>> (because the read count is not so high), and is pretty easy.
>
> It's implemented in Bio::SFF 0.003.
I did restructure my code into two > readers though, since doing sequential and random-access in the class > didn't make much sense code-wise. > >> I don't think anyone used the hash table style indexes (.hsh), which >> I assume was a proof of principle or trial in the early days of SFF. > > I see, too bad. > >> One thing to check is what Ion Torrent's SFF files use. I would >> guess they've followed Roche, but I don't know. After all, the >> index structure is not defined in the SFF specification - it was >> left extensible on purpose. > > Yeah, we should check that too. I don't have any Ion Torrent data first hand, and the public samples I've seen were FASTQ not SFF. But I know a few people with Ion Torrent machines that might be able to help... > It's added to 0.003. The lack of tests was bothering me, but the > SFFs I had at hand were not suitable. Have you looked at the sample SFF data in Biopython? Please use them for the BioPerl unit tests (we're been talking about a cross project collection of test data files like this), the README file should be self-explanatory: https://github.com/biopython/biopython/tree/master/Tests/Roche Peter From p.j.a.cock at googlemail.com Mon Dec 19 15:13:53 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 15:13:53 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> References: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk> Message-ID: On Mon, Dec 19, 2011 at 3:03 PM, Adam Witney wrote: >> I don't have any Ion Torrent data first hand, and the public >> samples I've seen were FASTQ not SFF. But I know a few >> people with Ion Torrent machines that might be able to help? > > I can you let you have some Ion Torrent SFF files if it helps > > adam Hi Adam, I've just had a quick look at a file from an IonTorrent 314 chip that a colleague kindly sent me, and that SFF file had no index (but only 50k reads so this isn't so important). 
If you can send me (and Leon?) one or two original SFF files
that would be useful, even if just to confirm that Ion Torrent's
SFF files do indeed typically lack an index. If that is the case,
I may need to remove the warning message Biopython currently
prints when indexing these files:

No SFF index, doing it the slow way

Off list is fine if you'd like to keep the data private, use dropbox
or something if you don't have an FTP server.

Thanks,

Peter

From awitney at sgul.ac.uk Mon Dec 19 15:03:16 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 19 Dec 2011 15:03:16 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>

>>> One thing to check is what Ion Torrent's SFF files use. I would
>>> guess they've followed Roche, but I don't know. After all, the
>>> index structure is not defined in the SFF specification - it was
>>> left extensible on purpose.
>>
>> Yeah, we should check that too.
>
> I don't have any Ion Torrent data first hand, and the public
> samples I've seen were FASTQ not SFF. But I know a few
> people with Ion Torrent machines that might be able to help...

I can let you have some Ion Torrent SFF files if it helps

adam

From l.m.timmermans at students.uu.nl Mon Dec 19 15:48:34 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 16:48:34 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote:
> I presume that's what Roche uses if they keep the index on disk.
>
> The alternative is to load the index into RAM, which is really fast.
> You just open the SFF, read the header, seek to the index, load
> the index. Without the index, you have to scan the entire SFF file
> to find each record and its offset - which is much slower.

That's what I'm doing now. It's much faster, but it still takes a noticeable amount of time on large files.
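
For readers following the thread, the "load the index into RAM" idea discussed above can be sketched roughly as below. This is only an illustration under stated assumptions, not the Bio::SFF or Biopython implementation: the index is modelled as an already-decoded list of [read name, byte offset] pairs, the offsets are made up, and build_index and offset_for are hypothetical helper names. Decoding the real Roche index block is not shown.

```perl
use strict;
use warnings;

# Hypothetical, already-decoded index entries: [read name, byte offset].
# Real SFF index blocks need binary decoding, which is not shown here.
my @entries = (
    [ 'READ_A', 840 ],
    [ 'READ_B', 2248 ],
    [ 'READ_C', 3712 ],
);

# Load the whole index into a hash once; each later lookup is a hash access.
sub build_index {
    my @pairs = @_;
    my %offset_of = map { $_->[0] => $_->[1] } @pairs;
    return \%offset_of;
}

# Return the byte offset for a read name, or undef if it is not indexed.
sub offset_for {
    my ($index, $name) = @_;
    return exists $index->{$name} ? $index->{$name} : undef;
}

my $index  = build_index(@entries);
my $offset = offset_for($index, 'READ_B');
print "READ_B starts at byte $offset\n";
# With a real file one would then seek($fh, $offset, 0) and parse one record.
```

The cost is paid once when the hash is built, which may be part of the noticeable start-up time on large files.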
> Have you looked at the sample SFF data in Biopython? Please
> use them for the BioPerl unit tests (we've been talking about a
> cross project collection of test data files like this), the README
> file should be self-explanatory:
> https://github.com/biopython/biopython/tree/master/Tests/Roche

Yeah, I'm using those now (https://github.com/Leont/bio-sff/blob/master/t/reader.t). I must say there were some interesting corner cases in it.

Leon

From p.j.a.cock at googlemail.com Mon Dec 19 16:15:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 16:15:15 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Mon, Dec 19, 2011 at 3:48 PM, Leon Timmermans wrote:
> On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote:
>
>> Have you looked at the sample SFF data in Biopython? Please
>> use them for the BioPerl unit tests (we've been talking about a
>> cross project collection of test data files like this), the README
>> file should be self-explanatory:
>> https://github.com/biopython/biopython/tree/master/Tests/Roche
>
> Yeah, I'm using those now
> (https://github.com/Leont/bio-sff/blob/master/t/reader.t).

Could you add a link to your /corpus/README.txt file pointing
back to the Biopython original for acknowledgement and
future reference?

>
> I must say there were some interesting corner cases in it.
>

I'm glad you agree - and if you can think of any more special
cases to verify that would be great.

Are you doing just SFF parsing for now? Not writing?

Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
format name "sff" to mean the full read sequence (with mixed
case, upper case for the good sequence, lower case for any
left/right clipping - as in the Roche tools), and "sff-trim" to mean
the trimmed sequences. I would encourage you to do the
same, as part of the general aim of having consistent
sequence format names between BioPerl, Biopython, and
EMBOSS, where possible.
Peter From l.m.timmermans at students.uu.nl Mon Dec 19 16:47:41 2011 From: l.m.timmermans at students.uu.nl (Leon Timmermans) Date: Mon, 19 Dec 2011 17:47:41 +0100 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock wrote: > Could you a link to your /corpus/README.txt file pointing > back to the Biopython original for acknowledgement and > future reference? > I forgot about that, I will add it to the next release. Are you doing just SFF parsing for now? Not writing? > I haven't written the writer yet (haven't needed it so far). I'd rather release working code early instead of waiting until everything is complete. Now, as to Bio::SeqIO integration, Biopython's SeqIO uses > format name "sff" to mean the full read sequence (with mixed > case, upper case for the good sequence, lower cases for any > left/right clipping - as in the Roche tools), and "sff-trim" to mean > the trimmed sequences. I would encourage you to do the > same, as part of the general aim of having consistent > sequence format names between BioPerl, Biopython, and > EMBOSS, where possible. > I agree, consistency is good. Leon From p.j.a.cock at googlemail.com Mon Dec 19 17:00:03 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 17:00:03 +0000 Subject: [Bioperl-l] Announcing Bio::SFF In-Reply-To: References: Message-ID: On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans wrote: > On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock > wrote: >> >> Could you a link to your /corpus/README.txt file pointing >> back to the Biopython original for acknowledgement and >> future reference? > > I forgot about that, I will add it to the next release. Thanks. >> Are you doing just SFF parsing for now? Not writing? > > > I haven't written the writer yet (haven't needed it so far). I'd rather > release working code early instead of waiting until everything is complete. 
I understand - but make sure you've designed the data structures
in the parser so as to allow the original record to be re-built as SFF.

>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower case for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>
> I agree, consistency is good.

Great. I'd guess Bio::SeqIO integration would be more important
than SFF output initially.

Peter

From cjfields at illinois.edu Mon Dec 19 19:44:22 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 19 Dec 2011 19:44:22 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References: ,
Message-ID:

Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best. Barring that, a very simple class for storing data. We've found BioPerl objects/classes pretty heavy.

(for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing).

Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.

For instance, if someone wanted a faster parser, use the low level, otherwise use the higher level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and make the bp- and non-bp code in separate dists.
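As a rough illustration of that split, the sketch below has a low-level reader that returns plain hash references and a separate, optional object layer on top. It uses FASTA only to keep the example short; next_record and My::SimpleSeq are hypothetical names, not BioPerl or Bio::SFF code.

```perl
use strict;
use warnings;

# Low-level layer: read one FASTA record from a filehandle and return a
# plain hash reference, or undef at end of input. No objects are created.
sub next_record {
    my ($fh) = @_;
    local $/ = "\n>";                      # records are separated by "\n>"
    defined(my $chunk = <$fh>) or return;
    chomp $chunk;                          # drop the trailing "\n>" separator
    $chunk =~ s/^>//;                      # the first record still has its ">"
    my ($header, @lines) = split /\n/, $chunk;
    my ($id, $desc) = split ' ', $header, 2;
    return {
        id   => $id,
        desc => defined $desc ? $desc : '',
        seq  => join('', @lines),
    };
}

# Higher-level layer: a deliberately tiny class wrapping the same hash.
{
    package My::SimpleSeq;
    sub new { my ($class, $rec) = @_; return bless { %$rec }, $class }
    sub id  { $_[0]{id} }
    sub seq { $_[0]{seq} }
    sub len { length $_[0]{seq} }
}

my $fasta = ">r1 first read\nACGT\nAC\n>r2\nGGG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
while (my $rec = next_record($fh)) {
    my $obj = My::SimpleSeq->new($rec);    # objects only where they are wanted
    print join("\t", $obj->id, $obj->len), "\n";
}
```

Code that only needs IDs and sequences can stop at the hash layer, while anything that wants methods can wrap the same hash afterwards.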
Chris

Sent from my iPad

From cjfields at illinois.edu Tue Dec 20 00:28:25 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 19 Dec 2011 18:28:25 -0600
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID: <4EEFD6A9.3010303@illinois.edu>

On 12/19/2011 10:47 AM, Leon Timmermans wrote:
> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock wrote:
>
>> Could you add a link to your /corpus/README.txt file pointing
>> back to the Biopython original for acknowledgement and
>> future reference?
>>
> I forgot about that, I will add it to the next release.
>
>> Are you doing just SFF parsing for now? Not writing?
>
> I haven't written the writer yet (haven't needed it so far). I'd rather
> release working code early instead of waiting until everything is complete.
>
>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower case for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>>
> I agree, consistency is good.
>
> Leon

This is already implemented in Bio::SeqIO I believe. This is the same line of thinking with the FASTQ format, that one can have a 'format-variant' combination that (as one might guess) indicates to the parser any variation of the parser so logic within the parser can deal with it. You can also pass the '-variant => "foo"' parameter as well IIRC. You would just check the variant with the variant() method.
chris

From l.m.timmermans at students.uu.nl Tue Dec 20 15:25:13 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:25:13 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock wrote:

> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.

I did, though currently it's rather hard to make new entries from scratch.
That said, I can hardly imagine anyone wanting to do this.

> Great. I'd guess Bio::SeqIO integration would be more important
> than SFF output initially.

Probably. It looks like it's quite easy, it's just rather underdocumented.

Leon

From l.m.timmermans at students.uu.nl Tue Dec 20 15:26:11 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:26:11 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> Kinda joining this a little late, but I think if there is a way to have a
> low-level parser/writer that generically parses the data into simple
> (possibly hash-tagged) data structures, that would be best. Barring that,
> a very simple class for storing data. We've found BioPerl objects/classes
> pretty heavy.
>
> (for an example of this, see Heng Li's readfq parser on github, which has
> some stats for Fastq/fasta parsing).
>
> Any way we can separate the parser from object instantiation would enable
> us to optimize the object/class layer and parser/writer layers separately,
> with the possible nice side effect of making the parser more broadly used.
>
> For instance, if someone wanted a faster parser, use the low level,
> otherwise use the higher level (possibly BioPerl-specific) API.
> Lincoln does this to a certain degree with Bio-samtools; I would go further and
> make the bp- and non-bp code in separate dists.

A good OO system can actually help make things faster. For example, I'm
unpacking the flowspace and quality data lazily, which made scanning
through an SFF file 2.5-3 times as fast while having marginal extra costs
when you do need them.

Leon

From l.m.timmermans at students.uu.nl Tue Dec 20 15:30:54 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:30:54 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <4EEFD6A9.3010303@illinois.edu>
References: <4EEFD6A9.3010303@illinois.edu>
Message-ID:

On Tue, Dec 20, 2011 at 1:28 AM, Chris Fields wrote:

> This is already implemented in Bio::SeqIO I believe. This is the same
> line of thinking with the FASTQ format, that one can have a
> 'format-variant' combination that (as one might guess) indicates to the
> parser any variation of the parser so logic within the parser can deal with
> it. You can also pass the '-variant => "foo"' parameter as well IIRC. You
> would just check the variant with the variant() method.

Great. That makes life much easier :-)

Leon

From p.j.a.cock at googlemail.com Tue Dec 20 15:31:59 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 20 Dec 2011 15:31:59 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References:
Message-ID:

On Tue, Dec 20, 2011 at 3:25 PM, Leon Timmermans wrote:
> On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock
> wrote:
>>
>> I understand - but make sure you've designed the data structures
>> in the parser so as to allow the original record to be re-built as SFF.
>
> I did, though currently it's rather hard to make new entries from scratch.
> That said, I can hardly imagine anyone wanting to do this.

Typical use cases I've found in using the Biopython SFF code are
filtering an SFF file (taking some records only), and modifying
the clipping values.
In both cases, the user isn't creating the SFF records from scratch.

Peter

From cjfields at illinois.edu Tue Dec 20 22:40:31 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Dec 2011 22:40:31 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To:
References: ,
Message-ID:

On Dec 20, 2011, at 9:26 AM, "Leon Timmermans" wrote:

> On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J wrote:
>
>> Kinda joining this a little late, but I think if there is a way to have a
>> low-level parser/writer that generically parses the data into simple
>> (possibly hash-tagged) data structures, that would be best. Barring that,
>> a very simple class for storing data. We've found BioPerl objects/classes
>> pretty heavy.
>>
>> (for an example of this, see Heng Li's readfq parser on github, which has
>> some stats for Fastq/fasta parsing).
>>
>> Any way we can separate the parser from object instantiation would enable
>> us to optimize the object/class layer and parser/writer layers separately,
>> with the possible nice side effect of making the parser more broadly used.
>>
>> For instance, if someone wanted a faster parser, use the low level,
>> otherwise use the higher level (possibly BioPerl-specific) API. Lincoln
>> does this to a certain degree with Bio-samtools; I would go further and
>> make the bp- and non-bp code in separate dists.
>
> A good OO system can actually help make things faster. For example, I'm
> unpacking the flowspace and quality data lazily, which made scanning
> through an SFF file 2.5-3 times as fast while having marginal extra costs
> when you do need them.
>
> Leon

Yep, thinking about using the same approach for the Fastq variants.
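The lazy-unpacking approach discussed here might look roughly like the sketch below. It is not the Bio::SFF implementation: the class name My::LazyRead, the accessor names, and the packed layout (one unsigned byte per quality score) are all assumptions made for illustration.

```perl
use strict;
use warnings;

{
    package My::LazyRead;   # hypothetical class, for illustration only

    # Keep the raw packed bytes; nothing is decoded at construction time.
    sub new {
        my ($class, %args) = @_;
        return bless {
            name        => $args{name},
            raw_quality => $args{raw_quality},
        }, $class;
    }

    sub name { $_[0]{name} }

    # Decode on first access and cache the result for later calls.
    sub quality {
        my ($self) = @_;
        $self->{quality} ||= [ unpack 'C*', $self->{raw_quality} ];
        return $self->{quality};
    }
}

my $read = My::LazyRead->new(
    name        => 'read1',
    raw_quality => pack('C*', 40, 38, 12),
);
print $read->name, "\n";                       # no unpacking has happened yet
print join(',', @{ $read->quality }), "\n";    # unpacked here, then cached
```

A scan that only looks at read names never pays for the unpack call, which is the kind of saving described above; the cost is one extra hash slot per read that does get decoded.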
Chris

Sent from my ancient iPad b/c my laptop's borked

From dgacquer at ulb.ac.be Wed Dec 21 13:26:07 2011
From: dgacquer at ulb.ac.be (David Gacquer)
Date: Wed, 21 Dec 2011 14:26:07 +0100
Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta
Message-ID: <4EF1DE6F.4070508@ulb.ac.be>

Dear BioPerl users/developers,

I am facing a strange issue with the $seq_out->write_seq function when using large fasta files.

I have downloaded the hg19 chromosome 1, and applied the following code (basically I wanted to mask some regions in it but the problem also appears when copying the sequence without modifications):

sub main{
    my $seq_in = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0]);
    my $seq_out = Bio::SeqIO->new( -format => 'largefasta', -file => '>'.$ARGV[1]);
    my $seq_obj_in = $seq_in->next_seq();
    my $modified_seq = $seq_obj_in->seq();
    my $seq_obj_out = Bio::Seq::LargePrimarySeq->new(
        -seq => $modified_seq,
        -id => $seq_obj_in->id,
        -desc => $seq_obj_in->desc);
    $seq_out->write_seq($seq_obj_out);
}

When checking the output fasta file, the sequence of chr1 is 1 bp shorter. I have noticed that in the original fasta file, each line contains exactly 50 nucleotides, while the output of the $seq_out->write_seq function always contains 60 characters per line. chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1), so to verify that the very last base was missing, I created the following fasta files:

chr121.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG

chr122.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG

They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last character being a G.
When running the above code:

chr121.out.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

chr122.out.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG

The output for the 122 bp chromosome is correct (2 lines of 60 bp and the last line with 2 bp, AG) but for the 121 bp chromosome, the last character is missing (2 lines of 60 bp only, last G is missing).

When replacing -format => 'largefasta' by -format => 'fasta', or writing the output without the write_seq function, however, the problem is solved.

Am I missing something or is there a problem with the write_seq function used with large fasta files?

(I am running BioPerl on a Mac under OS X Snow Leopard)

Best regards
David

--
David Gacquer, Ph. D.
IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium
Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be

From koraydogankaya at gmail.com Sat Dec 24 08:44:43 2011
From: koraydogankaya at gmail.com (Koray)
Date: Sat, 24 Dec 2011 00:44:43 -0800 (PST)
Subject: [Bioperl-l] exons
Message-ID: <8454dd72-4fed-41c5-977e-83d9300cd68b@z25g2000vbs.googlegroups.com>

I need explicit code for getting the exon sequences of an mRNA or gene fetched by get_Seq_by_acc or get_Seq_by_id. In Ensembl this is easy, but here it is not, since many IO modules exist. For example, how can I get a $gene object like the one used below from the databases (GenBank or EntrezGene) by accession number or ID?

 Title   : exons()
 Usage   : @exons = $gene->exons();
           @inital_exons = $gene->exons('Initial');
 Function: Get all exon features or all exons of a specified type of this
           gene structure.

           Exon type is treated as a case-insensitive regular expression
           and optional. For consistency, use only the following types:
           initial, internal, terminal, utr, utr5prime, and utr3prime.
           A special and virtual type is 'coding', which refers to all types
           except utr.

           This method basically merges the exons returned by transcripts.
 Returns : An array of Bio::SeqFeature::Gene::ExonI implementing objects.
 Args    : An optional string specifying the type of exon.

From challa_ghanashyam at yahoo.com Sat Dec 24 20:09:09 2011
From: challa_ghanashyam at yahoo.com (GSC)
Date: Sat, 24 Dec 2011 12:09:09 -0800 (PST)
Subject: [Bioperl-l] retrieve description for a list of gi ids..
Message-ID: <33034438.post@talk.nabble.com>

Hi all:

I am new to Perl. I am working on a script to retrieve the record description (the name given for a sequence record in GenBank) for a list of GI IDs. The script works fine for 1000 IDs, but my list is about 250,000 IDs long and it is not working for me. Any suggestions on this?

GS

--
View this message in context: http://old.nabble.com/retrieve-description-for-a-list-of-gi-ids..-tp33034438p33034438.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From cjfields at illinois.edu Tue Dec 27 15:03:28 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 27 Dec 2011 15:03:28 +0000
Subject: [Bioperl-l] Strange behaviour in the write_seq function for large fasta
In-Reply-To: <4EF1DE6F.4070508@ulb.ac.be>
References: <4EF1DE6F.4070508@ulb.ac.be>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0BB29547@CITESMBX4.ad.uillinois.edu>

This is a strange one. Personally I haven't seen this behavior, but maybe it's OS-dependent? We'll need more information, particularly what version of BioPerl you are using, the OS, version of perl, etc.

Also, in general, to make sure we don't lose track of this issue it is best to submit a bug report:

https://redmine.open-bio.org/projects/bioperl

I'm planning on triaging bugs next week; I could take a look then.
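For a bug report like this, one way to confirm the missing base independently of Bio::SeqIO is to count residues per record in the input and output files directly. The following is only a sketch in plain Perl; count_residues is a hypothetical helper, and the two file names are assumed to be given on the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash reference mapping each FASTA record
# ID to its residue count, reading from an already opened filehandle.
sub count_residues {
    my ($fh) = @_;
    my (%count, $id);
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                 # strip newline and trailing blanks
        if ($line =~ /^>(\S+)/) {           # header line starts a new record
            $id = $1;
            $count{$id} = 0;
            next;
        }
        $count{$id} += length $line if defined $id;
    }
    return \%count;
}

# Usage (assumed): perl check_lengths.pl input.fa output.fa
if (@ARGV == 2) {
    my ($in_file, $out_file) = @ARGV;
    open my $in,  '<', $in_file  or die "Cannot open $in_file: $!";
    open my $out, '<', $out_file or die "Cannot open $out_file: $!";
    my $before = count_residues($in);
    my $after  = count_residues($out);
    for my $id (sort keys %$before) {
        my $n = defined $after->{$id} ? $after->{$id} : 0;
        printf "%s\t%d\t%d\t%s\n", $id, $before->{$id}, $n,
            $before->{$id} == $n ? 'ok' : 'MISMATCH';
    }
}
```

Because the count ignores line width, an input wrapped at 50 columns and an output wrapped at 60 should still report the same total; a MISMATCH on the 121 bp test file would reproduce the report without involving the parser.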
chris