From florent.angly at gmail.com Wed May 1 22:16:02 2013 From: florent.angly at gmail.com (Florent Angly) Date: Thu, 02 May 2013 12:16:02 +1000 Subject: [Bioperl-l] Downloading sequences in batch from Trace Archive In-Reply-To: References: Message-ID: <5181CC62.9000609@gmail.com> Maybe using EUtilities? http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook http://www.bioperl.org/wiki/HOWTO:EUtilities_Web_Service Florent On 30/04/13 06:25, shalabh sharma wrote: > Hi All, > Is there any module in Bioperl that can download sequences from > NCBI's trace archive? > > Thanks > Shalabh > From jason.stajich at gmail.com Thu May 2 01:42:55 2013 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 1 May 2013 22:42:55 -0700 Subject: [Bioperl-l] Fwd: doubt References: Message-ID: Begin forwarded message: > From: ARYA DAS > Subject: doubt > Date: May 1, 2013 10:42:21 PM PDT > To: jason at bioperl.org > > sir, > > Am using windows7 n was trying to install bio perl in it..i have > already installed active perl.5.16.3.1603 . n was followeing the > installation procedure mentioned .when i tried GUI installation .. i cant > find bioperl package when i try to search them for installation. > while using command line.. > > ppm> install PPM-Repositories > > shows error like cant find package that provides PPM repositories, > > and when i try manually ,on reaching the > perl Build test > > it says build is recognised as an internal or external file. > > please help if time permits > > regards, > arya Jason Stajich jason.stajich at gmail.com jason at bioperl.org From voldrani at gmail.com Sun May 5 00:03:38 2013 From: voldrani at gmail.com (Chris Maloney) Date: Sun, 5 May 2013 00:03:38 -0400 Subject: [Bioperl-l] Wiki work, Template:Doclink Message-ID: The module pages on the wiki could look a little better, like this one for example: http://www.bioperl.org/wiki/Module:Bio::Tools::Run::RemoteBlast. 
There used to be a bunch of extra whitespace at the top of the page, which was caused by extra line breaks in Template:Doclink, which I just removed. But, I think there are other improvements that could be made. I would like to turn this into an infobox -- which are the helpful informative tables of info on Wikipedia that appear on many articles on the upper right. That would allow us to add more links -- like to metacpan, for example. It is not completely trivial to import infoboxes into a wiki though, I just discovered. I just went through the exercise on my home wiki, and it involves importing a lot of templates from Wikipedia, and fixing up the common.css. You can see the full list of imported templates here: http://chrismaloney.org/wiki/index.php?title=Special:RecentChanges&limit=100. I don't *think* this should cause any problems, but I'm not 100% sure. On the other hand, if it does, it should be easy to roll back -- it's a wiki, after all. Does anybody have a problem if I do this? I'll wait a day for responses, and tackle this tomorrow, if no one objects. -- Chris M. From armendarez77 at hotmail.com Tue May 7 20:32:22 2013 From: armendarez77 at hotmail.com (Veronica A.) Date: Tue, 7 May 2013 17:32:22 -0700 Subject: [Bioperl-l] Bio::SeqIO doesn't write all gbk sequences from Bio::DB::GenBank Message-ID: Hello, I'm currently running Bio::Perl 1.6.1 on Ubuntu 12.04.2 LTS and have noticed a problem with Bio::SeqIO when writing genbank files using the write_seq() function; some of the files do not include an 'ORIGIN' tag or the sequence. I am using GI#s (50 at a time every 2 minutes) to retrieve genbank files via Bio::DB::GenBank. 
----------------------------------------START CODE----------------------------------
my $gb = Bio::DB::GenBank->new(-verbose=>-1);
my $seqout = Bio::SeqIO->new(-file=>">$fileName", '-format'=>'Genbank', -alphabet=>'dna', -flush=>0, -verbose=>-1);
while(@ids){
    my @batchArray = splice(@ids, 0, 50);
    my $batchArrayRef = \@batchArray;
    my $streamObj;
    my $pid = fork();
    if($pid == 0){
        eval{
            $streamObj = $gb->get_Stream_by_id($batchArrayRef);
        };
        if($@){
            print "Error: ".$@."\n";
        }
        else{
            while(my $seqObj = $streamObj->next_seq()){
                unless($seqObj->accession_number() =~ /N[A-Z]\_/){
                    #print "ID: ".$seqObj->id()."\n";
                    #print "Seq:\n".$seqObj->seq()."\n";
                    $seqout->write_seq($seqObj);
                }
            }
        }
        exit 0;
    }
}
waitpid($pid,0);
sleep(120);
----------------------------------------END CODE----------------------------------

Most of the Genbank files written to the output file have sequences, but there is a small portion that do not, even though they should. For example, JX287367, in NCBI includes an 'ORIGIN' tag and sequence and when I use the print function before writing to file, the sequence is printed to STDOUT, but the 'ORIGIN' tag and sequence are not written to the output gbk file. The following is found in the final output file:

----------------------------------START GBK-----------------------------------------
LOCUS       JX287367                 588 bp    DNA     linear   BCT 19-DEC-2012
DEFINITION  Chlamydia trachomatis strain UW-5/CX pyruvoyl-dependent arginine
            decarboxylase (aaxB) gene, complete cds.
ACCESSION   JX287367
VERSION     JX287367.1  GI:404351720
KEYWORDS    .
SOURCE      Chlamydia trachomatis
  ORGANISM  Chlamydia trachomatis
            Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae;
            Chlamydia/Chlamydophila group; Chlamydia.
REFERENCE   1  (bases 1 to 588)
  AUTHORS   Bliven,K.A., Fisher,D.J. and Maurelli,A.T.
  TITLE     Characterization of the activity and expression of arginine
            decarboxylase in human and animal Chlamydia pathogens
  JOURNAL   FEMS Microbiol. Lett. 337 (2), 140-146 (2012)
  PUBMED    23043454
REFERENCE   2  (bases 1 to 588)
  AUTHORS   Bliven,K.A., Fisher,D.J. and Maurelli,A.T.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-JUL-2012) Department of Microbiology and Immunology,
            F. Edward Hebert School of Medicine, Uniformed Services
            University of the Health Sciences, 4301 Jones Bridge Road,
            Bethesda, MD 20814, USA
FEATURES             Location/Qualifiers
     source          1..588
                     /mol_type="genomic DNA"
                     /db_xref="taxon:813"
                     /strain="UW-5/CX"
                     /organism="Chlamydia trachomatis"
                     /serovar="E"
     gene            1..588
                     /gene="aaxB"
     CDS             1..588
                     /protein_id="AFR60849.1"
                     /gene="aaxB"
                     /transl_table=11
                     /note="AaxB"
                     /db_xref="GI:404351721"
                     /codon_start=1
                     /product="pyruvoyl-dependent arginine decarboxylase"
                     /translation="MPYGTRYPTLAFHTGGVGESDDGMPPQPFETFCYDSALLQAKIE
                     NFNIVPYTSVLPKELFGNILPVDQCTKFFKHGAVLEVIMAGRGATVTDGTQAIATGVG
                     ICWGKDKNGELIGGWAAEYVEFFPTWIDDEIAESHAKMWLKKSLQHELDLRSVSKHSE
                     FQYFHNYINIRKKFGFCLTALGFLNFENVAPAVIQ"
//
----------------------------------END GBK-----------------------------------------

Can anyone tell what I am missing or why this is happening? I don't know if this has happened in earlier BioPerl versions as up until now, I usually downloaded sequences straight from NCBI, but that became too time consuming....but this seems to be as well :S

Thank you in advance for any help,
Veronica

From cjfields at illinois.edu  Tue May  7 22:17:43 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Wed, 8 May 2013 02:17:43 +0000
Subject: [Bioperl-l] Bio::SeqIO doesn't write all gbk sequences from Bio::DB::GenBank
In-Reply-To:
References:
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E166A0@CHIMBX5.ad.uillinois.edu>

Veronica,

Your mail may have garbled the script and example file. Can you paste these in a gist?

https://gist.github.com/

chris

On May 7, 2013, at 7:32 PM, Veronica A.
wrote: > Hello, > I'm currently running Bio::Perl 1.6.1 on Ubuntu 12.04.2 LTS and have noticed a problem with Bio::SeqIO when writing genbank files using the write_seq() function; some of the files do not include an 'ORIGIN' tag or the sequence. > > I am using GI#s (50 at a time every 2 minutes) to retrieve genbank files via Bio::DB::GenBank. > > ----------------------------------------START CODE---------------------------------- > my $gb = Bio::DB::GenBank->new(-verbose=>-1);my $seqout = Bio::SeqIO->new(-file=>">$fileName", '-format'=>'Genbank', -alphabet=>'dna', -flush=>0, -verbose=>-1);while(@ids){ my @batchArray = splice(@ids, 0, 50); my $batchArrayRef = \@batchArray; > my $streamObj; my $pid = fork(); if($pid == 0){ eval{ $streamObj = $gb->get_Stream_by_id($batchArrayRef); }; if($@){ print "Error: ".$@."\n"; } else{ while(my $seqObj = $streamObj->next_seq()){ unless($seqObj->accession_number() =~ /N[A-Z]\_/){ #print "ID: ".$seqObj->id()."\n"; #print "Seq:\n".$seqObj->seq()."\n"; $seqout->write_seq($seqObj); } } } exit 0; }}waitpid($pid,0);sleep(120); > ----------------------------------------END CODE---------------------------------- > Most of the Genbank files written to the output file have sequences, but there is a small portion that do not, even though they should. For example, JX287367, in NCBI includes an 'ORIGIN' tag and sequence and when I use the print function before writing to file, the sequence is printed to STDOUT, but the 'ORIGIN' tag and sequence are not written to the output gbk file. 
The following is found in the final output file: > ----------------------------------START GBK----------------------------------------- > LOCUS JX287367 588 bp DNA linear BCT 19-DEC-2012DEFINITION Chlamydia trachomatis strain UW-5/CX pyruvoyl-dependent arginine decarboxylase (aaxB) gene, complete cds.ACCESSION JX287367VERSION JX287367.1 GI:404351720KEYWORDS .SOURCE Chlamydia trachomatis ORGANISM Chlamydia trachomatis Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae; Chlamydia/Chlamydophila group; Chlamydia.REFERENCE 1 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Characterization of the activity and expression of arginine decarboxylase in human and animal Chlamydia pathogens JOURNAL FEMS Microbiol. Lett. 337 (2), 140-146 (2012) PUBMED 23043454REFERENCE 2 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Direct Submission JOURNAL Submitted (06-JUL-2012) Department of Microbiology and Immunology, F. Edward Hebert School of Med! > icine, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, MD 20814, USAFEATURES Location/Qualifiers source 1..588 /mol_type="genomic DNA" /db_xref="taxon:813" /strain="UW-5/CX" /organism="Chlamydia trachomatis" /serovar="E" gene 1..588 /gene="aaxB" CDS 1..588 /protein_id="AFR60849.1" /gene="aaxB" /transl_table=11 /note="AaxB" /db_xref="GI:404351721" /codon_start=1 /product="pyruvoyl-dependent arginine decarboxylase" /translation="MPYGTRYPTLAFHTGGVGESDDGMPPQPFETFCYDSALLQAKIE NFNIVPYTSVLPKELFGNILPVDQCTKFFKHGAVLEVIMAGRGATVTDGTQAIATGVG ICWGKDKNGELIGGW! > AAEYVEFFPTWIDDEIAESHAKMWLKKSLQHELDLRSVSKHSE FQYFHN > YINIRKKFGFCLTALGFLNFENVAPAVIQ" > // > ----------------------------------END GBK----------------------------------------- > Can anyone tell what I am missing or why this is happening? 
I don't know if this has happened in earlier BioPerl versions as up until now, I usually downloaded sequences straight from NCBI, but that became too time consuming....but this seems to be as well :S > Thank you in advance for any help, > Veronica > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From witch.of.agnessi at gmail.com Wed May 8 15:24:53 2013 From: witch.of.agnessi at gmail.com (WoA) Date: Wed, 8 May 2013 12:24:53 -0700 (PDT) Subject: [Bioperl-l] Extracting matching subsequence from pairwise alignment Message-ID: <1368041092972-16935.post@n3.nabble.com> Hello All, I've a pairwise global alignemnet of two DNA sequences generated by the program NEEDLE of EMBOSS package. I wish to extract the sub-sequence that matches/aligns to a given region of the other sequence. In this alignment (Pastebin Link) the given region (actually the CDS) falls between base number 24:485 in the original sequence with ID 'XM_001005073.' I wish to extract the sub-sequence in the sequence ID 'Homolog' that aligns with that 24:485 region of the other sequence. I'm using Bioperl to parse the alignment. I find out the the alignment column numbers corresponding to 24:485 region in the particular sequence, using 'column_from_residue_number'. Then I extract the sub-sequence from the 'aligned' other sequence(containing gaps) using the corresponding column numbers. Finally I remove the gap characters. Am I doing this thing correctly and are there any pitfalls ? Is there any better way to do it by (Bio)Perl/Python? 
The code goes here:

use strict;
use warnings;
use Bio::AlignIO;

# read in an alignment generated by the EMBOSS program Needle
my $in = Bio::AlignIO->new(-format => 'emboss', -file => 'test_needle.aln');
while( my $aln = $in->next_aln ) {
    #Seqnames: 'XM_001005073.'(CDS:24-485),'Homolog'
    my ($cds_start,$cds_end)=(24,485);
    my $col_cdsstart = $aln->column_from_residue_number( 'XM_001005073.', $cds_start);
    my $col_cdsend = $aln->column_from_residue_number( 'XM_001005073.', $cds_end);
    foreach my $seq ($aln->each_seq) {
        if($seq->id() eq 'Homolog'){
            my $homolog_cds=$seq->subseq($col_cdsstart,$col_cdsend);
            $homolog_cds=~s/\-//g;
            print $homolog_cds,"\n";
        }
    }
}

--
View this message in context: http://bioperl.996286.n3.nabble.com/Extracting-matching-subsequence-from-pairwise-alignment-tp16935.html
Sent from the Bioperl-L mailing list archive at Nabble.com.

From hlapp at drycafe.net  Wed May 15 16:44:07 2013
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Wed, 15 May 2013 16:44:07 -0400
Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences
Message-ID:

FYI, if you haven't seen this yet:

http://wssspe.researchcomputing.org.uk/

It seems to me that the Bio* projects, perhaps led by BioPerl as the oldest and thus longest running (nowadays more fancily called "sustained") of them, would have a lot to say about the subject. Anyone interested in a joint submission?

Also, I notice that Biojava's Andreas is on the organizing committee, so maybe he's been conspiring on something already :-)

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From carandraug+dev at gmail.com  Wed May 15 21:53:55 2013
From: carandraug+dev at gmail.com (Carnë Draug)
Date: Thu, 16 May 2013 02:53:55 +0100
Subject: [Bioperl-l] sets of sequences - how to read?
Message-ID:

Hi

when accessing entrez gene using eutils to get multiple genes, NCBI now returns an Entrezgene-Set[1] rather than a list of EntrezGene. This change must have happened sometime in the last 2 months. Compare:

use Bio::DB::EUtilities;

my %sets = (
    eutil   => 'efetch',
    db      => 'gene',
    retmode => 'text',
    rettype => 'asn1',
    email   => 'bioperl-l at lists.open-bio.org',
);

## this mimics the previous behaviour of the NCBI server but the
## multiple requests will annoy their servers
my @ids = (3014, 85235);
my $response;
foreach (@ids) {
    my $fetcher = Bio::DB::EUtilities->new(%sets, id => $_);
    $response .= $fetcher->get_Response->content;
}
print $response;

## this used to be the right way to do it, but now returns an Entrezgene-Set
my $fetcher = Bio::DB::EUtilities->new(%sets, id => \@ids);
$response .= $fetcher->get_Response->content;
print $fetcher->get_Response->content;

There is no module to read these Entrezgene-Set in Perl at the moment, since Bio::ASN1::EntrezGene is not able to handle them. I have contacted the module author and sent him a fix[2] and he said he'll try to look into it next week.

However, even with the fix there is another problem. How would one access a set of sequences using the Bio::SeqIO API? There is no method to do that. One could say to ignore the sets, and make next_seq return the next sequence of the set. But then we are losing data. After all, it's perfectly viable to have multiple Entrezgene-Set in one file. What would be the right way to do this?

Carnë
[1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html
[2] https://github.com/carandraug/bio-asn1-entrezgene/commit/69d505056d8b7897df6271ffb7a5f39d58873c6b

From cjfields at illinois.edu  Thu May 16 00:43:22 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 16 May 2013 04:43:22 +0000
Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences
In-Reply-To:
References:
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu>

Jason and I have discussed looking into opportunities like this; I think it makes sense to try a joint submission.

chris

On May 15, 2013, at 3:44 PM, Hilmar Lapp wrote:

> FYI, if you haven't seen this yet:
>
> http://wssspe.researchcomputing.org.uk/
>
> It seems to me that the Bio* projects, perhaps led by BioPerl as the oldest and thus longest running (nowadays more fancily called "sustained") of them would have a lot to say about the subject. Anyone interested in a joint submission?
>
> Also, I notice that Biojava's Andreas is on the organizing committee, so maybe he's been conspiring on something already :-)
>
> -hilmar
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> ===========================================================
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From p.j.a.cock at googlemail.com  Thu May 16 05:10:25 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 16 May 2013 10:10:25 +0100
Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu>
References: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu>
Message-ID:

On Thu, May 16, 2013 at 5:43 AM, Fields, Christopher J wrote:
> Jason and I have discussed looking into opportunities like this; I think it makes
> sense to try a joint submission.
>
> chris

This sounds like a good idea, although given the time and place I am unlikely to be able to attend in person:

First Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE)
(to be held in conjunction with SC13, Sunday, 17 November 2013, Denver, CO, USA)
http://wssspe.researchcomputing.org.uk/

Rather than trying to discuss this over four mailing lists, should we switch to the cross-project list open-bio-l, or continue off-list?
http://lists.open-bio.org/mailman/listinfo/open-bio-l

Thanks,

Peter

From miquel.ramia at uab.cat  Thu May 16 06:42:29 2013
From: miquel.ramia at uab.cat (Miquel Ràmia)
Date: Thu, 16 May 2013 12:42:29 +0200
Subject: [Bioperl-l] Problem compiling Bio::DB::Sam
Message-ID: <5194B815.2010401@uab.cat>

Hi all,

I get this message when compiling Bio::DB::Sam:

Building Bio-SamTools
gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy':
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join'
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt':
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create'
collect2: ld returned 1 exit status
make: *** [bam2bedgraph] Error 1

Is this error related to the module or some dependencies? or maybe a problem with my system?

Any help appreciated, thanks!

--
Miquel Ràmia Jesús
PhD. candidate (PIF)
Evolutionary Bioinformatics Group
(Genomics, Bioinformatics and Evolution Group)
Lab MRB/014 - 93 586 89 58
MRB - Institut de Biologia i Biomedicina (IBB)
Universitat Autònoma de Barcelona (UAB)
08193, Cerdanyola del Vallès
Barcelona (Spain)

From cjfields at illinois.edu  Thu May 16 09:12:40 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 16 May 2013 13:12:40 +0000
Subject: [Bioperl-l] Problem compiling Bio::DB::Sam
In-Reply-To: <5194B815.2010401@uab.cat>
References: <5194B815.2010401@uab.cat>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu>

It may be due to the new samtools release (v 0.1.19). I know Heng Li has been working on the code over the last year for threading support (notice the undefined functions). Have you tried v 0.1.18?
chris On May 16, 2013, at 5:42 AM, Miquel R?mia wrote: > Hi all, > > I get this message when compiling Bio::DB::Sam: > > Building Bio-SamTools > > gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' > > collect2: ld returned 1 exit status > > make: *** [bam2bedgraph] Error 1 > > > Is this error related to the module or some dependencies? or maybe a problem with my system? > > Any help appreciated, thanks! > > > -- > Miquel R?mia Jes?s > PhD. candidate (PIF) > Evolutionary Bioinformatics Group > (Genomics, Bioinformatics and Evolution Group) > Lab MRB/014 - 93 586 89 58 > MRB - Institut de Biologia i Biomedicina (IBB) > Universitat Aut?noma de Barcelona (UAB) > 08193, Cerdanyola del Vall?s > Barcelona (Spain) > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu May 16 09:09:45 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 16 May 2013 13:09:45 +0000 Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1FBCF@CHIMBX5.ad.uillinois.edu> Yes, though we need to make sure others (e.g. those not subscribed to open-bio-l) are in the loop. November is a possibility for me. 
chris On May 16, 2013, at 4:10 AM, Peter Cock wrote: > On Thu, May 16, 2013 at 5:43 AM, Fields, Christopher J > wrote: >> Jason and I have discussed looking into opportunity's like this, I think it makes >> sense to try a joint submission. >> >> chris > > This sounds like a good idea, although given the time and place I am > unlikely to be able to attend in person: > > First Workshop on Sustainable Software for Science: Practice and > Experiences (WSSSPE) > (to held in conjunction with SC13, Sunday, 17 November 2013, Denver, CO, USA) > http://wssspe.researchcomputing.org.uk/ > > Rather than trying to discuss this over four mailing lists should we switch > to the cross project list open-bio-l, or continue off-list? > http://lists.open-bio.org/mailman/listinfo/open-bio-l > > Thanks, > > Peter From andreas at sdsc.edu Thu May 16 00:31:34 2013 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 15 May 2013 21:31:34 -0700 Subject: [Bioperl-l] [Biojava-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: References: Message-ID: Thanks Hilmar, you were faster than me in sending this out.. You are right, it would be very interesting to hear what some of the long running open-bio projects have to say on the topic of sustainability. Let me know if anybody is interested in a submission! Andreas On Wed, May 15, 2013 at 1:44 PM, Hilmar Lapp wrote: > FYI, if you haven't seen this yet: > > http://wssspe.researchcomputing.org.uk/ > > It seems to me that the Bio* projects, perhaps led by BioPerl as the > oldest and thus longest running (nowadays more fancily called "sustained") > of them would have a lot to say about the subject. Anyone interested in a > joint submission? 
> > Also, I notice that Biojava's Andreas is on the organizing committee, so > maybe he's been conspiring on something already :-) > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > =========================================================== > > > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From cjfields at illinois.edu Fri May 17 00:08:04 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 04:08:04 +0000 Subject: [Bioperl-l] sets of sequences - how to read? In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now. My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1. chris On May 15, 2013, at 8:53 PM, Carn? Draug wrote: > Hi > > when accessing entrez gene using eutils to get multiple genes, NCBI > now returns an Entrezgene-Set[1] rather than a list of EntrezGene. > This change must have happened sometime on the last 2 months. 
Compare: > > use Bio::DB::EUtilities; > > my %sets = ( > eutil => 'efetch', > db => 'gene', > retmode => 'text', > rettype => 'asn1', > email => 'bioperl-l at lists.open-bio.org', > ); > > ## this mimics the previous behaviour of the NCBI server but the > multiple requests will annoy their servers > my @ids = (3014, 85235); > my $response; > foreach (@ids) { > my $fetcher = Bio::DB::EUtilities->new(%sets, id => $_); > $response .= $fetcher->get_Response->content; > } > print $fetcher->get_Response->content; > > ## this used to be the right way to do it, but now returns an Entrezgene-Set > my $fetcher = Bio::DB::EUtilities->new(%sets, id => \@ids); > $response .= $fetcher->get_Response->content; > print $fetcher->get_Response->content; > > There is no module to read these Entrezgene-Set in Perl at the moment, > since Bio::ASN1::EntrezGene; is not able to handle them. I have > contacted the module author and set him a fix[2] and he said he'll try > to look into it next week. > > However, even with the fix there is another problem. How would one > access a set of sequences using the Bio::SeqIO API? There is no method > to do that. One could say, to ignore them, and make next_seq return > the next sequence of the set. But then we are losing data. After all, > it's perfectly viable to have multiple Entrezgene-Set in one file. > What would be the right way to do this? > > Carn? 
> > [1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html > [2] https://github.com/carandraug/bio-asn1-entrezgene/commit/69d505056d8b7897df6271ffb7a5f39d58873c6b > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Fri May 17 00:16:12 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 04:16:12 +0000 Subject: [Bioperl-l] Problem compiling Bio::DB::Sam In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu> References: <5194B815.2010401@uab.cat> <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu> For the record, this is now fixed in the latest Bio::Samtools (via Lincoln). chris On May 16, 2013, at 8:12 AM, "Fields, Christopher J" wrote: > It may be due to the new samtools release (v 0.1.19). I know Heng Li has been working on the code over the last year for threading support (notice the undefined functions). Have you tried v 0.1.18? 
> > chris > > On May 16, 2013, at 5:42 AM, Miquel R?mia wrote: > >> Hi all, >> >> I get this message when compiling Bio::DB::Sam: >> >> Building Bio-SamTools >> >> gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' >> >> collect2: ld returned 1 exit status >> >> make: *** [bam2bedgraph] Error 1 >> >> >> Is this error related to the module or some dependencies? or maybe a problem with my system? >> >> Any help appreciated, thanks! >> >> >> -- >> Miquel R?mia Jes?s >> PhD. candidate (PIF) >> Evolutionary Bioinformatics Group >> (Genomics, Bioinformatics and Evolution Group) >> Lab MRB/014 - 93 586 89 58 >> MRB - Institut de Biologia i Biomedicina (IBB) >> Universitat Aut?noma de Barcelona (UAB) >> 08193, Cerdanyola del Vall?s >> Barcelona (Spain) >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From carandraug+dev at gmail.com Fri May 17 01:12:24 2013 From: carandraug+dev at gmail.com (=?ISO-8859-1?Q?Carn=EB_Draug?=) Date: Fri, 17 May 2013 06:12:24 +0100 Subject: [Bioperl-l] sets of sequences - how to read? 
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu>
References: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu>
Message-ID:

On 17 May 2013 05:08, Fields, Christopher J wrote:
> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now.
>
> My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1.

:s I'm not sure I understood your suggestion. I think the problem is just the introduction of a new concept, a "set" of stuff (genes in this case), and how SeqIO should handle multiple sets.

Carnë

From shalabh.sharma7 at gmail.com  Fri May 17 10:54:55 2013
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Fri, 17 May 2013 10:54:55 -0400
Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files
Message-ID:

Hi,

First of all, I am really sorry for sending this mail here; I am not sure if this is the right forum. I know a lot of people work on similar stuff. I wrote to NCBI but nobody replied.

Actually I am looking for all bacterial/microbial gene annotation nucleotide fasta files. Does anyone know where to download these kinds of files? I tried the *ffn files but they are not annotated. Or is there any module in BioPerl that I can use? I would really appreciate your help.

Thanks
Shalabh

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From fossandonc at hotmail.com Fri May 17 11:59:04 2013
From: fossandonc at hotmail.com (Francisco J. Ossandón)
Date: Fri, 17 May 2013 11:59:04 -0400
Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files
In-Reply-To:
References:
Message-ID:

Hi,
You can get the annotations from here:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/

The ".ffn" files are the gene nucleotide fasta files, but they do not show the product name. The ".faa" files are the gene amino acid fasta files and do show the product name. If you want both product and nucleotide sequence, it is much better to use the GenBank ".gbk" files, which contain the complete data. You can parse them easily using BioPerl to obtain all genes, and then print the /protein_id, /product, and the nucleotide sequences to a new fasta file. Check these to see how to do it:
http://www.bioperl.org/wiki/HOWTO:SeqIO
http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Cheers,

Francisco J. Ossandon

From shalabh.sharma7 at gmail.com Fri May 17 12:26:26 2013
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Fri, 17 May 2013 12:26:26 -0400
Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files
In-Reply-To:
References:
Message-ID:

Hey Francisco,
Thanks a lot. Basically I just wanted gene nucleotide fasta files with GI numbers.
I think I will have to parse them from the gbk files.

-Shalabh

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From cjfields at illinois.edu Fri May 17 13:37:53 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 17 May 2013 17:37:53 +0000
Subject: [Bioperl-l] sets of sequences - how to read?
In-Reply-To:
References: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E22218@CHIMBX5.ad.uillinois.edu>

On May 17, 2013, at 12:12 AM, Carnë Draug wrote:

> On 17 May 2013 05:08, Fields, Christopher J wrote:
>> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now.
>>
>> My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1.
>
> :s I'm not sure I understood your suggestion. I think the problem is
> just the introduction of a new concept, a "set" of stuff (genes in
> this case), and how should SeqIO handle multiple sets.
>
> Carnë

(note: the critical point is whether Bio::ASN1::Entrezgene would allow this; I'm not sure it would. Otherwise this is all really hand-wavy)

To me a 'set of stuff', particularly when the 'stuff' is stored sequentially in a flat file, is a simple 'database' or 'store' of similar items, where the class allows one to look up particular members in the set, but could also store higher-level information about the set as a whole if needed. If it were me, I would implement a method particular to Bio::SeqIO::entrezgene that specifically creates and returns this ( next_geneset(), for instance ); next_seq() could then be implemented to iterate through the items in that database/store.

Two useful things come out of this. First, if the data for the Entrez Gene file/chunk are parsed to store offsets per ID, one would only need to parse out the chunks needed (offset of ID to next offset), then pass that into the parser and create objects on the fly. This would probably be as fast or faster than (for instance) the greedy method of parsing the entire file and storing everything in objects up-front, then iterating through those objects one at a time, which I think is current behavior. Second: if an index is created, the upfront cost is already paid (you could reuse the same index when parsing the same data).

An analogous example might be storing all FASTQ data in a sequencing run; I don't want to expend the effort to parse all the FASTQ data, but I may want to run operations on individual items in the set as well as store additional information about the data (barcodes per run, lanes, overall quality stats, etc).

Does that make sense? The pieces for this are lying around (Bio::Index::*, for instance, has methods for indexing flat files, as do classes like Bio::DB::Fasta).
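
The offset-per-ID idea described above can be sketched in plain Perl, without any BioPerl dependency. This is only an illustration under stated assumptions: the record layout (an "ID" line, then a "//" terminator) and the names build_offset_index and fetch_record are made up for the example. It is not the Entrez Gene ASN.1 format, and it is not how Bio::Index::* or Bio::DB::Fasta are implemented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record layout (NOT the Entrez Gene ASN.1 format): each record
# starts with a line "ID <identifier>" and ends with a line starting "//".

# Scan the file once and remember the byte offset where each record starts.
sub build_offset_index {
    my ($fh) = @_;
    my %offset_for;
    seek($fh, 0, 0) or die "seek failed: $!";
    my $pos = tell($fh);
    while (my $line = <$fh>) {
        $offset_for{$1} = $pos if $line =~ /^ID\s+(\S+)/;
        $pos = tell($fh);    # offset of the line that will be read next
    }
    return \%offset_for;
}

# Jump straight to one record and read only that chunk of the file.
sub fetch_record {
    my ($fh, $index, $id) = @_;
    my $offset = $index->{$id};
    return undef unless defined $offset;
    seek($fh, $offset, 0) or die "seek failed: $!";
    my $record = '';
    while (my $line = <$fh>) {
        $record .= $line;
        last if $line =~ m{^//};
    }
    return $record;
}

# Demo on an in-memory file so the sketch is self-contained.
my $data = "ID gene1\nSEQ ACGT\n//\nID gene2\nSEQ GGCC\n//\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $index = build_offset_index($fh);
print fetch_record($fh, $index, 'gene2');
```

Only the requested chunk is read after the one-time scan, and the index hash could be written to disk and reused, which is the "upfront cost is already paid" point. A real parser would hand the fetched chunk to the format-specific object builder instead of printing it.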

chris

From shalabh.sharma7 at gmail.com Sun May 19 15:33:16 2013
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Sun, 19 May 2013 15:33:16 -0400
Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz>
References: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz>
Message-ID:

Thanks Russell,
Actually I wanted all the bacterial gene nucleotide files, so I parsed them from the *gbk files. But yes, these files might help me with other parts of my work.

Thanks
Shalabh

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From Russell.Smithies at agresearch.co.nz Sun May 19 15:26:35 2013
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon, 20 May 2013 07:26:35 +1200
Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files
In-Reply-To:
References:
Message-ID: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz>

Another option that I've used before is to download the gene2accession, gene2refseq, and gene_info files from here
ftp://ftp.ncbi.nih.gov/gene/DATA/ then parse out the data you require.
It might work for you?

--Russell

From miquel.ramia at uab.cat Tue May 21 11:08:18 2013
From: miquel.ramia at uab.cat (Miquel Ràmia)
Date: Tue, 21 May 2013 17:08:18 +0200
Subject: [Bioperl-l] Problem compiling Bio::DB::Sam
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu>
References: <5194B815.2010401@uab.cat> <"118F034CF4C3EF48A96F86CE585B94BF74E1F C2C"@CHIMBX5.ad.uillinois.edu> <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu>
Message-ID: <519B8DE2.2070308@uab.cat>

On 17/05/13 06:16, Fields, Christopher J wrote:
> For the record, this is now fixed in the latest Bio::Samtools (via Lincoln).
>
> chris

Compiled correctly! Thank you.

--
Miquel Ràmia Jesús
PhD. candidate (PIF)
Evolutionary Bioinformatics Group (Genomics, Bioinformatics and Evolution Group)
Lab MRB/014 - 93 586 89 58
MRB - Institut de Biologia i Biomedicina (IBB)
Universitat Autònoma de Barcelona (UAB)
08193, Cerdanyola del Vallès
Barcelona (Spain)

From b.l.cohen_home at btinternet.com Mon May 20 14:49:50 2013
From: b.l.cohen_home at btinternet.com (Bernard Cohen)
Date: Mon, 20 May 2013 19:49:50 +0100 (BST)
Subject: [Bioperl-l] Phylip format error
Message-ID: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>

Hello!

I happen to have checked to see what the PERL webpage says about Phylip format for DNA alignment files and see that it is erroneous.

I am not a PERL user and do not want to be bothered to register or otherwise learn how to make an official comment, so forward this for someone to pick up.

Phylip format allows up to 10 spaces for taxon names; the data must start in the 11th space.
This can be checked on Jo Felsenstein's site.

The PERL page accessed by searching for "Phylip format PERL" allows only 8 spaces for the name.

B. L. Cohen

From senanu.pearson at gmail.com Wed May 22 16:15:24 2013
From: senanu.pearson at gmail.com (Senanu)
Date: Wed, 22 May 2013 13:15:24 -0700
Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment
Message-ID: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com>

Hi all,

I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong.

I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus. (I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments.

Is this a known problem? Is there another way to generate such a consensus?

    my $in = Bio::AlignIO->new(-file   => $files[0],
                               -format => 'XMFA');
    while (my $aln = $in->next_aln()) {
        foreach my $seq ($aln->each_seq) {
            $seq->alphabet('dna');
        }
        my $con = $aln->consensus_iupac();
    }

Thanks in advance.
Ngwenyama

From cjfields at illinois.edu Wed May 22 19:17:50 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Wed, 22 May 2013 23:17:50 +0000
Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment
In-Reply-To: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com>
References: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E2A01D@CHIMBX5.ad.uillinois.edu>

On May 22, 2013, at 3:15 PM, Senanu wrote:

> Hi all,
>
> I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong.

Probably the former, but...

> I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus. (I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments.

It shouldn't take that long, 7 Mb isn't that large. Or is that 7 Mb for one genome?

> Is this a known problem? Is there another way to generate such a consensus?

The code isn't really optimized for this, but again this isn't terribly large. Is the bottleneck reading the alignment in, or is it the consensus_iupac() step? Hard to say w/o seeing the alignment data itself.

chris

From alexeymorozov1991 at gmail.com Thu May 23 03:22:13 2013
From: alexeymorozov1991 at gmail.com (Alexey Morozov)
Date: Thu, 23 May 2013 16:22:13 +0900
Subject: [Bioperl-l] Phylip format error
In-Reply-To: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
Message-ID:

Which is also worsened by the fact that there is relaxed phylip format, which allows up to 250 chars for taxon name. They are separated from a sequence by single space, which creates problems if names were extended to 10 chars in strict Felsenstein's format by whitespaces. On the whole, phylip is as messily defined format as one can make from a plain textfile with information content of fasta.
Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed phylip and how does it tell dialects from one another. Even if code support is OK, it may be worthwhile to explain it somewhere at bioperl.org

--
Alexey Morozov,
LIN SB RAS, bioinformatics group.
Irkutsk, Russia.

From p.j.a.cock at googlemail.com Thu May 23 04:30:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 May 2013 09:30:21 +0100
Subject: [Bioperl-l] Phylip format error
In-Reply-To:
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
Message-ID:

On Thu, May 23, 2013 at 8:22 AM, Alexey Morozov wrote:
> Which is also worsened by the fact that there is relaxed phylip format,
> which allows up to 250 chars for taxon name. They are separated from a
> sequence by single space, which creates problems if names were extended to
> 10 chars in strict Felsenstein's format by whitespaces. On the whole,
> phylip is as messily defined format as one can make from a plain textfile
> with information content of fasta.
> Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed
> phylip and how does it tell dialects from one another. Even if code support
> is OK, it may be worthwhile to explain it somewhere at bioperl.org

Biopython's AlignIO defines both a (strict) "phylip" and "relaxed-phylip" as two separate formats (or variants, like the "fastq" variants). Doing the same in BioPerl would seem sensible since auto-detection is not easy.

http://biopython.org/wiki/AlignIO#File_Formats

Peter

P.S. Where does that 250 characters for the taxon name limit come from? The trouble with relaxed phylip is that some tools are more relaxed than others ;)

From awitney at sgul.ac.uk Thu May 23 04:43:15 2013
From: awitney at sgul.ac.uk (Adam Witney)
Date: Thu, 23 May 2013 09:43:15 +0100
Subject: [Bioperl-l] Phylip format error
In-Reply-To:
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
Message-ID: <519DD6A3.8090304@sgul.ac.uk>

Not sure if there is an actual question in these messages, but BioPerl can be used to generate valid Phylip format and run, like this:

    ## Build Align object
    my $aln = Bio::SimpleAlign->new(-seqs => $seqs);

    ## swap the taxa names with 8 characters long unique IDs
    my ($aln_safe, $ref_name) = $aln->set_displayname_safe(8);

    ## Write out phylip format infile
    Bio::AlignIO->new(-file => '>infile.out', -format => 'phylip', -interleaved => 0)->write_aln($aln);

    ## run PHYLIP's pars program
    my @params = (idlength => 10);  #, jumble=>"17,10");
    my $tree_factory = Bio::Tools::Run::Phylo::Phylip::Pars->new(@params);
    $tree_factory->quiet(1);  # Suppress pars messages to terminal
    my $tree = $tree_factory->create_tree($aln_safe);

    ## fix the node labels back
    my @nodes = sort { defined $a->id && defined $b->id && $a->id cmp $b->id } $tree->get_nodes();
    foreach my $nd (@nodes) {
        if ( $nd->is_Leaf ) {
            $nd->id($ref_name->{$nd->id_output});
        }
    }

HTH

Adam

From cjfields at illinois.edu Thu May 23 09:48:31 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 23 May 2013 13:48:31 +0000
Subject: [Bioperl-l] Phylip format error
In-Reply-To:
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E323A4@CITESMBX5.ad.uillinois.edu>

Alexey,

Just want to point out that 'relaxed phylip' format was introduced long after this parser was created; in fact (as Adam points out) there was an alternative workaround to deal with the lossy names.

The content of that page is on a wiki, which anyone is free to edit (just need an OpenID to set up an account).
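
The workaround for lossy names mentioned above boils down to swapping long taxon names for short unique placeholders before writing strict PHYLIP, then swapping them back on the resulting tree. A minimal sketch of that round trip in plain Perl follows. The function name make_safe_names and the "S" plus zero-padded counter scheme are assumptions made for this example; they are not what BioPerl's set_displayname_safe() does internally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace each (possibly long) taxon name with a short, unique placeholder
# ID and return lookup tables for both directions. The "S" prefix and the
# zero-padded counter are arbitrary choices for this sketch.
sub make_safe_names {
    my ($names, $idlength) = @_;
    my (%short_for, %long_for);
    my $count = 0;
    for my $name (@$names) {
        next if exists $short_for{$name};    # same name, same placeholder
        my $short = sprintf('S%0*d', $idlength - 1, ++$count);
        die "too many names for $idlength-character IDs\n"
            if length($short) > $idlength;
        $short_for{$name}  = $short;
        $long_for{$short} = $name;
    }
    return (\%short_for, \%long_for);
}

# Two names that would collide if simply truncated to 10 characters.
my @taxa = ('Escherichia_coli_K12_MG1655', 'Escherichia_coli_O157_H7_Sakai');
my ($short_for, $long_for) = make_safe_names(\@taxa, 8);

for my $name (@taxa) {
    print "$short_for->{$name} <- $name\n";
}

# After the tree is built with the short IDs, swap the labels back:
print $long_for->{'S0000002'}, "\n";
```

The point of the mapping is that truncation alone would turn both example names into "Escherichi", whereas the placeholders stay unique at any ID length the downstream tool accepts.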

chris

From cjfields at illinois.edu Thu May 23 10:05:32 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 23 May 2013 14:05:32 +0000
Subject: [Bioperl-l] Phylip format error
In-Reply-To:
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu>

On May 23, 2013, at 3:30 AM, Peter Cock wrote:

> On Thu, May 23, 2013 at 8:22 AM, Alexey Morozov wrote:
>> Which is also worsened by the fact that there is relaxed phylip format,
>> which allows up to 250 chars for taxon name. They are separated from a
>> sequence by single space, which creates problems if names were extended to
>> 10 chars in strict Felsenstein's format by whitespaces. On the whole,
>> phylip is as messily defined format as one can make from a plain textfile
>> with information content of fasta.
>> Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed
>> phylip and how does it tell dialects from one another. Even if code support
>> is OK, it may be worthwhile to explain it somewhere at bioperl.org
>
> Biopython's AlignIO defines both a (strict) "phylip" and "relaxed-phylip"
> as two separate formats (or variants, like the "fastq" variants). Doing
> the same in BioPerl would seem sensible since auto-detection is not
> easy.
>
> http://biopython.org/wiki/AlignIO#File_Formats
>
> Peter
>
> P.S. Where does that 250 characters for the taxon name limit come from?
> The trouble with relaxed phylip is that some tools are more relaxed than
> others ;)

As Adam pointed out, prior to the introduction of 'relaxed phylip' we had an alternative solution that didn't require a modified format but still allowed one to use PHYLIP and other tools requesting the format.

I think 'relaxed phylip' was introduced by CIPRES a few years back. Frankly, this is the first time I have seen this mentioned on the list; yay, yet another format variation :)

The variant format parsing (as implemented for SeqIO::fastq, as you know) deals with variant names like 'fastq-sanger', where the main format name is first, the variant of the format second. The order in this case is reversed (relaxed-phylip), which I'm pretty sure will not work. Not impossible to allow, but we would probably allow support like this initially:

    my $in = Bio::AlignIO->new(-format  => 'phylip',
                               -variant => 'relaxed',
                               ...);

chris

From cjfields at illinois.edu Thu May 23 09:56:32 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 23 May 2013 13:56:32 +0000
Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment
In-Reply-To: <92CD2F9D-7C8E-4941-9354-4BD80D8C2CB6@gmail.com>
References: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com> <118F034CF4C3EF48A96F86CE585B94BF74E2A01D@CHIMBX5.ad.uillinois.edu> <92CD2F9D-7C8E-4941-9354-4BD80D8C2CB6@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E32489@CITESMBX5.ad.uillinois.edu>

(keep the list cc'd)

On May 22, 2013, at 6:31 PM, Senanu wrote:

> On May 22, 2013, at 4:17 PM, Fields, Christopher J wrote:
>
>>> Hi all,
>>>
>>> I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong.
>>
>> Probably the former, but...
>>
>>> I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus. (I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments.
>>
>> It shouldn't take that long, 7 Mb isn't that large. Or is that 7 Mb for one genome?
>
> It is 7Mb per genome, but there are only 2 genomes in the alignment, and the sequences are very similar to one another.
>
>>> Is this a known problem? Is there another way to generate such a consensus?
>>
>> The code isn't really optimized for this, but again this isn't terribly large. Is the bottleneck reading the alignment in, or is it the consensus_iupac() step? Hard to say w/o seeing the alignment data itself.
>
> The bottleneck is definitely with the consensus_iupac step. Reading the alignment in takes a few seconds.

That's interesting, but again not surprising. One would have to look at the code, but I wouldn't be surprised if the method is terribly inefficient.

chris

From p.j.a.cock at googlemail.com Thu May 23 10:53:09 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 May 2013 15:53:09 +0100
Subject: [Bioperl-l] Phylip format error
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu>
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu>
Message-ID:

On Thu, May 23, 2013 at 3:05 PM, Fields, Christopher J wrote:
>
> I think 'relaxed phylip' was introduced by CIPRES a few years back.
> Frankly, this is the first time I have seen this mentioned on the list; yay,
> yet another format variation :)

The relaxed phylip 'format' goes back further than that, e.g.
http://lists.open-bio.org/pipermail/biopython-dev/2008-June/003899.html

RAxML and PHYML support relaxed phylip - but with their own ID limits.
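
To make the strict versus relaxed distinction discussed in this thread concrete, here is a small plain-Perl sketch that writes the same sequential alignment both ways. The function format_phylip and its 'strict'/'relaxed' argument are hypothetical names for this example and are not part of Bio::AlignIO. The relaxed branch imposes no name-length limit of its own, since (as noted above) each tool sets its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format one sequential PHYLIP alignment. In 'strict' mode each name is
# truncated or space-padded to exactly 10 characters, so the sequence
# starts in column 11. In 'relaxed' mode the full name is kept and is
# followed by a single space; names containing whitespace are rejected,
# because the first space is what ends the name.
sub format_phylip {
    my ($variant, @rows) = @_;    # each row is [ name, sequence ]
    my $ntax  = scalar @rows;
    my $nchar = length $rows[0][1];
    my $out   = " $ntax $nchar\n";
    for my $row (@rows) {
        my ($name, $seq) = @$row;
        die "sequences differ in length\n" unless length($seq) == $nchar;
        if ($variant eq 'strict') {
            $out .= sprintf("%-10.10s%s\n", $name, $seq);
        }
        elsif ($variant eq 'relaxed') {
            die "whitespace in name '$name'\n" if $name =~ /\s/;
            $out .= "$name $seq\n";
        }
        else {
            die "unknown variant '$variant'\n";
        }
    }
    return $out;
}

my @rows = ( [ 'Homo_sapiens_mt', 'ACGTACGT' ], [ 'Pan', 'ACGTTCGT' ] );
print format_phylip('strict',  @rows);
print format_phylip('relaxed', @rows);
```

In strict mode the 15-character name is cut to "Homo_sapie", which is exactly the lossy behaviour that makes name collisions possible; in relaxed mode the name survives intact but a reader has to know which dialect it is looking at.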
Peter

From Ben.Ward at sainsbury-laboratory.ac.uk Fri May 24 15:14:17 2013
From: Ben.Ward at sainsbury-laboratory.ac.uk (Ben Ward (TSL))
Date: Fri, 24 May 2013 19:14:17 +0000
Subject: [Bioperl-l] Building Many trees with Bioperl
Message-ID:

Hi,

I'm new to Bioperl and plan to make a script to automate making trees with many alignment files (themselves generated by automating the process of multiple alignment for many datasets by using clustalw in a bioperl script).
But I need to get it right for one piece of test data before I can do it for all:

What I have produced so far is the below. It's supposed to load in the alignment file as a SimpleAlign. Then use that alignment in phyml. I looked at the documentation and tried to follow examples.

However it gives me many errors like:
--------------------- WARNING ---------------------
MSG: Replacing one sequence [FXCNDTJ02P/1-366]

And then gives me:
Phyml command = /Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Phyml call (/Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output [/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip_phyml_stat.txt]: 11
STACK: Error::throw
STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472
STACK: Bio::Tools::Run::Phylo::Phyml::_run /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851
STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338
-----------------------------------------------------------

Can someone let me know if I'm going about this correctly and what I need to do to get it to work.
I've also tried to run phyml by giving the filename in the run() method like: my $inputfilename = 'outputalignmentfile'; my $tree = $factory->run("$inputfilename"); But I get the same EXCEPTION: Bio::Root::Exception message. Thanks, Ben W. SCRIPT --- #!/usr/bin/perl use warnings; use strict; BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' } use Bio::TreeIO; use Bio::AlignIO; use Bio::Tools::Run::Phylo::Phyml; my $alnin = Bio::AlignIO->new(-file => " 'phylip'); my $aln = $alnin->next_aln(); # Make a Phyml factory. my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose => 2, -data_type => 'dna'); # Pass the factory an alignment and run: # my $inputfilename = 'outputalignmentfile'; my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object. # Setup tree output stream... my $treeio = Bio::TreeIO->new(-format => 'newick', -file => 'tree.newick'); $treeio->write_tree($tree); exit 0; From bosborne11 at verizon.net Fri May 24 17:25:38 2013 From: bosborne11 at verizon.net (Brian Osborne) Date: Fri, 24 May 2013 17:25:38 -0400 Subject: [Bioperl-l] Building Many trees with Bioperl In-Reply-To: References: Message-ID: <93D0963E-8399-468C-A32B-875FAA5054D1@verizon.net> Ben, What happens when you take the Phyml command itself and run it from the command line? Also, a minor point: the message "MSG: Replacing one sequence [FXCNDTJ02P/1-366]" is not an error, it is a warning. An error accompanies an exit, warnings are just informative. Brian O. On May 24, 2013, at 3:14 PM, Ben Ward (TSL) wrote: > Hi, > > I'm new to Bioperl and plan to make a script to automate making trees with many alignment files (themselves generated by automating the process of multiple alignment for many datasets by using clustalw in a bioperl script). > But I need to get it right for one pice of test data before I can do it for all: > > What I have produced so far is the below. It's supposed to load in the alignment file as as SimpleAlign. Then use that alignment in phyml. 
I looked at the documentation and tried to follow examples. > > However it gives me many errors like: > --------------------- WARNING --------------------- > MSG: Replacing one sequence [FXCNDTJ02P/1-366] > > And then gives me: > Phyml command = /Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Phyml call (/Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output [/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip_phyml_stat.txt]: 11 > STACK: Error::throw > STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472 > STACK: Bio::Tools::Run::Phylo::Phyml::_run /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851 > STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338 > ----------------------------------------------------------- > > Can someone let me know if I'm going about this correctly and what I need to do to get it to work. I've also tried to run phyml by giving the filename in the run() method like: > my $inputfilename = 'outputalignmentfile'; > my $tree = $factory->run("$inputfilename"); > > But I get the same EXCEPTION: Bio::Root::Exception message. > > Thanks, > Ben W. > > SCRIPT --- > > #!/usr/bin/perl > use warnings; > use strict; > BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' } > use Bio::TreeIO; > use Bio::AlignIO; > use Bio::Tools::Run::Phylo::Phyml; > > my $alnin = Bio::AlignIO->new(-file => " -format => 'phylip'); > > my $aln = $alnin->next_aln(); > > > # Make a Phyml factory. > my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose => 2, > -data_type => 'dna'); > > # Pass the factory an alignment and run: > # my $inputfilename = 'outputalignmentfile'; > > my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object. 
> > > # Setup tree output stream... > my $treeio = Bio::TreeIO->new(-format => 'newick', > -file => 'tree.newick'); > > $treeio->write_tree($tree); > > exit 0; > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Ben.Ward at sainsbury-laboratory.ac.uk Fri May 24 17:46:40 2013 From: Ben.Ward at sainsbury-laboratory.ac.uk (Ben Ward (TSL)) Date: Fri, 24 May 2013 21:46:40 +0000 Subject: [Bioperl-l] Building Many trees with Bioperl In-Reply-To: <93D0963E-8399-468C-A32B-875FAA5054D1@verizon.net> Message-ID: Hi, If I just run phyml on the command line it seems to run ok - it accept's my file and appears to undergo the tree building process - I haven't actually see it complete yet, but ML and Bayes always does take a while - and I have many reads that need to be aligned. But PhyML gets to the point it asks - are you sure you want to proceed - I say yes, then it keeps quiet and is currently working along to itself: . 766 patterns found (out of a total of 795 sites). . 58 sites without polymorphism (7.30%). . Computing pairwise distances... . Building BioNJ tree... . WARNING: this analysis requires at least 556 MB of memory space. . Do you really want to proceed? [Y/n] Y It appears to be working =/ Best, Ben. On 24/05/2013 22:25, "Brian Osborne" wrote: >Ben, > >What happens when you take the Phyml command itself and run it from the >command line? > >Also, a minor point: the message "MSG: Replacing one sequence >[FXCNDTJ02P/1-366]" is not an error, it is a warning. An error >accompanies an exit, warnings are just informative. > >Brian O. > > >On May 24, 2013, at 3:14 PM, Ben Ward (TSL) > wrote: > >> Hi, >> >> I'm new to Bioperl and plan to make a script to automate making trees >>with many alignment files (themselves generated by automating the >>process of multiple alignment for many datasets by using clustalw in a >>bioperl script). 
>> But I need to get it right for one pice of test data before I can do it >>for all: >> >> What I have produced so far is the below. It's supposed to load in the >>alignment file as as SimpleAlign. Then use that alignment in phyml. I >>looked at the documentation and tried to follow examples. >> >> However it gives me many errors like: >> --------------------- WARNING --------------------- >> MSG: Replacing one sequence [FXCNDTJ02P/1-366] >> >> And then gives me: >> Phyml command = /Users/wardb/clustalw2/phyml >>/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyli >>p 0 i 1 0 HKY e e 1 e BIONJ y y >> >> ------------- EXCEPTION: Bio::Root::Exception ------------- >> MSG: Phyml call (/Users/wardb/clustalw2/phyml >>/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyli >>p 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output >>[/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyl >>ip_phyml_stat.txt]: 11 >> STACK: Error::throw >> STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472 >> STACK: Bio::Tools::Run::Phylo::Phyml::_run >>/Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851 >> STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338 >> ----------------------------------------------------------- >> >> Can someone let me know if I'm going about this correctly and what I >>need to do to get it to work. I've also tried to run phyml by giving the >>filename in the run() method like: >> my $inputfilename = 'outputalignmentfile'; >> my $tree = $factory->run("$inputfilename"); >> >> But I get the same EXCEPTION: Bio::Root::Exception message. >> >> Thanks, >> Ben W. 
>>
>> SCRIPT ---
>>
>> #!/usr/bin/perl
>> use warnings;
>> use strict;
>> BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' }
>> use Bio::TreeIO;
>> use Bio::AlignIO;
>> use Bio::Tools::Run::Phylo::Phyml;
>>
>> my $alnin = Bio::AlignIO->new(-file => "> -format => 'phylip');
>>
>> my $aln = $alnin->next_aln();
>>
>>
>> # Make a Phyml factory.
>> my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose => 2,
>> -data_type => 'dna');
>>
>> # Pass the factory an alignment and run:
>> # my $inputfilename = 'outputalignmentfile';
>>
>> my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object.
>>
>>
>> # Setup tree output stream...
>> my $treeio = Bio::TreeIO->new(-format => 'newick',
>> -file => 'tree.newick');
>>
>> $treeio->write_tree($tree);
>>
>> exit 0;
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From himaghna.bhattacharjee at gmail.com Tue May 28 11:59:02 2013
From: himaghna.bhattacharjee at gmail.com (Himaghna Bhattacharjee)
Date: Tue, 28 May 2013 21:29:02 +0530
Subject: [Bioperl-l] error in the link to install Kobe repository for windows
Message-ID:

Hey,

the link to install Kobe's repository for Perl 5.10 (http://cpan.uwinnipeg.ca/PPMPackages/10xx/) seems to be broken, as it shows Error 503 Service Temporarily Unavailable. Could you please suggest an alternative?

Thanks.

Himaghna Bhattacharjee
3rd year B.E.(Hons.) Chemical Engineering
Birla Institute of Technology and Science, Pilani
Rajasthan 333 031

From wgallin at ualberta.ca Tue May 28 13:49:02 2013
From: wgallin at ualberta.ca (Warren Gallin)
Date: Tue, 28 May 2013 11:49:02 -0600
Subject: [Bioperl-l] ReplacedBy value in esummary
Message-ID: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca>

Hi,

I just encountered a glitch when I was trying to update some entries in a database by finding updated GENBANK protein entries.
The original record was gi 118091304 which has been replaced by gi 363734282 I retrieved an ESummary of the record for gi 118091304 as a Bio::Tools::EUtilities::Summary::DocSum object (called $ds). When I then tried to retrieve the gi number for the replacement by using: my $replaced = $ds->get_contents_by_name('ReplacedBy'); the returned value was 1, and when I dumped the ESummary record the relevant pair is ReplacedBy :XP_421022.3. The full Esummary dump is: UID :118091304 Caption :XP_421022 Title :PREDICTED: similar to Potassium voltage-gated channel, subfamily Q, member :1 [Gallus gallus] Extra :gi|118091304|ref|XP_421022.2|[118091304] Gi :118091304 CreateDate :2004/07/28 UpdateDate :2006/11/16 Flags :512 TaxId :9031 Length :643 Status :replaced ReplacedBy :XP_421022.3 Comment : This record was replaced or removed. So two questions: 1) Is the ReplacedBy value always supposed to be the new accession number version rather than the new gi number? 2) Why would I be getting a returned value of 1 instead of the accession number that is in the summary record? Any advice appreciated. Warren Gallin From cjfields at illinois.edu Tue May 28 14:31:30 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 28 May 2013 18:31:30 +0000 Subject: [Bioperl-l] ReplacedBy value in esummary In-Reply-To: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca> References: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E44377@CHIMBX5.ad.uillinois.edu> On May 28, 2013, at 12:49 PM, Warren Gallin wrote: > Hi, > > I just encountered a glitch when I was trying to update some entries in a database by finding updated GENBANK protein entries. > > The original record was gi 118091304 which has been replaced by gi 363734282 > > I retrieved an ESummary of the record for gi 118091304 as a Bio::Tools::EUtilities::Summary::DocSum object (called $ds). 
> > When I then tried to retrieve the gi number for the replacement by using: > > my $replaced = $ds->get_contents_by_name('ReplacedBy'); > > the returned value was 1, and when I dumped the ESummary record the relevant pair is ReplacedBy :XP_421022.3. > > The full Esummary dump is: > > UID :118091304 > Caption :XP_421022 > Title :PREDICTED: similar to Potassium voltage-gated channel, subfamily Q, member > :1 [Gallus gallus] > Extra :gi|118091304|ref|XP_421022.2|[118091304] > Gi :118091304 > CreateDate :2004/07/28 > UpdateDate :2006/11/16 > Flags :512 > TaxId :9031 > Length :643 > Status :replaced > ReplacedBy :XP_421022.3 > Comment : This record was replaced or removed. > > > So two questions: > > 1) Is the ReplacedBy value always supposed to be the new accession number version rather than the new gi number? No idea, the best people to answer that would be NCBI (the idea of these modules was to simplify getting at that data instead of munging the XML, but whatever they report is mainly from NCBI, not bioperl). > 2) Why would I be getting a returned value of 1 instead of the accession number that is in the summary record? The text dump above indicates the values do exist. However, you are calling a method that returns a list (note the plural in the name) in scalar context, so you get the number of values. If you always expect a single value, use: my ($replaced) = $ds->get_contents_by_name('ReplacedBy'); which forces array context. That should fix it. chris > Any advice appreciated. > > Warren Gallin > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From florent.angly at gmail.com Wed May 1 22:16:02 2013 From: florent.angly at gmail.com (Florent Angly) Date: Thu, 02 May 2013 12:16:02 +1000 Subject: [Bioperl-l] Downloading sequences in batch from Trace Archive In-Reply-To: References: Message-ID: <5181CC62.9000609@gmail.com> Maybe using EUtilities? 
http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook http://www.bioperl.org/wiki/HOWTO:EUtilities_Web_Service Florent On 30/04/13 06:25, shalabh sharma wrote: > Hi All, > Is there any module in Bioperl that can download sequences from > NCBI's trace archive? > > Thanks > Shalabh > From jason.stajich at gmail.com Thu May 2 01:42:55 2013 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 1 May 2013 22:42:55 -0700 Subject: [Bioperl-l] Fwd: doubt References: Message-ID: Begin forwarded message: > From: ARYA DAS > Subject: doubt > Date: May 1, 2013 10:42:21 PM PDT > To: jason at bioperl.org > > sir, > > Am using windows7 n was trying to install bio perl in it..i have > already installed active perl.5.16.3.1603 . n was followeing the > installation procedure mentioned .when i tried GUI installation .. i cant > find bioperl package when i try to search them for installation. > while using command line.. > > ppm> install PPM-Repositories > > shows error like cant find package that provides PPM repositories, > > and when i try manually ,on reaching the > perl Build test > > it says build is recognised as an internal or external file. > > please help if time permits > > regards, > arya Jason Stajich jason.stajich at gmail.com jason at bioperl.org From voldrani at gmail.com Sun May 5 00:03:38 2013 From: voldrani at gmail.com (Chris Maloney) Date: Sun, 5 May 2013 00:03:38 -0400 Subject: [Bioperl-l] Wiki work, Template:Doclink Message-ID: The module pages on the wiki could look a little better, like this one for example: http://www.bioperl.org/wiki/Module:Bio::Tools::Run::RemoteBlast. There used to be a bunch of extra whitespace at the top of the page, which was caused by extra line breaks in Template:Doclink, which I just removed. But, I think there are other improvements that could be made. I would like to turn this into an infobox -- which are the helpful informative tables of info on Wikipedia that appear on many articles on the upper right. 
That would allow us to add more links -- like to metacpan, for example. It is not completely trivial to import infoboxes into a wiki though, I just discovered. I just went through the exercise on my home wiki, and it involves importing a lot of templates from Wikipedia, and fixing up the common.css. You can see the full list of imported templates here: http://chrismaloney.org/wiki/index.php?title=Special:RecentChanges&limit=100. I don't *think* this should cause any problems, but I'm not 100% sure. On the other hand, if it does, it should be easy to roll back -- it's a wiki, after all. Does anybody have a problem if I do this? I'll wait a day for responses, and tackle this tomorrow, if no one objects. -- Chris M. From armendarez77 at hotmail.com Tue May 7 20:32:22 2013 From: armendarez77 at hotmail.com (Veronica A.) Date: Tue, 7 May 2013 17:32:22 -0700 Subject: [Bioperl-l] Bio::SeqIO doesn't write all gbk sequences from Bio::DB::GenBank Message-ID: Hello, I'm currently running Bio::Perl 1.6.1 on Ubuntu 12.04.2 LTS and have noticed a problem with Bio::SeqIO when writing genbank files using the write_seq() function; some of the files do not include an 'ORIGIN' tag or the sequence. I am using GI#s (50 at a time every 2 minutes) to retrieve genbank files via Bio::DB::GenBank. 
----------------------------------------START CODE----------------------------------
my $gb = Bio::DB::GenBank->new(-verbose=>-1);
my $seqout = Bio::SeqIO->new(-file=>">$fileName", '-format'=>'Genbank', -alphabet=>'dna', -flush=>0, -verbose=>-1);
while(@ids){
    my @batchArray = splice(@ids, 0, 50);
    my $batchArrayRef = \@batchArray;
    my $streamObj;
    my $pid = fork();
    if($pid == 0){
        eval{
            $streamObj = $gb->get_Stream_by_id($batchArrayRef);
        };
        if($@){
            print "Error: ".$@."\n";
        }
        else{
            while(my $seqObj = $streamObj->next_seq()){
                unless($seqObj->accession_number() =~ /N[A-Z]\_/){
                    #print "ID: ".$seqObj->id()."\n";
                    #print "Seq:\n".$seqObj->seq()."\n";
                    $seqout->write_seq($seqObj);
                }
            }
        }
        exit 0;
    }
}
waitpid($pid,0);
sleep(120);
----------------------------------------END CODE----------------------------------

Most of the Genbank files written to the output file have sequences, but there is a small portion that do not, even though they should. For example, JX287367, in NCBI includes an 'ORIGIN' tag and sequence and when I use the print function before writing to file, the sequence is printed to STDOUT, but the 'ORIGIN' tag and sequence are not written to the output gbk file. The following is found in the final output file:

----------------------------------START GBK-----------------------------------------
LOCUS       JX287367                 588 bp    DNA     linear   BCT 19-DEC-2012
DEFINITION  Chlamydia trachomatis strain UW-5/CX pyruvoyl-dependent arginine
            decarboxylase (aaxB) gene, complete cds.
ACCESSION   JX287367
VERSION     JX287367.1  GI:404351720
KEYWORDS    .
SOURCE      Chlamydia trachomatis
  ORGANISM  Chlamydia trachomatis
            Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae;
            Chlamydia/Chlamydophila group; Chlamydia.
REFERENCE   1  (bases 1 to 588)
  AUTHORS   Bliven,K.A., Fisher,D.J. and Maurelli,A.T.
  TITLE     Characterization of the activity and expression of arginine
            decarboxylase in human and animal Chlamydia pathogens
  JOURNAL   FEMS Microbiol. Lett.
337 (2), 140-146 (2012) PUBMED 23043454REFERENCE 2 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Direct Submission JOURNAL Submitted (06-JUL-2012) Department of Microbiology and Immunology, F. Edward Hebert School of Medicine, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, MD 20814, USAFEATURES Location/Qualifiers source 1..588 /mol_type="genomic DNA" /db_xref="taxon:813" /strain="UW-5/CX" /organism="Chlamydia trachomatis" /serovar="E" gene 1..588 /gene="aaxB" CDS 1..588 /protein_id="AFR60849.1" /gene="aaxB" /transl_table=11 /note="AaxB" /db_xref="GI:404351721" /codon_start=1 /product="pyruvoyl-dependent arginine decarboxylase" /translation="MPYGTRYPTLAFHTGGVGESDDGMPPQPFETFCYDSALLQAKIE NFNIVPYTSVLPKELFGNILPVDQCTKFFKHGAVLEVIMAGRGATVTDGTQAIATGVG ICWGKDKNGELIGGWAAEYVEFFPTWIDDEIAESHAKMWLKKSLQHELDLRSVSKHSE FQYFHNYINIRKKFGFCLTALGFLNFENVAPAVIQ" // ----------------------------------END GBK----------------------------------------- Can anyone tell what I am missing or why this is happening? I don't know if this has happened in earlier BioPerl versions as up until now, I usually downloaded sequences straight from NCBI, but that became too time consuming....but this seems to be as well :S Thank you in advance for any help, Veronica From cjfields at illinois.edu Tue May 7 22:17:43 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 8 May 2013 02:17:43 +0000 Subject: [Bioperl-l] Bio::SeqIO doesn't write all gbk sequences from Bio::DB::GenBank In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E166A0@CHIMBX5.ad.uillinois.edu> Veronica, Your mail may have garbled the script and example file. Can you paste these in a gist? https://gist.github.com/ chris On May 7, 2013, at 7:32 PM, Veronica A. 
wrote: > Hello, > I'm currently running Bio::Perl 1.6.1 on Ubuntu 12.04.2 LTS and have noticed a problem with Bio::SeqIO when writing genbank files using the write_seq() function; some of the files do not include an 'ORIGIN' tag or the sequence. > > I am using GI#s (50 at a time every 2 minutes) to retrieve genbank files via Bio::DB::GenBank. > > ----------------------------------------START CODE---------------------------------- > my $gb = Bio::DB::GenBank->new(-verbose=>-1);my $seqout = Bio::SeqIO->new(-file=>">$fileName", '-format'=>'Genbank', -alphabet=>'dna', -flush=>0, -verbose=>-1);while(@ids){ my @batchArray = splice(@ids, 0, 50); my $batchArrayRef = \@batchArray; > my $streamObj; my $pid = fork(); if($pid == 0){ eval{ $streamObj = $gb->get_Stream_by_id($batchArrayRef); }; if($@){ print "Error: ".$@."\n"; } else{ while(my $seqObj = $streamObj->next_seq()){ unless($seqObj->accession_number() =~ /N[A-Z]\_/){ #print "ID: ".$seqObj->id()."\n"; #print "Seq:\n".$seqObj->seq()."\n"; $seqout->write_seq($seqObj); } } } exit 0; }}waitpid($pid,0);sleep(120); > ----------------------------------------END CODE---------------------------------- > Most of the Genbank files written to the output file have sequences, but there is a small portion that do not, even though they should. For example, JX287367, in NCBI includes an 'ORIGIN' tag and sequence and when I use the print function before writing to file, the sequence is printed to STDOUT, but the 'ORIGIN' tag and sequence are not written to the output gbk file. 
The following is found in the final output file: > ----------------------------------START GBK----------------------------------------- > LOCUS JX287367 588 bp DNA linear BCT 19-DEC-2012DEFINITION Chlamydia trachomatis strain UW-5/CX pyruvoyl-dependent arginine decarboxylase (aaxB) gene, complete cds.ACCESSION JX287367VERSION JX287367.1 GI:404351720KEYWORDS .SOURCE Chlamydia trachomatis ORGANISM Chlamydia trachomatis Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae; Chlamydia/Chlamydophila group; Chlamydia.REFERENCE 1 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Characterization of the activity and expression of arginine decarboxylase in human and animal Chlamydia pathogens JOURNAL FEMS Microbiol. Lett. 337 (2), 140-146 (2012) PUBMED 23043454REFERENCE 2 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Direct Submission JOURNAL Submitted (06-JUL-2012) Department of Microbiology and Immunology, F. Edward Hebert School of Med! > icine, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, MD 20814, USAFEATURES Location/Qualifiers source 1..588 /mol_type="genomic DNA" /db_xref="taxon:813" /strain="UW-5/CX" /organism="Chlamydia trachomatis" /serovar="E" gene 1..588 /gene="aaxB" CDS 1..588 /protein_id="AFR60849.1" /gene="aaxB" /transl_table=11 /note="AaxB" /db_xref="GI:404351721" /codon_start=1 /product="pyruvoyl-dependent arginine decarboxylase" /translation="MPYGTRYPTLAFHTGGVGESDDGMPPQPFETFCYDSALLQAKIE NFNIVPYTSVLPKELFGNILPVDQCTKFFKHGAVLEVIMAGRGATVTDGTQAIATGVG ICWGKDKNGELIGGW! > AAEYVEFFPTWIDDEIAESHAKMWLKKSLQHELDLRSVSKHSE FQYFHN > YINIRKKFGFCLTALGFLNFENVAPAVIQ" > // > ----------------------------------END GBK----------------------------------------- > Can anyone tell what I am missing or why this is happening? 
I don't know if this has happened in earlier BioPerl versions as up until now, I usually downloaded sequences straight from NCBI, but that became too time consuming....but this seems to be as well :S > Thank you in advance for any help, > Veronica > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From witch.of.agnessi at gmail.com Wed May 8 15:24:53 2013 From: witch.of.agnessi at gmail.com (WoA) Date: Wed, 8 May 2013 12:24:53 -0700 (PDT) Subject: [Bioperl-l] Extracting matching subsequence from pairwise alignment Message-ID: <1368041092972-16935.post@n3.nabble.com> Hello All, I've a pairwise global alignemnet of two DNA sequences generated by the program NEEDLE of EMBOSS package. I wish to extract the sub-sequence that matches/aligns to a given region of the other sequence. In this alignment (Pastebin Link) the given region (actually the CDS) falls between base number 24:485 in the original sequence with ID 'XM_001005073.' I wish to extract the sub-sequence in the sequence ID 'Homolog' that aligns with that 24:485 region of the other sequence. I'm using Bioperl to parse the alignment. I find out the the alignment column numbers corresponding to 24:485 region in the particular sequence, using 'column_from_residue_number'. Then I extract the sub-sequence from the 'aligned' other sequence(containing gaps) using the corresponding column numbers. Finally I remove the gap characters. Am I doing this thing correctly and are there any pitfalls ? Is there any better way to do it by (Bio)Perl/Python? 
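The procedure described above (map residue positions to alignment columns on the reference row, cut the same columns from the other row, then drop the gaps) can be checked on a toy example in plain Perl. This is only a sketch under stated assumptions: the two aligned strings are invented, '-' is taken as the only gap character, and the helper names column_of_residue and aligned_region are hypothetical stand-ins, not BioPerl methods.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based alignment column holding the
# $n-th non-gap residue of a gapped row, or undef if the row has fewer
# than $n residues.
sub column_of_residue {
    my ($row, $n) = @_;
    my $seen = 0;
    for my $col (1 .. length $row) {
        next if substr($row, $col - 1, 1) eq '-';
        return $col if ++$seen == $n;
    }
    return undef;
}

# Hypothetical helper: extract the part of $other_row aligned to residues
# $start..$end of $ref_row, with gap characters removed.
sub aligned_region {
    my ($ref_row, $other_row, $start, $end) = @_;
    my $col_start = column_of_residue($ref_row, $start);
    my $col_end   = column_of_residue($ref_row, $end);
    return undef unless defined $col_start && defined $col_end;
    my $slice = substr($other_row, $col_start - 1, $col_end - $col_start + 1);
    $slice =~ tr/-//d;    # remove gaps from the extracted slice
    return $slice;
}

# Invented toy alignment (both rows have the same length).
my $ref   = 'AC--GTTA';
my $other = 'ACGGG-TA';

# Residues 2..4 of $ref occupy columns 2..6, so this prints "CGGG".
print aligned_region($ref, $other, 2, 4), "\n";
```

One pitfall the toy case makes visible: columns where the reference row has gaps are still inside the slice, so insertions in the other sequence are carried into the result and the extracted region need not have the same length, or reading frame, as the reference CDS.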
The code goes here:

use strict;
use warnings;
use Bio::AlignIO;

# read in an alignment generated by the EMBOSS program Needle
my $in = new Bio::AlignIO(-format => 'emboss', -file => 'test_needle.aln');
while( my $aln = $in->next_aln ) {
    #Seqnames: 'XM_001005073.'(CDS:24-485),'Homolog'
    my ($cds_start,$cds_end)=(24,485);#
    my $col_cdsstart = $aln->column_from_residue_number( 'XM_001005073.', $cds_start);
    my $col_cdsend= $aln->column_from_residue_number( 'XM_001005073.', $cds_end);
    foreach my $seq ($aln->each_seq) {
        if($seq->id() eq 'Homolog'){
            my $homolog_cds=$seq->subseq($col_cdsstart,$col_cdsend);
            $homolog_cds=~s/\-//g;
            print $homolog_cds,"\n";
        }
    }
}

From hlapp at drycafe.net Wed May 15 16:44:07 2013
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Wed, 15 May 2013 16:44:07 -0400
Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences
Message-ID:

FYI, if you haven't seen this yet:

http://wssspe.researchcomputing.org.uk/

It seems to me that the Bio* projects, perhaps led by BioPerl as the oldest and thus longest running (nowadays more fancily called "sustained") of them would have a lot to say about the subject. Anyone interested in a joint submission?

Also, I notice that Biojava's Andreas is on the organizing committee, so maybe he's been conspiring on something already :-)

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 203 bytes
Desc: Message signed with OpenPGP using GPGMail
URL:

From carandraug+dev at gmail.com Wed May 15 21:53:55 2013
From: carandraug+dev at gmail.com (Carnë Draug)
Date: Thu, 16 May 2013 02:53:55 +0100
Subject: [Bioperl-l] sets of sequences - how to read?
Message-ID:

Hi

when accessing entrez gene using eutils to get multiple genes, NCBI now returns an Entrezgene-Set[1] rather than a list of EntrezGene. This change must have happened sometime in the last 2 months. Compare:

use Bio::DB::EUtilities;

my %sets = (
    eutil   => 'efetch',
    db      => 'gene',
    retmode => 'text',
    rettype => 'asn1',
    email   => 'bioperl-l at lists.open-bio.org',
);

## this mimics the previous behaviour of the NCBI server but the multiple requests will annoy their servers
my @ids = (3014, 85235);
my $response;
foreach (@ids) {
    my $fetcher = Bio::DB::EUtilities->new(%sets, id => $_);
    $response .= $fetcher->get_Response->content;
}
## print the concatenated per-ID responses ($fetcher is out of scope after the loop)
print $response;

## this used to be the right way to do it, but now returns an Entrezgene-Set
my $fetcher = Bio::DB::EUtilities->new(%sets, id => \@ids);
$response .= $fetcher->get_Response->content;
print $fetcher->get_Response->content;

There is no module to read these Entrezgene-Set in Perl at the moment, since Bio::ASN1::EntrezGene is not able to handle them. I have contacted the module author and sent him a fix[2] and he said he'll try to look into it next week.

However, even with the fix there is another problem. How would one access a set of sequences using the Bio::SeqIO API? There is no method to do that. One could say, to ignore them, and make next_seq return the next sequence of the set. But then we are losing data. After all, it's perfectly viable to have multiple Entrezgene-Set in one file. What would be the right way to do this?

Carnë
[1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html
[2] https://github.com/carandraug/bio-asn1-entrezgene/commit/69d505056d8b7897df6271ffb7a5f39d58873c6b

From cjfields at illinois.edu Thu May 16 00:43:22 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 16 May 2013 04:43:22 +0000
Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences
In-Reply-To:
References:
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu>

Jason and I have discussed looking into opportunities like this; I think it makes sense to try a joint submission.

chris

On May 15, 2013, at 3:44 PM, Hilmar Lapp wrote:

> FYI, if you haven't seen this yet:
>
> http://wssspe.researchcomputing.org.uk/
>
> It seems to me that the Bio* projects, perhaps led by BioPerl as the oldest and thus longest running (nowadays more fancily called "sustained") of them would have a lot to say about the subject. Anyone interested in a joint submission?
>
> Also, I notice that Biojava's Andreas is on the organizing committee, so maybe he's been conspiring on something already :-)
>
> -hilmar
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> ===========================================================
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From p.j.a.cock at googlemail.com Thu May 16 05:10:25 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 16 May 2013 10:10:25 +0100
Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu>
References: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu>
Message-ID:

On Thu, May 16, 2013 at 5:43 AM, Fields, Christopher J wrote:
> Jason and I have discussed looking into opportunities like this; I think it makes
> sense to try a joint submission.
>
> chris

This sounds like a good idea, although given the time and place I am unlikely to be able to attend in person:

First Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE)
(to be held in conjunction with SC13, Sunday, 17 November 2013, Denver, CO, USA)
http://wssspe.researchcomputing.org.uk/

Rather than trying to discuss this over four mailing lists should we switch to the cross project list open-bio-l, or continue off-list?
http://lists.open-bio.org/mailman/listinfo/open-bio-l

Thanks,

Peter

From miquel.ramia at uab.cat Thu May 16 06:42:29 2013
From: miquel.ramia at uab.cat (Miquel Ràmia)
Date: Thu, 16 May 2013 12:42:29 +0200
Subject: [Bioperl-l] Problem compiling Bio::DB::Sam
Message-ID: <5194B815.2010401@uab.cat>

Hi all,

I get this message when compiling Bio::DB::Sam:

Building Bio-SamTools
gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy':
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join'
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt':
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create'
collect2: ld returned 1 exit status
make: *** [bam2bedgraph] Error 1

Is this error related to the module or some dependencies? or maybe a problem with my system?

Any help appreciated, thanks!

--
Miquel Ràmia Jesús
PhD. candidate (PIF)
Evolutionary Bioinformatics Group
(Genomics, Bioinformatics and Evolution Group)
Lab MRB/014 - 93 586 89 58
MRB - Institut de Biologia i Biomedicina (IBB)
Universitat Autònoma de Barcelona (UAB)
08193, Cerdanyola del Vallès
Barcelona (Spain)

From cjfields at illinois.edu Thu May 16 09:12:40 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 16 May 2013 13:12:40 +0000
Subject: [Bioperl-l] Problem compiling Bio::DB::Sam
In-Reply-To: <5194B815.2010401@uab.cat>
References: <5194B815.2010401@uab.cat>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu>

It may be due to the new samtools release (v 0.1.19). I know Heng Li has been working on the code over the last year for threading support (notice the undefined functions). Have you tried v 0.1.18?
chris On May 16, 2013, at 5:42 AM, Miquel R?mia wrote: > Hi all, > > I get this message when compiling Bio::DB::Sam: > > Building Bio-SamTools > > gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' > > collect2: ld returned 1 exit status > > make: *** [bam2bedgraph] Error 1 > > > Is this error related to the module or some dependencies? or maybe a problem with my system? > > Any help appreciated, thanks! > > > -- > Miquel R?mia Jes?s > PhD. candidate (PIF) > Evolutionary Bioinformatics Group > (Genomics, Bioinformatics and Evolution Group) > Lab MRB/014 - 93 586 89 58 > MRB - Institut de Biologia i Biomedicina (IBB) > Universitat Aut?noma de Barcelona (UAB) > 08193, Cerdanyola del Vall?s > Barcelona (Spain) > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu May 16 09:09:45 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 16 May 2013 13:09:45 +0000 Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1FBCF@CHIMBX5.ad.uillinois.edu> Yes, though we need to make sure others (e.g. those not subscribed to open-bio-l) are in the loop. November is a possibility for me. 
chris On May 16, 2013, at 4:10 AM, Peter Cock wrote: > On Thu, May 16, 2013 at 5:43 AM, Fields, Christopher J > wrote: >> Jason and I have discussed looking into opportunity's like this, I think it makes >> sense to try a joint submission. >> >> chris > > This sounds like a good idea, although given the time and place I am > unlikely to be able to attend in person: > > First Workshop on Sustainable Software for Science: Practice and > Experiences (WSSSPE) > (to held in conjunction with SC13, Sunday, 17 November 2013, Denver, CO, USA) > http://wssspe.researchcomputing.org.uk/ > > Rather than trying to discuss this over four mailing lists should we switch > to the cross project list open-bio-l, or continue off-list? > http://lists.open-bio.org/mailman/listinfo/open-bio-l > > Thanks, > > Peter From andreas at sdsc.edu Thu May 16 00:31:34 2013 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 15 May 2013 21:31:34 -0700 Subject: [Bioperl-l] [Biojava-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: References: Message-ID: Thanks Hilmar, you were faster than me in sending this out.. You are right, it would be very interesting to hear what some of the long running open-bio projects have to say on the topic of sustainability. Let me know if anybody is interested in a submission! Andreas On Wed, May 15, 2013 at 1:44 PM, Hilmar Lapp wrote: > FYI, if you haven't seen this yet: > > http://wssspe.researchcomputing.org.uk/ > > It seems to me that the Bio* projects, perhaps led by BioPerl as the > oldest and thus longest running (nowadays more fancily called "sustained") > of them would have a lot to say about the subject. Anyone interested in a > joint submission? 
> > Also, I notice that Biojava's Andreas is on the organizing committee, so > maybe he's been conspiring on something already :-) > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > =========================================================== > > > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From cjfields at illinois.edu Fri May 17 00:08:04 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 04:08:04 +0000 Subject: [Bioperl-l] sets of sequences - how to read? In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now. My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1. chris On May 15, 2013, at 8:53 PM, Carn? Draug wrote: > Hi > > when accessing entrez gene using eutils to get multiple genes, NCBI > now returns an Entrezgene-Set[1] rather than a list of EntrezGene. > This change must have happened sometime on the last 2 months. 
Compare:
>
> use Bio::DB::EUtilities;
>
> my %sets = (
>   eutil   => 'efetch',
>   db      => 'gene',
>   retmode => 'text',
>   rettype => 'asn1',
>   email   => 'bioperl-l at lists.open-bio.org',
> );
>
> ## this mimics the previous behaviour of the NCBI server but the
> ## multiple requests will annoy their servers
> my @ids = (3014, 85235);
> my $response;
> foreach (@ids) {
>   my $fetcher = Bio::DB::EUtilities->new(%sets, id => $_);
>   $response .= $fetcher->get_Response->content;
> }
> print $response;
>
> ## this used to be the right way to do it, but now returns an Entrezgene-Set
> my $fetcher = Bio::DB::EUtilities->new(%sets, id => \@ids);
> $response .= $fetcher->get_Response->content;
> print $fetcher->get_Response->content;
>
> There is no module to read these Entrezgene-Set in Perl at the moment,
> since Bio::ASN1::EntrezGene is not able to handle them. I have
> contacted the module author and sent him a fix[2] and he said he'll try
> to look into it next week.
>
> However, even with the fix there is another problem. How would one
> access a set of sequences using the Bio::SeqIO API? There is no method
> to do that. One could choose to ignore the sets, and make next_seq return
> the next sequence of the set. But then we are losing data. After all,
> it's perfectly viable to have multiple Entrezgene-Set in one file.
> What would be the right way to do this?
>
> Carnë
> > [1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html > [2] https://github.com/carandraug/bio-asn1-entrezgene/commit/69d505056d8b7897df6271ffb7a5f39d58873c6b > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Fri May 17 00:16:12 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 04:16:12 +0000 Subject: [Bioperl-l] Problem compiling Bio::DB::Sam In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu> References: <5194B815.2010401@uab.cat> <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu> For the record, this is now fixed in the latest Bio::Samtools (via Lincoln). chris On May 16, 2013, at 8:12 AM, "Fields, Christopher J" wrote: > It may be due to the new samtools release (v 0.1.19). I know Heng Li has been working on the code over the last year for threading support (notice the undefined functions). Have you tried v 0.1.18? 
> > chris > > On May 16, 2013, at 5:42 AM, Miquel R?mia wrote: > >> Hi all, >> >> I get this message when compiling Bio::DB::Sam: >> >> Building Bio-SamTools >> >> gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' >> >> collect2: ld returned 1 exit status >> >> make: *** [bam2bedgraph] Error 1 >> >> >> Is this error related to the module or some dependencies? or maybe a problem with my system? >> >> Any help appreciated, thanks! >> >> >> -- >> Miquel R?mia Jes?s >> PhD. candidate (PIF) >> Evolutionary Bioinformatics Group >> (Genomics, Bioinformatics and Evolution Group) >> Lab MRB/014 - 93 586 89 58 >> MRB - Institut de Biologia i Biomedicina (IBB) >> Universitat Aut?noma de Barcelona (UAB) >> 08193, Cerdanyola del Vall?s >> Barcelona (Spain) >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From carandraug+dev at gmail.com Fri May 17 01:12:24 2013 From: carandraug+dev at gmail.com (=?ISO-8859-1?Q?Carn=EB_Draug?=) Date: Fri, 17 May 2013 06:12:24 +0100 Subject: [Bioperl-l] sets of sequences - how to read? 
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu>
References: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu>
Message-ID:

On 17 May 2013 05:08, Fields, Christopher J wrote:
> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now.
>
> My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1.

:s I'm not sure I understood your suggestion. I think the problem is just the introduction of a new concept, a "set" of stuff (genes in this case), and how SeqIO should handle multiple sets.

Carnë

From shalabh.sharma7 at gmail.com Fri May 17 10:54:55 2013
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Fri, 17 May 2013 10:54:55 -0400
Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files
Message-ID:

Hi,

First of all, I am really sorry for sending this mail here; I am not sure if this is the right forum. I know a lot of people work on similar stuff. I wrote to NCBI but nobody replied.

Actually, I am looking for all bacterial/microbial gene annotation nucleotide fasta files. Does anyone know where to download these kinds of files? I tried the *ffn files but they are not annotated. Or is there any module in BioPerl that I can use? I would really appreciate your help.
Thanks Shalabh -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From fossandonc at hotmail.com Fri May 17 11:59:04 2013 From: fossandonc at hotmail.com (=?iso-8859-1?Q?Francisco_J._Ossand=F3n?=) Date: Fri, 17 May 2013 11:59:04 -0400 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: References: Message-ID: Hi, You can get the annotations from here: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ The ".ffn" are the genes nucleotide fasta files but it does not show the product name, on the other hand the ".faa" are the genes aminoacid fasta files and shows the product name, but if you want both product and nucleotide is much better to use the Genbank ".gbk" files that contains the complete data and you can parse it easily using BioPerl to obtain all genes, and then print the /protein_id, /product, and the nucleotide sequences in a new fasta file. Check these to see how to do it: http://www.bioperl.org/wiki/HOWTO:SeqIO http://www.bioperl.org/wiki/HOWTO:Feature-Annotation Cheers, Francisco J. Ossandon -----Mensaje original----- De: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh sharma Enviado el: viernes, 17 de mayo de 2013 10:55 Para: bioperl-l Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files HI, First of all i am really sorry for sending this mail here, i am not sure if this is the right forum. I know lot of people work on similar stuff. I wrote to NCBI but nobody replied. Actually i am looking for all bacterial/microbial gene annotation nucleotide fasta files. Does anyone knows where to download these kind of files. I tried *ffn files but they are not annotated. Or is there any module in bioperl that i can use ? I would really appreciate your help. 
Thanks Shalabh -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From shalabh.sharma7 at gmail.com Fri May 17 12:26:26 2013 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Fri, 17 May 2013 12:26:26 -0400 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: References: Message-ID: Hey Francisco, Thanks a lot. Basically i just wanted gene nucleotide fasta files with GI numbers. I think i will have to parse it from gbk files. -Shalabh On Fri, May 17, 2013 at 11:59 AM, Francisco J. Ossand?n < fossandonc at hotmail.com> wrote: > Hi, > You can get the annotations from here: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > > The ".ffn" are the genes nucleotide fasta files but it does not show the > product name, on the other hand the ".faa" are the genes aminoacid fasta > files and shows the product name, but if you want both product and > nucleotide is much better to use the Genbank ".gbk" files that contains the > complete data and you can parse it easily using BioPerl to obtain all > genes, > and then print the /protein_id, /product, and the nucleotide sequences in a > new fasta file. Check these to see how to do it: > http://www.bioperl.org/wiki/HOWTO:SeqIO > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > Cheers, > > Francisco J. Ossandon > > -----Mensaje original----- > De: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh sharma > Enviado el: viernes, 17 de mayo de 2013 10:55 > Para: bioperl-l > Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > HI, > First of all i am really sorry for sending this mail here, > i am not sure if this is the right forum. 
I know lot of people work on > similar stuff. > I wrote to NCBI but nobody replied. > > Actually i am looking for all bacterial/microbial gene annotation > nucleotide > fasta files. > Does anyone knows where to download these kind of files. > I tried *ffn files but they are not annotated. > Or is there any module in bioperl that i can use ? > I would really appreciate your help. > > Thanks > Shalabh > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences University of Georgia Athens, GA 30602-3636 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From cjfields at illinois.edu Fri May 17 13:37:53 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 17:37:53 +0000 Subject: [Bioperl-l] sets of sequences - how to read? In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E22218@CHIMBX5.ad.uillinois.edu> On May 17, 2013, at 12:12 AM, Carn? Draug wrote: > On 17 May 2013 05:08, Fields, Christopher J wrote: >> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now. >> >> My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1. > > :s I'm not sure I understood your suggestion. 
I think the problem is
> just the introduction of a new concept, a "set" of stuff (genes in
> this case), and how should SeqIO handle multiple sets.
>
> Carnë

(Note: the critical point is whether Bio::ASN1::EntrezGene would allow this; I'm not sure it would. Otherwise this is all really hand-wavy.)

To me a 'set of stuff', particularly when the 'stuff' is stored sequentially in a flat file, is a simple 'database' or 'store' of similar items, where the class gives one the ability to look up particular members in the set, but could also store higher-level information about the set as a whole if needed. If it were me, I would implement a method particular to Bio::SeqIO::entrezgene that specifically creates and returns this ( next_geneset(), for instance ); next_seq() could then be implemented to iterate through the items in that database/store.

Two useful things come out of this. First, if the data for the Entrez Gene file/chunk are parsed to store offsets per ID, one would only need to parse out the chunks needed (offset of ID to next offset), then pass that into the parser and create objects on the fly. This would probably be as fast as or faster than (for instance) the greedy method of parsing the entire file and storing everything in objects up-front, then iterating through those objects one at a time, which I think is the current behavior. Second, if an index is created, the upfront cost is already paid (you could reuse the same index when parsing the same data).

An analogous example might be storing all FASTQ data in a sequencing run; I don't want to expend the effort to parse all the FASTQ data, but I may want to run operations on individual items in the set as well as store additional information about the data (barcodes per run, lanes, overall quality stats, etc).

Does that make sense? The pieces for this are lying around (Bio::Index::*, for instance, has methods for indexing flat files, as do classes like Bio::DB::Fasta).
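The offset idea above can be sketched in plain Perl, with no BioPerl dependency. This is only an illustration under stated assumptions: the record layout (each record starting with a line ">ID") is a stand-in, not the Entrez Gene ASN.1 format, and build_offset_index() and fetch_record() are hypothetical helper names, not part of Bio::Index::* or Bio::SeqIO.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scan the flat file once and record, for every ID, where its record
# starts and how long it is. ASSUMPTION: a record begins with ">ID".
sub build_offset_index {
    my ($path) = @_;
    my (%span_of, $current);
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $pos = 0;
    while (defined(my $line = <$fh>)) {
        if ($line =~ /^>(\S+)/) {
            $current = $1;
            $span_of{$current} = [ $pos, 0 ];    # [start offset, length]
        }
        $span_of{$current}[1] += length $line if defined $current;
        $pos += length $line;
    }
    close $fh;
    return \%span_of;
}

# Read back only the chunk belonging to one ID; nothing else is parsed.
sub fetch_record {
    my ($path, $span_of, $id) = @_;
    my $span = $span_of->{$id} or return;
    my ($start, $length) = @$span;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    seek $fh, $start, 0 or die "Cannot seek in $path: $!";
    read $fh, my $chunk, $length;
    close $fh;
    return $chunk;
}

# Tiny demonstration with a throwaway file holding two records.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} ">3014\nrecord one\n>85235\nrecord two\nmore\n";
close $tmp_fh;

my $index = build_offset_index($tmp_path);
print fetch_record($tmp_path, $index, '85235');
```

The index is paid for once; each later lookup is a single seek and read, and the returned chunk could then be handed to whatever parser builds the actual objects.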
chris From shalabh.sharma7 at gmail.com Sun May 19 15:33:16 2013 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Sun, 19 May 2013 15:33:16 -0400 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz> References: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz> Message-ID: Thanks Russell, Actually i wanted all the Bacterial gene nucleotide files, so i parsed it from *gbk. But yes these files might help me for my other parts of my work. Thanks Shalabh On Sun, May 19, 2013 at 3:26 PM, Smithies, Russell < Russell.Smithies at agresearch.co.nz> wrote: > Another option that I've used before is to download the gene2accession, > gene2refseq, and gene_info files from here > ftp://ftp.ncbi.nih.gov/gene/DATA/ then parse out the data you require. > It might work for you? > > --Russell > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto: > bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma > Sent: Saturday, 18 May 2013 4:26 a.m. > To: Francisco J. Ossand?n > Cc: bioperl-l > Subject: Re: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > Hey Francisco, > Thanks a lot. Basically i just wanted gene nucleotide fasta > files with GI numbers. > I think i will have to parse it from gbk files. > > -Shalabh > > > On Fri, May 17, 2013 at 11:59 AM, Francisco J. 
Ossand?n < > fossandonc at hotmail.com> wrote: > > > Hi, > > You can get the annotations from here: > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > > > > The ".ffn" are the genes nucleotide fasta files but it does not show > > the product name, on the other hand the ".faa" are the genes aminoacid > > fasta files and shows the product name, but if you want both product > > and nucleotide is much better to use the Genbank ".gbk" files that > > contains the complete data and you can parse it easily using BioPerl > > to obtain all genes, and then print the /protein_id, /product, and the > > nucleotide sequences in a new fasta file. Check these to see how to do > > it: > > http://www.bioperl.org/wiki/HOWTO:SeqIO > > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > > > Cheers, > > > > Francisco J. Ossandon > > > > -----Mensaje original----- > > De: bioperl-l-bounces at lists.open-bio.org > > [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh > > sharma Enviado el: viernes, 17 de mayo de 2013 10:55 > > Para: bioperl-l > > Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > > > HI, > > First of all i am really sorry for sending this mail > > here, i am not sure if this is the right forum. I know lot of people > > work on similar stuff. > > I wrote to NCBI but nobody replied. > > > > Actually i am looking for all bacterial/microbial gene annotation > > nucleotide fasta files. > > Does anyone knows where to download these kind of files. > > I tried *ffn files but they are not annotated. > > Or is there any module in bioperl that i can use ? > > I would really appreciate your help. 
> > > > Thanks > > Shalabh > > > > -- > > Shalabh Sharma > > Scientific Computing Professional Associate (Bioinformatics > > Specialist) Department of Marine Sciences University of Georgia > > Athens, GA 30602-3636 _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences University of Georgia Athens, GA 30602-3636 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. 
> ======================================================================= > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From Russell.Smithies at agresearch.co.nz Sun May 19 15:26:35 2013 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Mon, 20 May 2013 07:26:35 +1200 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: References: Message-ID: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz> Another option that I've used before is to download the gene2accession, gene2refseq, and gene_info files from here ftp://ftp.ncbi.nih.gov/gene/DATA/ then parse out the data you require. It might work for you? --Russell -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma Sent: Saturday, 18 May 2013 4:26 a.m. To: Francisco J. Ossand?n Cc: bioperl-l Subject: Re: [Bioperl-l] Downloading annotated Gene nucleotide fasta files Hey Francisco, Thanks a lot. Basically i just wanted gene nucleotide fasta files with GI numbers. I think i will have to parse it from gbk files. -Shalabh On Fri, May 17, 2013 at 11:59 AM, Francisco J. Ossand?n < fossandonc at hotmail.com> wrote: > Hi, > You can get the annotations from here: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > > The ".ffn" are the genes nucleotide fasta files but it does not show > the product name, on the other hand the ".faa" are the genes aminoacid > fasta files and shows the product name, but if you want both product > and nucleotide is much better to use the Genbank ".gbk" files that > contains the complete data and you can parse it easily using BioPerl > to obtain all genes, and then print the /protein_id, /product, and the > nucleotide sequences in a new fasta file. 
Check these to see how to do > it: > http://www.bioperl.org/wiki/HOWTO:SeqIO > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > Cheers, > > Francisco J. Ossandon > > -----Mensaje original----- > De: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh > sharma Enviado el: viernes, 17 de mayo de 2013 10:55 > Para: bioperl-l > Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > HI, > First of all i am really sorry for sending this mail > here, i am not sure if this is the right forum. I know lot of people > work on similar stuff. > I wrote to NCBI but nobody replied. > > Actually i am looking for all bacterial/microbial gene annotation > nucleotide fasta files. > Does anyone knows where to download these kind of files. > I tried *ffn files but they are not annotated. > Or is there any module in bioperl that i can use ? > I would really appreciate your help. > > Thanks > Shalabh > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics > Specialist) Department of Marine Sciences University of Georgia > Athens, GA 30602-3636 _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. 
Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately.
=======================================================================

From miquel.ramia at uab.cat Tue May 21 11:08:18 2013
From: miquel.ramia at uab.cat (Miquel Ràmia)
Date: Tue, 21 May 2013 17:08:18 +0200
Subject: [Bioperl-l] Problem compiling Bio::DB::Sam
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu>
References: <5194B815.2010401@uab.cat> <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu> <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu>
Message-ID: <519B8DE2.2070308@uab.cat>

On 17/05/13 06:16, Fields, Christopher J wrote:
> For the record, this is now fixed in the latest Bio::Samtools (via Lincoln).
>
> chris
>
>

Compiled correctly! thank you

--
Miquel Ràmia Jesús
PhD. candidate (PIF)
Evolutionary Bioinformatics Group
(Genomics, Bioinformatics and Evolution Group)
Lab MRB/014 - 93 586 89 58
MRB - Institut de Biologia i Biomedicina (IBB)
Universitat Autònoma de Barcelona (UAB)
08193, Cerdanyola del Vallès
Barcelona (Spain)

From b.l.cohen_home at btinternet.com Mon May 20 14:49:50 2013
From: b.l.cohen_home at btinternet.com (Bernard Cohen)
Date: Mon, 20 May 2013 19:49:50 +0100 (BST)
Subject: [Bioperl-l] Phylip format error
Message-ID: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>

Hello!

I happen to have checked to see what the PERL webpage says about Phylip format for DNA alignment files and see that it is erroneous.

I am not a PERL user and do not want to be bothered to register or otherwise learn how to make an official comment, so forward this for someone to pick up.

Phylip format allows up to 10 spaces for taxon names; the data must start in the 11th space.
This can be checked on Joe Felsenstein's site.

The PERL page accessed by searching for "Phylip format PERL" allows only 8 spaces for the name.

B. L. Cohen

From senanu.pearson at gmail.com Wed May 22 16:15:24 2013
From: senanu.pearson at gmail.com (Senanu)
Date: Wed, 22 May 2013 13:15:24 -0700
Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment
Message-ID: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com>

Hi all,

I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong.

I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus. (I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments.

Is this a known problem? Is there another way to generate such a consensus?

my $in = Bio::AlignIO->new(-file   => $files[0],
                           -format => 'XMFA');
while (my $aln = $in->next_aln()) {
    foreach my $seq ($aln->each_seq) {
        $seq->alphabet('dna');
    }
    my $con = $aln->consensus_iupac();
}

Thanks in advance.
Ngwenyama

From cjfields at illinois.edu Wed May 22 19:17:50 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Wed, 22 May 2013 23:17:50 +0000
Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment
In-Reply-To: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com>
References: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E2A01D@CHIMBX5.ad.uillinois.edu>

On May 22, 2013, at 3:15 PM, Senanu wrote:

> Hi all,
>
> I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong.

Probably the former, but...

> I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus.
> (I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments.

It shouldn't take that long, 7 Mb isn't that large. Or is that 7 Mb for one genome?

> Is this a known problem? Is there another way to generate such a consensus?

The code isn't really optimized for this, but again this isn't terribly large. Is the bottleneck reading the alignment in, or is it the consensus_iupac() step? Hard to say w/o seeing the alignment data itself.

> my $in = Bio::AlignIO->new(-file => $files[0],
>                            -format => 'XMFA');
> while (my $aln = $in->next_aln()) {
>     foreach my $seq ($aln->each_seq) {
>         $seq->alphabet('dna');
>     }
>     my $con = $aln->consensus_iupac();
> }
>
>
> Thanks in advance.
> Ngwenyama
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

chris

From alexeymorozov1991 at gmail.com Thu May 23 03:22:13 2013
From: alexeymorozov1991 at gmail.com (Alexey Morozov)
Date: Thu, 23 May 2013 16:22:13 +0900
Subject: [Bioperl-l] Phylip format error
In-Reply-To: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
Message-ID:

Which is also worsened by the fact that there is relaxed phylip format,
which allows up to 250 chars for taxon name. They are separated from a
sequence by single space, which creates problems if names were extended to
10 chars in strict Felsenstein's format by whitespaces. On the whole,
phylip is as messily defined format as one can make from a plain textfile
with information content of fasta.
Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed
phylip and how does it tell dialects from one another.
Even if code support is OK, it may be worthwhile to explain it somewhere at bioperl.org

2013/5/21 Bernard Cohen

> Hello!
>
> I happen to have checked to see what the PERL webpage says about Phylip
> format for DNA alignment files and see that it is erroneous.
>
> I am not a PERL user and do not want to be bothered to register or
> otherwise learn how to make an official comment, so forward this for
> someone to pick up.
>
> Phylip format allows up to 10 spaces for taxon names; the data must start
> in the 11th space. This can be checked on Jo Felsenstein's site.
>
> The PERL page accessed by searching for "Phylip format PERL" allows only 8
> spaces for the name.
>
> B. L. Cohen
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
Alexey Morozov,
LIN SB RAS, bioinformatics group.
Irkutsk, Russia.

From p.j.a.cock at googlemail.com Thu May 23 04:30:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 May 2013 09:30:21 +0100
Subject: [Bioperl-l] Phylip format error
In-Reply-To:
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
Message-ID:

On Thu, May 23, 2013 at 8:22 AM, Alexey Morozov wrote:
> Which is also worsened by the fact that there is relaxed phylip format,
> which allows up to 250 chars for taxon name. They are separated from a
> sequence by single space, which creates problems if names were extended to
> 10 chars in strict Felsenstein's format by whitespaces. On the whole,
> phylip is as messily defined format as one can make from a plain textfile
> with information content of fasta.
> Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed
> phylip and how does it tell dialects from one another.
> Even if code support
> is OK, it may be worthwhile to explain it somewhere at bioperl.org

Biopython's AlignIO defines both a (strict) "phylip" and "relaxed-phylip"
as two separate formats (or variants, like the "fastq" variants). Doing
the same in BioPerl would seem sensible since auto-detection is not
easy.

http://biopython.org/wiki/AlignIO#File_Formats

Peter

P.S. Where does that 250 characters for the taxon name limit come from?
The trouble with relaxed phylip is that some tools are more relaxed than
others ;)

From awitney at sgul.ac.uk Thu May 23 04:43:15 2013
From: awitney at sgul.ac.uk (Adam Witney)
Date: Thu, 23 May 2013 09:43:15 +0100
Subject: [Bioperl-l] Phylip format error
In-Reply-To:
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
Message-ID: <519DD6A3.8090304@sgul.ac.uk>

Not sure if there is an actual question in these messages, but BioPerl can be used to generate valid Phylip format and run, like this:

use Bio::SimpleAlign;
use Bio::AlignIO;
use Bio::Tools::Run::Phylo::Phylip::Pars;

## Build Align object
my $aln = Bio::SimpleAlign->new(-seqs=>$seqs);

## swap the taxa names with 8 characters long unique IDs
my ($aln_safe, $ref_name) = $aln->set_displayname_safe(8);

## Write out phylip format infile
Bio::AlignIO->new(-file=>'>infile.out', -format=>'phylip', -interleaved => 0)->write_aln($aln);

## run PHYLIP's pars program
my @params = (idlength=>10); #, jumble=>"17,10");
my $tree_factory = Bio::Tools::Run::Phylo::Phylip::Pars->new(@params);
$tree_factory->quiet(1); # Suppress pars messages to terminal
my $tree = $tree_factory->create_tree($aln_safe);

## fix the node labels back
my @nodes = sort { defined $a->id && defined $b->id && $a->id cmp $b->id } $tree->get_nodes();
foreach my $nd (@nodes) {
    if ( $nd->is_Leaf ) {
        $nd->id($ref_name->{$nd->id_output})
    }
}

HTH

Adam

On 23/05/2013 08:22, Alexey Morozov wrote:
> Which is also worsened by the fact that there is relaxed phylip format,
> which allows up to 250 chars for taxon name.
They are separated from a > sequence by single space, which creates problems if names were extended to > 10 chars in strict Felsenstein's format by whitespaces. On the whole, > phylip is as messily defined format as one can make from a plain textfile > with information content of fasta. > Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed > phylip and how does it tell dialects from one another. Even if code support > is OK, it may be worthwile to explain it somewhere at bioperl.org > > > 2013/5/21 Bernard Cohen > >> Hello! >> >> I happen to have checked to see what the PERL webpage says about Phylip >> format for DNA alignment files and see that it is erroneous. >> >> I am not a PERL user and do not want to be bothered to register or >> otherwise learn how to make an official comment, so forward this for >> someone to pick up. >> >> Phylip format allows up to 10 spaces for taxon names; the data must start >> in the 11th space. This can be checked on Jo Felsenstein's site. >> >> The PERL page accessed by searching for "Phylip format PERL" allows only 8 >> spaces for the name. >> >> B. L. Cohen >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > From cjfields at illinois.edu Thu May 23 09:48:31 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 23 May 2013 13:48:31 +0000 Subject: [Bioperl-l] Phylip format error In-Reply-To: References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E323A4@CITESMBX5.ad.uillinois.edu> Alexey, Just want to point out that 'relaxed phylip' format was introduced long after this parser was created; in fact (as Adam points out) there was an alternative workaround to deal with the lossy names. The content of that page is on a wiki, which anyone is free to edit (just need an OpenID to set up an account). 
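
A rough sketch of the name-shortening workaround discussed in this thread, in plain Perl: long taxon names are swapped for short unique IDs before writing strict PHYLIP, and swapped back in the resulting tree. The helper names (make_safe_names, restore_names) and the ID scheme are made up for illustration; this is not how Bio::SimpleAlign's set_displayname_safe is implemented, only the general idea, and it assumes no original name already looks like one of the placeholder IDs.

```perl
use strict;
use warnings;

# Hypothetical helper: map each (possibly long) taxon name to a short,
# unique placeholder ID of fixed width, and remember the reverse mapping.
sub make_safe_names {
    my ($names, $width) = @_;
    $width ||= 8;
    my (%to_safe, %to_orig);
    my $n = 0;
    for my $name (@$names) {
        next if exists $to_safe{$name};    # name already has an ID
        # 'S' followed by a zero-padded counter, e.g. S0000000 for width 8
        my $id = 'S' . sprintf('%0' . ($width - 1) . 'd', $n++);
        $to_safe{$name} = $id;
        $to_orig{$id}   = $name;
    }
    return (\%to_safe, \%to_orig);
}

# Hypothetical helper: put the original names back into a Newick string.
sub restore_names {
    my ($newick, $to_orig) = @_;
    $newick =~ s/\b(S\d+)\b/exists $to_orig->{$1} ? $to_orig->{$1} : $1/ge;
    return $newick;
}

my @taxa = ('Escherichia_coli_K12_MG1655', 'Salmonella_enterica_LT2');
my ($to_safe, $to_orig) = make_safe_names(\@taxa, 8);

# Pretend a tree program returned this tree using the short IDs.
my $tree = '(' . $to_safe->{$taxa[0]} . ':0.1,' . $to_safe->{$taxa[1]} . ':0.2);';
print restore_names($tree, $to_orig), "\n";
```

The point of the design is that the file handed to PHYLIP never contains a name longer than the fixed field, so the original names can be arbitrarily long as long as the mapping is kept around.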
chris On May 23, 2013, at 2:22 AM, Alexey Morozov wrote: > Which is also worsened by the fact that there is relaxed phylip format, > which allows up to 250 chars for taxon name. They are separated from a > sequence by single space, which creates problems if names were extended to > 10 chars in strict Felsenstein's format by whitespaces. On the whole, > phylip is as messily defined format as one can make from a plain textfile > with information content of fasta. > Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed > phylip and how does it tell dialects from one another. Even if code support > is OK, it may be worthwile to explain it somewhere at bioperl.org > > > 2013/5/21 Bernard Cohen > >> Hello! >> >> I happen to have checked to see what the PERL webpage says about Phylip >> format for DNA alignment files and see that it is erroneous. >> >> I am not a PERL user and do not want to be bothered to register or >> otherwise learn how to make an official comment, so forward this for >> someone to pick up. >> >> Phylip format allows up to 10 spaces for taxon names; the data must start >> in the 11th space. This can be checked on Jo Felsenstein's site. >> >> The PERL page accessed by searching for "Phylip format PERL" allows only 8 >> spaces for the name. >> >> B. L. Cohen >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > > -- > Alexey Morozov, > LIN SB RAS, bioinformatics group. > Irkutsk, Russia. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu May 23 10:05:32 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 23 May 2013 14:05:32 +0000 Subject: [Bioperl-l] Phylip format error In-Reply-To: References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu> On May 23, 2013, at 3:30 AM, Peter Cock wrote: > On Thu, May 23, 2013 at 8:22 AM, Alexey Morozov > wrote: >> Which is also worsened by the fact that there is relaxed phylip format, >> which allows up to 250 chars for taxon name. They are separated from a >> sequence by single space, which creates problems if names were extended to >> 10 chars in strict Felsenstein's format by whitespaces. On the whole, >> phylip is as messily defined format as one can make from a plain textfile >> with information content of fasta. >> Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed >> phylip and how does it tell dialects from one another. Even if code support >> is OK, it may be worthwile to explain it somewhere at bioperl.org > > Biopython's AlignIO defines both a (strict) "phylip" and "relaxed-phylip" > as two separate formats (or variants, like the "fastq" variants). Doing > the same in BioPerl would seem sensible since auto-detection is not > easy. > > http://biopython.org/wiki/AlignIO#File_Formats > > Peter > > P.S. Where does that 250 characters for the taxon name limit come from? > The trouble with relaxed phylip is that some tools are more relaxed than > others ;) As Adam pointed out, prior to the introduction of 'relaxed phylip' we had an alternative solution that didn't require a modified format but still allowed one to use PHYLIP and other tools requesting the format. 
I think 'relaxed phylip' was introduced by CIPRES a few years back. Frankly, this is the first time I have seen this mentioned on the list; yay, yet another format variation :)

The variant format parsing (as implemented for SeqIO::fastq, as you know) deals with variant names like 'fastq-sanger', where the main format name is first, the variant of the format second. The order in this case is reversed (relaxed-phylip), which I'm pretty sure will not work. Not impossible to allow, but we would probably allow support like this initially:

my $in = Bio::AlignIO->new(-format  => 'phylip',
                           -variant => 'relaxed',
                           ...);

chris

From cjfields at illinois.edu Thu May 23 09:56:32 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 23 May 2013 13:56:32 +0000
Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment
In-Reply-To: <92CD2F9D-7C8E-4941-9354-4BD80D8C2CB6@gmail.com>
References: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com> <118F034CF4C3EF48A96F86CE585B94BF74E2A01D@CHIMBX5.ad.uillinois.edu> <92CD2F9D-7C8E-4941-9354-4BD80D8C2CB6@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E32489@CITESMBX5.ad.uillinois.edu>

(keep the list cc'd)

On May 22, 2013, at 6:31 PM, Senanu wrote:

> On May 22, 2013, at 4:17 PM, Fields, Christopher J wrote:
>
>> Hi all,
>>>
>>> I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong.
>>
>> Probably the former, but...
>>
>>> I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus. (I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments.
>>
>> It shouldn't take that long, 7 Mb isn't that large. Or is that 7 Mb for one genome?
>
> It is 7Mb per genome, but there are only 2 genomes in the alignment, and the sequences are very similar to one another.
>
>>
>>> Is this a known problem? Is there another way to generate such a consensus?
>>
>> The code isn't really optimized for this, but again this isn't terribly large. Is the bottleneck reading the alignment in, or is it the consensus_iupac() step? Hard to say w/o seeing the alignment data itself.
>
> The bottleneck is definitely with the consensus_iupac step. Reading the alignment in takes a few seconds.

That's interesting, but again not surprising. One would have to look at the code, but I wouldn't be surprised if the method is terribly inefficient.

chris

From p.j.a.cock at googlemail.com Thu May 23 10:53:09 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 May 2013 15:53:09 +0100
Subject: [Bioperl-l] Phylip format error
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu>
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu>
Message-ID:

On Thu, May 23, 2013 at 3:05 PM, Fields, Christopher J wrote:
>
> I think 'relaxed phylip' was introduced by CIPRES a few years back.
> Frankly, this is the first time I have seen this mentioned on the list; yay,
> yet another format variation :)

The relaxed phylip 'format' goes back further than that, e.g.
http://lists.open-bio.org/pipermail/biopython-dev/2008-June/003899.html

RAxML and PHYML support relaxed phylip - but with their own ID limits.
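
To make the difference between the two dialects concrete, here is a small plain-Perl sketch of how a single sequence line would be split under each reading. It assumes the strict layout described earlier in the thread (name in the first 10 columns, data from column 11) and a relaxed layout in which the name simply ends at the first whitespace. The function names are invented for this example and are not BioPerl API; real tools may apply their own name-length limits.

```perl
use strict;
use warnings;

# Strict PHYLIP: the name occupies exactly the first 10 columns
# (padded with spaces); sequence data starts in column 11.
sub split_strict {
    my ($line) = @_;
    my $name = substr($line, 0, 10);
    my $seq  = length($line) > 10 ? substr($line, 10) : '';
    $name =~ s/\s+$//;    # drop the padding
    $seq  =~ s/\s+//g;    # data is often printed in blocks of 10
    return ($name, $seq);
}

# Relaxed PHYLIP: the name runs up to the first whitespace, so it may be
# longer than 10 characters but cannot itself contain spaces.
sub split_relaxed {
    my ($line) = @_;
    my ($name, $seq) = $line =~ /^(\S+)\s+(.*)$/
        or return;
    $seq =~ s/\s+//g;
    return ($name, $seq);
}

# A name with an internal space, padded to 10 columns: fine as strict
# PHYLIP, but a relaxed parser cuts the name at the first space.
my $line = 'Homo sap  ACGTACGTAC GTACGTACGT';
printf "strict : name=[%s] seq=[%s]\n", split_strict($line);
printf "relaxed: name=[%s] seq=[%s]\n", split_relaxed($line);
```

Run on the same line, the two readers disagree about where the name ends, which is why guessing the dialect from the file alone is unreliable and an explicit format or variant switch is the safer choice.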
Peter

From Ben.Ward at sainsbury-laboratory.ac.uk Fri May 24 15:14:17 2013
From: Ben.Ward at sainsbury-laboratory.ac.uk (Ben Ward (TSL))
Date: Fri, 24 May 2013 19:14:17 +0000
Subject: [Bioperl-l] Building Many trees with Bioperl
Message-ID:

Hi,

I'm new to Bioperl and plan to make a script to automate making trees with many alignment files (themselves generated by automating the process of multiple alignment for many datasets by using clustalw in a bioperl script).
But I need to get it right for one piece of test data before I can do it for all:

What I have produced so far is the below. It's supposed to load in the alignment file as a SimpleAlign. Then use that alignment in phyml. I looked at the documentation and tried to follow examples.

However it gives me many errors like:
--------------------- WARNING ---------------------
MSG: Replacing one sequence [FXCNDTJ02P/1-366]

And then gives me:
Phyml command = /Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Phyml call (/Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output [/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip_phyml_stat.txt]: 11
STACK: Error::throw
STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472
STACK: Bio::Tools::Run::Phylo::Phyml::_run /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851
STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338
-----------------------------------------------------------

Can someone let me know if I'm going about this correctly and what I need to do to get it to work.
I've also tried to run phyml by giving the filename in the run() method like:
my $inputfilename = 'outputalignmentfile';
my $tree = $factory->run("$inputfilename");

But I get the same EXCEPTION: Bio::Root::Exception message.

Thanks,
Ben W.

SCRIPT ---

#!/usr/bin/perl
use warnings;
use strict;
BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' }
use Bio::TreeIO;
use Bio::AlignIO;
use Bio::Tools::Run::Phylo::Phyml;

my $alnin = Bio::AlignIO->new(-file   => "<...",
                              -format => 'phylip');

my $aln = $alnin->next_aln();


# Make a Phyml factory.
my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose   => 2,
                                                 -data_type => 'dna');

# Pass the factory an alignment and run:
# my $inputfilename = 'outputalignmentfile';

my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object.


# Setup tree output stream...
my $treeio = Bio::TreeIO->new(-format => 'newick',
                              -file   => 'tree.newick');

$treeio->write_tree($tree);

exit 0;

From bosborne11 at verizon.net Fri May 24 17:25:38 2013
From: bosborne11 at verizon.net (Brian Osborne)
Date: Fri, 24 May 2013 17:25:38 -0400
Subject: [Bioperl-l] Building Many trees with Bioperl
In-Reply-To:
References:
Message-ID: <93D0963E-8399-468C-A32B-875FAA5054D1@verizon.net>

Ben,

What happens when you take the Phyml command itself and run it from the command line?

Also, a minor point: the message "MSG: Replacing one sequence [FXCNDTJ02P/1-366]" is not an error, it is a warning. An error accompanies an exit, warnings are just informative.

Brian O.

On May 24, 2013, at 3:14 PM, Ben Ward (TSL) wrote:

> Hi,
>
> I'm new to Bioperl and plan to make a script to automate making trees with many alignment files (themselves generated by automating the process of multiple alignment for many datasets by using clustalw in a bioperl script).
> But I need to get it right for one piece of test data before I can do it for all:
>
> What I have produced so far is the below. It's supposed to load in the alignment file as a SimpleAlign. Then use that alignment in phyml.
I looked at the documentation and tried to follow examples. > > However it gives me many errors like: > --------------------- WARNING --------------------- > MSG: Replacing one sequence [FXCNDTJ02P/1-366] > > And then gives me: > Phyml command = /Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Phyml call (/Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output [/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip_phyml_stat.txt]: 11 > STACK: Error::throw > STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472 > STACK: Bio::Tools::Run::Phylo::Phyml::_run /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851 > STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338 > ----------------------------------------------------------- > > Can someone let me know if I'm going about this correctly and what I need to do to get it to work. I've also tried to run phyml by giving the filename in the run() method like: > my $inputfilename = 'outputalignmentfile'; > my $tree = $factory->run("$inputfilename"); > > But I get the same EXCEPTION: Bio::Root::Exception message. > > Thanks, > Ben W. > > SCRIPT --- > > #!/usr/bin/perl > use warnings; > use strict; > BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' } > use Bio::TreeIO; > use Bio::AlignIO; > use Bio::Tools::Run::Phylo::Phyml; > > my $alnin = Bio::AlignIO->new(-file => " -format => 'phylip'); > > my $aln = $alnin->next_aln(); > > > # Make a Phyml factory. > my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose => 2, > -data_type => 'dna'); > > # Pass the factory an alignment and run: > # my $inputfilename = 'outputalignmentfile'; > > my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object. 
>
>
> # Setup tree output stream...
> my $treeio = Bio::TreeIO->new(-format => 'newick',
>                               -file => 'tree.newick');
>
> $treeio->write_tree($tree);
>
> exit 0;
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From Ben.Ward at sainsbury-laboratory.ac.uk Fri May 24 17:46:40 2013
From: Ben.Ward at sainsbury-laboratory.ac.uk (Ben Ward (TSL))
Date: Fri, 24 May 2013 21:46:40 +0000
Subject: [Bioperl-l] Building Many trees with Bioperl
In-Reply-To: <93D0963E-8399-468C-A32B-875FAA5054D1@verizon.net>
Message-ID:

Hi,

If I just run phyml on the command line it seems to run ok - it accepts my file and appears to undergo the tree building process - I haven't actually seen it complete yet, but ML and Bayes always does take a while - and I have many reads that need to be aligned.

But PhyML gets to the point it asks - are you sure you want to proceed - I say yes, then it keeps quiet and is currently working along to itself:

. 766 patterns found (out of a total of 795 sites).
. 58 sites without polymorphism (7.30%).
. Computing pairwise distances...
. Building BioNJ tree...
. WARNING: this analysis requires at least 556 MB of memory space.
. Do you really want to proceed? [Y/n] Y

It appears to be working =/

Best,
Ben.

On 24/05/2013 22:25, "Brian Osborne" wrote:

>Ben,
>
>What happens when you take the Phyml command itself and run it from the
>command line?
>
>Also, a minor point: the message "MSG: Replacing one sequence
>[FXCNDTJ02P/1-366]" is not an error, it is a warning. An error
>accompanies an exit, warnings are just informative.
>
>Brian O.
>
>
>On May 24, 2013, at 3:14 PM, Ben Ward (TSL)
> wrote:
>
>> Hi,
>>
>> I'm new to Bioperl and plan to make a script to automate making trees
>>with many alignment files (themselves generated by automating the
>>process of multiple alignment for many datasets by using clustalw in a
>>bioperl script).
>> But I need to get it right for one pice of test data before I can do it >>for all: >> >> What I have produced so far is the below. It's supposed to load in the >>alignment file as as SimpleAlign. Then use that alignment in phyml. I >>looked at the documentation and tried to follow examples. >> >> However it gives me many errors like: >> --------------------- WARNING --------------------- >> MSG: Replacing one sequence [FXCNDTJ02P/1-366] >> >> And then gives me: >> Phyml command = /Users/wardb/clustalw2/phyml >>/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyli >>p 0 i 1 0 HKY e e 1 e BIONJ y y >> >> ------------- EXCEPTION: Bio::Root::Exception ------------- >> MSG: Phyml call (/Users/wardb/clustalw2/phyml >>/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyli >>p 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output >>[/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyl >>ip_phyml_stat.txt]: 11 >> STACK: Error::throw >> STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472 >> STACK: Bio::Tools::Run::Phylo::Phyml::_run >>/Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851 >> STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338 >> ----------------------------------------------------------- >> >> Can someone let me know if I'm going about this correctly and what I >>need to do to get it to work. I've also tried to run phyml by giving the >>filename in the run() method like: >> my $inputfilename = 'outputalignmentfile'; >> my $tree = $factory->run("$inputfilename"); >> >> But I get the same EXCEPTION: Bio::Root::Exception message. >> >> Thanks, >> Ben W. 
>>
>> SCRIPT ---
>>
>> #!/usr/bin/perl
>> use warnings;
>> use strict;
>> BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' }
>> use Bio::TreeIO;
>> use Bio::AlignIO;
>> use Bio::Tools::Run::Phylo::Phyml;
>>
>> my $alnin = Bio::AlignIO->new(-file => "<...",
>> -format => 'phylip');
>>
>> my $aln = $alnin->next_aln();
>>
>>
>> # Make a Phyml factory.
>> my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose => 2,
>> -data_type => 'dna');
>>
>> # Pass the factory an alignment and run:
>> # my $inputfilename = 'outputalignmentfile';
>>
>> my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object.
>>
>>
>> # Setup tree output stream...
>> my $treeio = Bio::TreeIO->new(-format => 'newick',
>> -file => 'tree.newick');
>>
>> $treeio->write_tree($tree);
>>
>> exit 0;
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From himaghna.bhattacharjee at gmail.com Tue May 28 11:59:02 2013
From: himaghna.bhattacharjee at gmail.com (Himaghna Bhattacharjee)
Date: Tue, 28 May 2013 21:29:02 +0530
Subject: [Bioperl-l] error in the link to install Kobe repository for windows
Message-ID:

Hey,

the link to install Kobe's repository for Perl 5.10 (http://cpan.uwinnipeg.ca/PPMPackages/10xx/) seems to be broken as it shows Error 503 Service Temporarily Unavailable.
Could you please suggest an alternative?

Thanks.

Himaghna Bhattacharjee
3rd year B.E.(Hons.) Chemical Engineering
Birla Institute of Technology and Science, Pilani
Rajasthan 333 031

From wgallin at ualberta.ca Tue May 28 13:49:02 2013
From: wgallin at ualberta.ca (Warren Gallin)
Date: Tue, 28 May 2013 11:49:02 -0600
Subject: [Bioperl-l] ReplacedBy value in esummary
Message-ID: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca>

Hi,

I just encountered a glitch when I was trying to update some entries in a database by finding updated GENBANK protein entries.
The original record was gi 118091304 which has been replaced by gi 363734282

I retrieved an ESummary of the record for gi 118091304 as a Bio::Tools::EUtilities::Summary::DocSum object (called $ds).

When I then tried to retrieve the gi number for the replacement by using:

my $replaced = $ds->get_contents_by_name('ReplacedBy');

the returned value was 1, and when I dumped the ESummary record the relevant pair is ReplacedBy :XP_421022.3.

The full Esummary dump is:

UID        :118091304
Caption    :XP_421022
Title      :PREDICTED: similar to Potassium voltage-gated channel, subfamily Q, member
           :1 [Gallus gallus]
Extra      :gi|118091304|ref|XP_421022.2|[118091304]
Gi         :118091304
CreateDate :2004/07/28
UpdateDate :2006/11/16
Flags      :512
TaxId      :9031
Length     :643
Status     :replaced
ReplacedBy :XP_421022.3
Comment    : This record was replaced or removed.

So two questions:

1) Is the ReplacedBy value always supposed to be the new accession number version rather than the new gi number?

2) Why would I be getting a returned value of 1 instead of the accession number that is in the summary record?

Any advice appreciated.

Warren Gallin

From cjfields at illinois.edu Tue May 28 14:31:30 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 28 May 2013 18:31:30 +0000
Subject: [Bioperl-l] ReplacedBy value in esummary
In-Reply-To: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca>
References: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E44377@CHIMBX5.ad.uillinois.edu>

On May 28, 2013, at 12:49 PM, Warren Gallin wrote:

> Hi,
>
> I just encountered a glitch when I was trying to update some entries in a database by finding updated GENBANK protein entries.
>
> The original record was gi 118091304 which has been replaced by gi 363734282
>
> I retrieved an ESummary of the record for gi 118091304 as a Bio::Tools::EUtilities::Summary::DocSum object (called $ds).
>
> When I then tried to retrieve the gi number for the replacement by using:
>
> my $replaced = $ds->get_contents_by_name('ReplacedBy');
>
> the returned value was 1, and when I dumped the ESummary record the relevant pair is ReplacedBy :XP_421022.3.
>
> The full Esummary dump is:
>
> UID :118091304
> Caption :XP_421022
> Title :PREDICTED: similar to Potassium voltage-gated channel, subfamily Q, member
> :1 [Gallus gallus]
> Extra :gi|118091304|ref|XP_421022.2|[118091304]
> Gi :118091304
> CreateDate :2004/07/28
> UpdateDate :2006/11/16
> Flags :512
> TaxId :9031
> Length :643
> Status :replaced
> ReplacedBy :XP_421022.3
> Comment : This record was replaced or removed.
>
>
> So two questions:
>
> 1) Is the ReplacedBy value always supposed to be the new accession number version rather than the new gi number?

No idea, the best people to answer that would be NCBI (the idea of these modules was to simplify getting at that data instead of munging the XML, but whatever they report is mainly from NCBI, not bioperl).

> 2) Why would I be getting a returned value of 1 instead of the accession number that is in the summary record?

The text dump above indicates the values do exist. However, you are calling a method that returns a list (note the plural in the name) in scalar context, so you get the number of values. If you always expect a single value, use:

my ($replaced) = $ds->get_contents_by_name('ReplacedBy');

which forces list context. That should fix it.

chris

> Any advice appreciated.
>
> Warren Gallin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From armendarez77 at hotmail.com Tue May 7 20:32:22 2013
From: armendarez77 at hotmail.com (Veronica A.)
Date: Tue, 7 May 2013 17:32:22 -0700
Subject: [Bioperl-l] Bio::SeqIO doesn't write all gbk sequences from Bio::DB::GenBank
Message-ID:

Hello,

I'm currently running Bio::Perl 1.6.1 on Ubuntu 12.04.2 LTS and have noticed a problem with Bio::SeqIO when writing genbank files using the write_seq() function; some of the files do not include an 'ORIGIN' tag or the sequence.

I am using GI#s (50 at a time every 2 minutes) to retrieve genbank files via Bio::DB::GenBank.
----------------------------------------START CODE---------------------------------- my $gb = Bio::DB::GenBank->new(-verbose=>-1);my $seqout = Bio::SeqIO->new(-file=>">$fileName", '-format'=>'Genbank', -alphabet=>'dna', -flush=>0, -verbose=>-1);while(@ids){ my @batchArray = splice(@ids, 0, 50); my $batchArrayRef = \@batchArray; my $streamObj; my $pid = fork(); if($pid == 0){ eval{ $streamObj = $gb->get_Stream_by_id($batchArrayRef); }; if($@){ print "Error: ".$@."\n"; } else{ while(my $seqObj = $streamObj->next_seq()){ unless($seqObj->accession_number() =~ /N[A-Z]\_/){ #print "ID: ".$seqObj->id()."\n"; #print "Seq:\n".$seqObj->seq()."\n"; $seqout->write_seq($seqObj); } } } exit 0; }}waitpid($pid,0);sleep(120); ----------------------------------------END CODE---------------------------------- Most of the Genbank files written to the output file have sequences, but there is a small portion that do not, even though they should. For example, JX287367, in NCBI includes an 'ORIGIN' tag and sequence and when I use the print function before writing to file, the sequence is printed to STDOUT, but the 'ORIGIN' tag and sequence are not written to the output gbk file. The following is found in the final output file: ----------------------------------START GBK----------------------------------------- LOCUS JX287367 588 bp DNA linear BCT 19-DEC-2012DEFINITION Chlamydia trachomatis strain UW-5/CX pyruvoyl-dependent arginine decarboxylase (aaxB) gene, complete cds.ACCESSION JX287367VERSION JX287367.1 GI:404351720KEYWORDS .SOURCE Chlamydia trachomatis ORGANISM Chlamydia trachomatis Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae; Chlamydia/Chlamydophila group; Chlamydia.REFERENCE 1 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Characterization of the activity and expression of arginine decarboxylase in human and animal Chlamydia pathogens JOURNAL FEMS Microbiol. Lett. 
337 (2), 140-146 (2012) PUBMED 23043454REFERENCE 2 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Direct Submission JOURNAL Submitted (06-JUL-2012) Department of Microbiology and Immunology, F. Edward Hebert School of Medicine, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, MD 20814, USAFEATURES Location/Qualifiers source 1..588 /mol_type="genomic DNA" /db_xref="taxon:813" /strain="UW-5/CX" /organism="Chlamydia trachomatis" /serovar="E" gene 1..588 /gene="aaxB" CDS 1..588 /protein_id="AFR60849.1" /gene="aaxB" /transl_table=11 /note="AaxB" /db_xref="GI:404351721" /codon_start=1 /product="pyruvoyl-dependent arginine decarboxylase" /translation="MPYGTRYPTLAFHTGGVGESDDGMPPQPFETFCYDSALLQAKIE NFNIVPYTSVLPKELFGNILPVDQCTKFFKHGAVLEVIMAGRGATVTDGTQAIATGVG ICWGKDKNGELIGGWAAEYVEFFPTWIDDEIAESHAKMWLKKSLQHELDLRSVSKHSE FQYFHNYINIRKKFGFCLTALGFLNFENVAPAVIQ" // ----------------------------------END GBK----------------------------------------- Can anyone tell what I am missing or why this is happening? I don't know if this has happened in earlier BioPerl versions as up until now, I usually downloaded sequences straight from NCBI, but that became too time consuming....but this seems to be as well :S Thank you in advance for any help, Veronica From cjfields at illinois.edu Tue May 7 22:17:43 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 8 May 2013 02:17:43 +0000 Subject: [Bioperl-l] Bio::SeqIO doesn't write all gbk sequences from Bio::DB::GenBank In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E166A0@CHIMBX5.ad.uillinois.edu> Veronica, Your mail may have garbled the script and example file. Can you paste these in a gist? https://gist.github.com/ chris On May 7, 2013, at 7:32 PM, Veronica A. 
wrote: > Hello, > I'm currently running Bio::Perl 1.6.1 on Ubuntu 12.04.2 LTS and have noticed a problem with Bio::SeqIO when writing genbank files using the write_seq() function; some of the files do not include an 'ORIGIN' tag or the sequence. > > I am using GI#s (50 at a time every 2 minutes) to retrieve genbank files via Bio::DB::GenBank. > > ----------------------------------------START CODE---------------------------------- > my $gb = Bio::DB::GenBank->new(-verbose=>-1);my $seqout = Bio::SeqIO->new(-file=>">$fileName", '-format'=>'Genbank', -alphabet=>'dna', -flush=>0, -verbose=>-1);while(@ids){ my @batchArray = splice(@ids, 0, 50); my $batchArrayRef = \@batchArray; > my $streamObj; my $pid = fork(); if($pid == 0){ eval{ $streamObj = $gb->get_Stream_by_id($batchArrayRef); }; if($@){ print "Error: ".$@."\n"; } else{ while(my $seqObj = $streamObj->next_seq()){ unless($seqObj->accession_number() =~ /N[A-Z]\_/){ #print "ID: ".$seqObj->id()."\n"; #print "Seq:\n".$seqObj->seq()."\n"; $seqout->write_seq($seqObj); } } } exit 0; }}waitpid($pid,0);sleep(120); > ----------------------------------------END CODE---------------------------------- > Most of the Genbank files written to the output file have sequences, but there is a small portion that do not, even though they should. For example, JX287367, in NCBI includes an 'ORIGIN' tag and sequence and when I use the print function before writing to file, the sequence is printed to STDOUT, but the 'ORIGIN' tag and sequence are not written to the output gbk file. 
The following is found in the final output file: > ----------------------------------START GBK----------------------------------------- > LOCUS JX287367 588 bp DNA linear BCT 19-DEC-2012DEFINITION Chlamydia trachomatis strain UW-5/CX pyruvoyl-dependent arginine decarboxylase (aaxB) gene, complete cds.ACCESSION JX287367VERSION JX287367.1 GI:404351720KEYWORDS .SOURCE Chlamydia trachomatis ORGANISM Chlamydia trachomatis Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae; Chlamydia/Chlamydophila group; Chlamydia.REFERENCE 1 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Characterization of the activity and expression of arginine decarboxylase in human and animal Chlamydia pathogens JOURNAL FEMS Microbiol. Lett. 337 (2), 140-146 (2012) PUBMED 23043454REFERENCE 2 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Direct Submission JOURNAL Submitted (06-JUL-2012) Department of Microbiology and Immunology, F. Edward Hebert School of Med! > icine, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, MD 20814, USAFEATURES Location/Qualifiers source 1..588 /mol_type="genomic DNA" /db_xref="taxon:813" /strain="UW-5/CX" /organism="Chlamydia trachomatis" /serovar="E" gene 1..588 /gene="aaxB" CDS 1..588 /protein_id="AFR60849.1" /gene="aaxB" /transl_table=11 /note="AaxB" /db_xref="GI:404351721" /codon_start=1 /product="pyruvoyl-dependent arginine decarboxylase" /translation="MPYGTRYPTLAFHTGGVGESDDGMPPQPFETFCYDSALLQAKIE NFNIVPYTSVLPKELFGNILPVDQCTKFFKHGAVLEVIMAGRGATVTDGTQAIATGVG ICWGKDKNGELIGGW! > AAEYVEFFPTWIDDEIAESHAKMWLKKSLQHELDLRSVSKHSE FQYFHN > YINIRKKFGFCLTALGFLNFENVAPAVIQ" > // > ----------------------------------END GBK----------------------------------------- > Can anyone tell what I am missing or why this is happening? 
I don't know if this has happened in earlier BioPerl versions as up until now, I usually downloaded sequences straight from NCBI, but that became too time consuming....but this seems to be as well :S
> Thank you in advance for any help,
> Veronica
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From witch.of.agnessi at gmail.com Wed May 8 15:24:53 2013
From: witch.of.agnessi at gmail.com (WoA)
Date: Wed, 8 May 2013 12:24:53 -0700 (PDT)
Subject: [Bioperl-l] Extracting matching subsequence from pairwise alignment
Message-ID: <1368041092972-16935.post@n3.nabble.com>

Hello All,

I have a pairwise global alignment of two DNA sequences generated by the NEEDLE program of the EMBOSS package. I wish to extract the sub-sequence that aligns to a given region of the other sequence. In this alignment (Pastebin Link) the given region (actually the CDS) falls between bases 24 and 485 of the original sequence with ID 'XM_001005073.' I wish to extract the sub-sequence of the sequence with ID 'Homolog' that aligns with that 24:485 region of the other sequence.

I'm using BioPerl to parse the alignment. I find the alignment column numbers corresponding to the 24:485 region in that sequence, using 'column_from_residue_number'. Then I extract the sub-sequence from the other 'aligned' sequence (containing gaps) using the corresponding column numbers. Finally I remove the gap characters.

Am I doing this correctly, and are there any pitfalls? Is there a better way to do it with (Bio)Perl/Python?
The code goes here:

use strict;
use warnings;
use Bio::AlignIO;

# read in an alignment generated by the EMBOSS program Needle
my $in = Bio::AlignIO->new(-format => 'emboss', -file => 'test_needle.aln');

while ( my $aln = $in->next_aln ) {
    # Seqnames: 'XM_001005073.' (CDS:24-485), 'Homolog'
    my ($cds_start, $cds_end) = (24, 485);
    my $col_cdsstart = $aln->column_from_residue_number('XM_001005073.', $cds_start);
    my $col_cdsend   = $aln->column_from_residue_number('XM_001005073.', $cds_end);

    foreach my $seq ($aln->each_seq) {
        if ($seq->id() eq 'Homolog') {
            my $homolog_cds = $seq->subseq($col_cdsstart, $col_cdsend);
            $homolog_cds =~ s/\-//g;
            print $homolog_cds, "\n";
        }
    }
}

--
View this message in context: http://bioperl.996286.n3.nabble.com/Extracting-matching-subsequence-from-pairwise-alignment-tp16935.html
Sent from the Bioperl-L mailing list archive at Nabble.com.

From hlapp at drycafe.net Wed May 15 16:44:07 2013
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Wed, 15 May 2013 16:44:07 -0400
Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences
Message-ID:

FYI, if you haven't seen this yet:

http://wssspe.researchcomputing.org.uk/

It seems to me that the Bio* projects, perhaps led by BioPerl as the oldest and thus longest running (nowadays more fancily called "sustained") of them would have a lot to say about the subject. Anyone interested in a joint submission?

Also, I notice that Biojava's Andreas is on the organizing committee, so maybe he's been conspiring on something already :-)

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 203 bytes
Desc: Message signed with OpenPGP using GPGMail
URL:

From carandraug+dev at gmail.com Wed May 15 21:53:55 2013
From: carandraug+dev at gmail.com (Carnë Draug)
Date: Thu, 16 May 2013 02:53:55 +0100
Subject: [Bioperl-l] sets of sequences - how to read?
Message-ID:

Hi

when accessing Entrez Gene using eutils to get multiple genes, NCBI now returns an Entrezgene-Set[1] rather than a list of EntrezGene. This change must have happened sometime in the last 2 months. Compare:

use Bio::DB::EUtilities;

my %sets = (
    eutil   => 'efetch',
    db      => 'gene',
    retmode => 'text',
    rettype => 'asn1',
    email   => 'bioperl-l at lists.open-bio.org',
);

## this mimics the previous behaviour of the NCBI server but the
## multiple requests will annoy their servers
my @ids = (3014, 85235);
my $response;
foreach (@ids) {
    my $fetcher = Bio::DB::EUtilities->new(%sets, id => $_);
    $response .= $fetcher->get_Response->content;
}
## print the accumulated text ($fetcher is out of scope after the loop)
print $response;

## this used to be the right way to do it, but now returns an Entrezgene-Set
my $fetcher = Bio::DB::EUtilities->new(%sets, id => \@ids);
$response .= $fetcher->get_Response->content;
print $fetcher->get_Response->content;

There is no module to read these Entrezgene-Set in Perl at the moment, since Bio::ASN1::EntrezGene is not able to handle them. I have contacted the module author and sent him a fix[2] and he said he'll try to look into it next week.

However, even with the fix there is another problem. How would one access a set of sequences using the Bio::SeqIO API? There is no method to do that. One could say to ignore the sets, and make next_seq return the next sequence of the set. But then we are losing data. After all, it's perfectly viable to have multiple Entrezgene-Set in one file. What would be the right way to do this?

Carnë
[1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html [2] https://github.com/carandraug/bio-asn1-entrezgene/commit/69d505056d8b7897df6271ffb7a5f39d58873c6b From cjfields at illinois.edu Thu May 16 00:43:22 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 16 May 2013 04:43:22 +0000 Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu> Jason and I have discussed looking into opportunity's like this, I think it makes sense to try a joint submission. chris On May 15, 2013, at 3:44 PM, Hilmar Lapp wrote: > FYI, if you haven't seen this yet: > > http://wssspe.researchcomputing.org.uk/ > > It seems to me that the Bio* projects, perhaps led by BioPerl as the oldest and thus longest running (nowadays more fancily called "sustained") of them would have a lot to say about the subject. Anyone interested in a joint submission? 
> > Also, I notice that Biojava's Andreas is on the organizing committee, so maybe he's been conspiring on something already :-) > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > =========================================================== > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From p.j.a.cock at googlemail.com Thu May 16 05:10:25 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 16 May 2013 10:10:25 +0100 Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu> References: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu> Message-ID: On Thu, May 16, 2013 at 5:43 AM, Fields, Christopher J wrote: > Jason and I have discussed looking into opportunity's like this, I think it makes > sense to try a joint submission. > > chris This sounds like a good idea, although given the time and place I am unlikely to be able to attend in person: First Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE) (to held in conjunction with SC13, Sunday, 17 November 2013, Denver, CO, USA) http://wssspe.researchcomputing.org.uk/ Rather than trying to discuss this over four mailing lists should we switch to the cross project list open-bio-l, or continue off-list? 
http://lists.open-bio.org/mailman/listinfo/open-bio-l

Thanks,

Peter

From miquel.ramia at uab.cat Thu May 16 06:42:29 2013
From: miquel.ramia at uab.cat (Miquel Ràmia)
Date: Thu, 16 May 2013 12:42:29 +0200
Subject: [Bioperl-l] Problem compiling Bio::DB::Sam
Message-ID: <5194B815.2010401@uab.cat>

Hi all,

I get this message when compiling Bio::DB::Sam:

Building Bio-SamTools
gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy':
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join'
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt':
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create'
collect2: ld returned 1 exit status
make: *** [bam2bedgraph] Error 1

Is this error related to the module or some dependencies? or maybe a problem with my system?

Any help appreciated, thanks!

--
Miquel Ràmia Jesús
PhD. candidate (PIF)
Evolutionary Bioinformatics Group
(Genomics, Bioinformatics and Evolution Group)
Lab MRB/014 - 93 586 89 58
MRB - Institut de Biologia i Biomedicina (IBB)
Universitat Autònoma de Barcelona (UAB)
08193, Cerdanyola del Vallès
Barcelona (Spain)

From cjfields at illinois.edu Thu May 16 09:12:40 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 16 May 2013 13:12:40 +0000
Subject: [Bioperl-l] Problem compiling Bio::DB::Sam
In-Reply-To: <5194B815.2010401@uab.cat>
References: <5194B815.2010401@uab.cat>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu>

It may be due to the new samtools release (v 0.1.19). I know Heng Li has been working on the code over the last year for threading support (notice the undefined functions). Have you tried v 0.1.18?
chris On May 16, 2013, at 5:42 AM, Miquel R?mia wrote: > Hi all, > > I get this message when compiling Bio::DB::Sam: > > Building Bio-SamTools > > gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' > > collect2: ld returned 1 exit status > > make: *** [bam2bedgraph] Error 1 > > > Is this error related to the module or some dependencies? or maybe a problem with my system? > > Any help appreciated, thanks! > > > -- > Miquel R?mia Jes?s > PhD. candidate (PIF) > Evolutionary Bioinformatics Group > (Genomics, Bioinformatics and Evolution Group) > Lab MRB/014 - 93 586 89 58 > MRB - Institut de Biologia i Biomedicina (IBB) > Universitat Aut?noma de Barcelona (UAB) > 08193, Cerdanyola del Vall?s > Barcelona (Spain) > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu May 16 09:09:45 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 16 May 2013 13:09:45 +0000 Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1FBCF@CHIMBX5.ad.uillinois.edu> Yes, though we need to make sure others (e.g. those not subscribed to open-bio-l) are in the loop. November is a possibility for me. 
chris On May 16, 2013, at 4:10 AM, Peter Cock wrote: > On Thu, May 16, 2013 at 5:43 AM, Fields, Christopher J > wrote: >> Jason and I have discussed looking into opportunity's like this, I think it makes >> sense to try a joint submission. >> >> chris > > This sounds like a good idea, although given the time and place I am > unlikely to be able to attend in person: > > First Workshop on Sustainable Software for Science: Practice and > Experiences (WSSSPE) > (to held in conjunction with SC13, Sunday, 17 November 2013, Denver, CO, USA) > http://wssspe.researchcomputing.org.uk/ > > Rather than trying to discuss this over four mailing lists should we switch > to the cross project list open-bio-l, or continue off-list? > http://lists.open-bio.org/mailman/listinfo/open-bio-l > > Thanks, > > Peter From andreas at sdsc.edu Thu May 16 00:31:34 2013 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 15 May 2013 21:31:34 -0700 Subject: [Bioperl-l] [Biojava-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: References: Message-ID: Thanks Hilmar, you were faster than me in sending this out.. You are right, it would be very interesting to hear what some of the long running open-bio projects have to say on the topic of sustainability. Let me know if anybody is interested in a submission! Andreas On Wed, May 15, 2013 at 1:44 PM, Hilmar Lapp wrote: > FYI, if you haven't seen this yet: > > http://wssspe.researchcomputing.org.uk/ > > It seems to me that the Bio* projects, perhaps led by BioPerl as the > oldest and thus longest running (nowadays more fancily called "sustained") > of them would have a lot to say about the subject. Anyone interested in a > joint submission? 
> > Also, I notice that Biojava's Andreas is on the organizing committee, so > maybe he's been conspiring on something already :-) > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > =========================================================== > > > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From cjfields at illinois.edu Fri May 17 00:08:04 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 04:08:04 +0000 Subject: [Bioperl-l] sets of sequences - how to read? In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now. My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1. chris On May 15, 2013, at 8:53 PM, Carn? Draug wrote: > Hi > > when accessing entrez gene using eutils to get multiple genes, NCBI > now returns an Entrezgene-Set[1] rather than a list of EntrezGene. > This change must have happened sometime on the last 2 months. 
Compare: > > use Bio::DB::EUtilities; > > my %sets = ( > eutil => 'efetch', > db => 'gene', > retmode => 'text', > rettype => 'asn1', > email => 'bioperl-l at lists.open-bio.org', > ); > > ## this mimics the previous behaviour of the NCBI server but the > multiple requests will annoy their servers > my @ids = (3014, 85235); > my $response; > foreach (@ids) { > my $fetcher = Bio::DB::EUtilities->new(%sets, id => $_); > $response .= $fetcher->get_Response->content; > } > print $fetcher->get_Response->content; > > ## this used to be the right way to do it, but now returns an Entrezgene-Set > my $fetcher = Bio::DB::EUtilities->new(%sets, id => \@ids); > $response .= $fetcher->get_Response->content; > print $fetcher->get_Response->content; > > There is no module to read these Entrezgene-Set in Perl at the moment, > since Bio::ASN1::EntrezGene; is not able to handle them. I have > contacted the module author and set him a fix[2] and he said he'll try > to look into it next week. > > However, even with the fix there is another problem. How would one > access a set of sequences using the Bio::SeqIO API? There is no method > to do that. One could say, to ignore them, and make next_seq return > the next sequence of the set. But then we are losing data. After all, > it's perfectly viable to have multiple Entrezgene-Set in one file. > What would be the right way to do this? > > Carn? 
> > [1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html > [2] https://github.com/carandraug/bio-asn1-entrezgene/commit/69d505056d8b7897df6271ffb7a5f39d58873c6b > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Fri May 17 00:16:12 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 04:16:12 +0000 Subject: [Bioperl-l] Problem compiling Bio::DB::Sam In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu> References: <5194B815.2010401@uab.cat> <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu> For the record, this is now fixed in the latest Bio::Samtools (via Lincoln). chris On May 16, 2013, at 8:12 AM, "Fields, Christopher J" wrote: > It may be due to the new samtools release (v 0.1.19). I know Heng Li has been working on the code over the last year for threading support (notice the undefined functions). Have you tried v 0.1.18? 
> > chris > > On May 16, 2013, at 5:42 AM, Miquel R?mia wrote: > >> Hi all, >> >> I get this message when compiling Bio::DB::Sam: >> >> Building Bio-SamTools >> >> gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' >> >> collect2: ld returned 1 exit status >> >> make: *** [bam2bedgraph] Error 1 >> >> >> Is this error related to the module or some dependencies? or maybe a problem with my system? >> >> Any help appreciated, thanks! >> >> >> -- >> Miquel R?mia Jes?s >> PhD. candidate (PIF) >> Evolutionary Bioinformatics Group >> (Genomics, Bioinformatics and Evolution Group) >> Lab MRB/014 - 93 586 89 58 >> MRB - Institut de Biologia i Biomedicina (IBB) >> Universitat Aut?noma de Barcelona (UAB) >> 08193, Cerdanyola del Vall?s >> Barcelona (Spain) >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From carandraug+dev at gmail.com Fri May 17 01:12:24 2013 From: carandraug+dev at gmail.com (=?ISO-8859-1?Q?Carn=EB_Draug?=) Date: Fri, 17 May 2013 06:12:24 +0100 Subject: [Bioperl-l] sets of sequences - how to read? 
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> References: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> Message-ID: On 17 May 2013 05:08, Fields, Christopher J wrote: > This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now. > > My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1. :s I'm not sure I understood your suggestion. I think the problem is just the introduction of a new concept, a "set" of stuff (genes in this case), and how should SeqIO handle multiple sets. Carn? From shalabh.sharma7 at gmail.com Fri May 17 10:54:55 2013 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Fri, 17 May 2013 10:54:55 -0400 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files Message-ID: HI, First of all i am really sorry for sending this mail here, i am not sure if this is the right forum. I know lot of people work on similar stuff. I wrote to NCBI but nobody replied. Actually i am looking for all bacterial/microbial gene annotation nucleotide fasta files. Does anyone knows where to download these kind of files. I tried *ffn files but they are not annotated. Or is there any module in bioperl that i can use ? I would really appreciate your help. 
Thanks
Shalabh

--
Shalabh Sharma
Scientific Computing Professional Associate (Bioinformatics Specialist)
Department of Marine Sciences
University of Georgia
Athens, GA 30602-3636

From fossandonc at hotmail.com Fri May 17 11:59:04 2013
From: fossandonc at hotmail.com (Francisco J. Ossandón)
Date: Fri, 17 May 2013 11:59:04 -0400
Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files
In-Reply-To:
References:
Message-ID:

Hi,
You can get the annotations from here:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/

The ".ffn" files are the gene nucleotide FASTA files, but they do not show the product name. The ".faa" files are the gene amino acid FASTA files and do show the product name. If you want both the product and the nucleotide sequence, it is much better to use the GenBank ".gbk" files, which contain the complete data. You can parse them easily with BioPerl to obtain all genes, and then print the /protein_id, /product, and the nucleotide sequences to a new FASTA file. Check these to see how to do it:
http://www.bioperl.org/wiki/HOWTO:SeqIO
http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Cheers,

Francisco J. Ossandon

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On behalf of shalabh sharma
Sent: Friday, 17 May 2013 10:55
To: bioperl-l
Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files

HI,
First of all i am really sorry for sending this mail here, i am not sure if this is the right forum. I know lot of people work on similar stuff.
I wrote to NCBI but nobody replied.

Actually i am looking for all bacterial/microbial gene annotation nucleotide fasta files.
Does anyone knows where to download these kind of files.
I tried *ffn files but they are not annotated.
Or is there any module in bioperl that i can use ?
I would really appreciate your help.
Thanks Shalabh -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From shalabh.sharma7 at gmail.com Fri May 17 12:26:26 2013 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Fri, 17 May 2013 12:26:26 -0400 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: References: Message-ID: Hey Francisco, Thanks a lot. Basically i just wanted gene nucleotide fasta files with GI numbers. I think i will have to parse it from gbk files. -Shalabh On Fri, May 17, 2013 at 11:59 AM, Francisco J. Ossand?n < fossandonc at hotmail.com> wrote: > Hi, > You can get the annotations from here: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > > The ".ffn" are the genes nucleotide fasta files but it does not show the > product name, on the other hand the ".faa" are the genes aminoacid fasta > files and shows the product name, but if you want both product and > nucleotide is much better to use the Genbank ".gbk" files that contains the > complete data and you can parse it easily using BioPerl to obtain all > genes, > and then print the /protein_id, /product, and the nucleotide sequences in a > new fasta file. Check these to see how to do it: > http://www.bioperl.org/wiki/HOWTO:SeqIO > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > Cheers, > > Francisco J. Ossandon > > -----Mensaje original----- > De: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh sharma > Enviado el: viernes, 17 de mayo de 2013 10:55 > Para: bioperl-l > Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > HI, > First of all i am really sorry for sending this mail here, > i am not sure if this is the right forum. 
I know lot of people work on > similar stuff. > I wrote to NCBI but nobody replied. > > Actually i am looking for all bacterial/microbial gene annotation > nucleotide > fasta files. > Does anyone knows where to download these kind of files. > I tried *ffn files but they are not annotated. > Or is there any module in bioperl that i can use ? > I would really appreciate your help. > > Thanks > Shalabh > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences University of Georgia Athens, GA 30602-3636 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From cjfields at illinois.edu Fri May 17 13:37:53 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 17:37:53 +0000 Subject: [Bioperl-l] sets of sequences - how to read? In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E22218@CHIMBX5.ad.uillinois.edu> On May 17, 2013, at 12:12 AM, Carn? Draug wrote: > On 17 May 2013 05:08, Fields, Christopher J wrote: >> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now. >> >> My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1. > > :s I'm not sure I understood your suggestion. 
I think the problem is > just the introduction of a new concept, a "set" of stuff (genes in > this case), and how should SeqIO handle multiple sets. > > Carnë (Note: the critical point is whether Bio::ASN1::Entrezgene would allow this; I'm not sure it would. Otherwise this is all really hand-wavy.) To me a 'set of stuff', particularly when the 'stuff' is stored sequentially in a flat file, is a simple 'database' or 'store' of similar items, where the class lets one look up particular members in the set, but also could store higher level information about the set as a whole if needed. If it were me, I would implement a method particular to Bio::SeqIO::entrezgene that specifically creates and returns this ( next_geneset(), for instance ); next_seq() could then be implemented to iterate through the items in that database/store. Two useful things come out of this. First, if the data for the Entrez Gene file/chunk are parsed to store offsets per ID, one would only need to parse out the chunks needed (offset of ID to next offset), then pass that into the parser and create objects on the fly. This would probably be as fast or faster than (for instance) the greedy method of parsing the entire file and storing everything in objects up-front, then iterating through those objects one at a time, which I think is the current behavior. Second: if an index is created, the upfront cost is already paid (you could reuse the same index when parsing the same data). An analogous example might be storing all FASTQ data in a sequencing run; I don't want to expend the effort to parse all the FASTQ data, but I may want to run operations on individual items in the set as well as store additional information about the data (barcodes per run, lanes, overall quality stats, etc). Does that make sense? The pieces for this are lying around (Bio::Index::*, for instance, has methods for indexing flat files, as do classes like Bio::DB::Fasta).
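For illustration only, the "store offsets per ID, then parse just the chunk you need" idea can be sketched in plain Perl. This is a minimal sketch, not how Bio::SeqIO::entrezgene, Bio::Index::* or Bio::DB::Fasta actually work: the record layout (each record starts with a ">ID" header line) and the helper names build_offset_index() and fetch_record() are assumptions made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scan the file once and remember, for every record ID, the byte offset
# where the record starts and how many bytes it spans.
# ASSUMPTION: a record begins with a ">ID" header line (demo layout only).
sub build_offset_index {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                       # keep offsets byte-accurate
    my (%index, $current_id);
    my $pos = tell($fh);               # offset of the line about to be read
    while (defined(my $line = <$fh>)) {
        if ($line =~ /^>(\S+)/) {
            my $id = $1;
            # Close the previous record: it ends where this one starts.
            $index{$current_id}[1] = $pos - $index{$current_id}[0]
                if defined $current_id;
            $index{$id} = [$pos, 0];
            $current_id = $id;
        }
        $pos = tell($fh);
    }
    # The last record runs to end of file.
    $index{$current_id}[1] = $pos - $index{$current_id}[0]
        if defined $current_id;
    close $fh;
    return \%index;
}

# Pull out a single record by ID without reading the rest of the file.
sub fetch_record {
    my ($path, $index, $id) = @_;
    my $entry = $index->{$id} or return;
    my ($start, $length) = @$entry;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    seek($fh, $start, 0) or die "Cannot seek in $path: $!";
    my $got = read($fh, my $chunk, $length);
    close $fh;
    die "Short read for $id" unless defined $got && $got == $length;
    return $chunk;
}

# Demo on a small temporary file.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} ">gene1\nACGT\nTTGA\n>gene2\nGGCC\n>gene3\nAT\n";
close $tmp_fh;

my $index  = build_offset_index($tmp_path);
my $record = fetch_record($tmp_path, $index, 'gene2');
print $record;
```

The chunk returned by fetch_record() is what would be handed to the real parser to build an object on demand; the index itself could be saved and reused so the scanning cost is paid only once.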
chris From shalabh.sharma7 at gmail.com Sun May 19 15:33:16 2013 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Sun, 19 May 2013 15:33:16 -0400 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz> References: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz> Message-ID: Thanks Russell, Actually i wanted all the Bacterial gene nucleotide files, so i parsed it from *gbk. But yes these files might help me for my other parts of my work. Thanks Shalabh On Sun, May 19, 2013 at 3:26 PM, Smithies, Russell < Russell.Smithies at agresearch.co.nz> wrote: > Another option that I've used before is to download the gene2accession, > gene2refseq, and gene_info files from here > ftp://ftp.ncbi.nih.gov/gene/DATA/ then parse out the data you require. > It might work for you? > > --Russell > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto: > bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma > Sent: Saturday, 18 May 2013 4:26 a.m. > To: Francisco J. Ossand?n > Cc: bioperl-l > Subject: Re: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > Hey Francisco, > Thanks a lot. Basically i just wanted gene nucleotide fasta > files with GI numbers. > I think i will have to parse it from gbk files. > > -Shalabh > > > On Fri, May 17, 2013 at 11:59 AM, Francisco J. 
Ossand?n < > fossandonc at hotmail.com> wrote: > > > Hi, > > You can get the annotations from here: > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > > > > The ".ffn" are the genes nucleotide fasta files but it does not show > > the product name, on the other hand the ".faa" are the genes aminoacid > > fasta files and shows the product name, but if you want both product > > and nucleotide is much better to use the Genbank ".gbk" files that > > contains the complete data and you can parse it easily using BioPerl > > to obtain all genes, and then print the /protein_id, /product, and the > > nucleotide sequences in a new fasta file. Check these to see how to do > > it: > > http://www.bioperl.org/wiki/HOWTO:SeqIO > > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > > > Cheers, > > > > Francisco J. Ossandon > > > > -----Mensaje original----- > > De: bioperl-l-bounces at lists.open-bio.org > > [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh > > sharma Enviado el: viernes, 17 de mayo de 2013 10:55 > > Para: bioperl-l > > Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > > > HI, > > First of all i am really sorry for sending this mail > > here, i am not sure if this is the right forum. I know lot of people > > work on similar stuff. > > I wrote to NCBI but nobody replied. > > > > Actually i am looking for all bacterial/microbial gene annotation > > nucleotide fasta files. > > Does anyone knows where to download these kind of files. > > I tried *ffn files but they are not annotated. > > Or is there any module in bioperl that i can use ? > > I would really appreciate your help. 
> > > > Thanks > > Shalabh > > > > -- > > Shalabh Sharma > > Scientific Computing Professional Associate (Bioinformatics > > Specialist) Department of Marine Sciences University of Georgia > > Athens, GA 30602-3636 _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences University of Georgia Athens, GA 30602-3636 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. 
> ======================================================================= > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From Russell.Smithies at agresearch.co.nz Sun May 19 15:26:35 2013 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Mon, 20 May 2013 07:26:35 +1200 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: References: Message-ID: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz> Another option that I've used before is to download the gene2accession, gene2refseq, and gene_info files from here ftp://ftp.ncbi.nih.gov/gene/DATA/ then parse out the data you require. It might work for you? --Russell -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma Sent: Saturday, 18 May 2013 4:26 a.m. To: Francisco J. Ossand?n Cc: bioperl-l Subject: Re: [Bioperl-l] Downloading annotated Gene nucleotide fasta files Hey Francisco, Thanks a lot. Basically i just wanted gene nucleotide fasta files with GI numbers. I think i will have to parse it from gbk files. -Shalabh On Fri, May 17, 2013 at 11:59 AM, Francisco J. Ossand?n < fossandonc at hotmail.com> wrote: > Hi, > You can get the annotations from here: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > > The ".ffn" are the genes nucleotide fasta files but it does not show > the product name, on the other hand the ".faa" are the genes aminoacid > fasta files and shows the product name, but if you want both product > and nucleotide is much better to use the Genbank ".gbk" files that > contains the complete data and you can parse it easily using BioPerl > to obtain all genes, and then print the /protein_id, /product, and the > nucleotide sequences in a new fasta file. 
Check these to see how to do > it: > http://www.bioperl.org/wiki/HOWTO:SeqIO > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > Cheers, > > Francisco J. Ossandon > > -----Mensaje original----- > De: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh > sharma Enviado el: viernes, 17 de mayo de 2013 10:55 > Para: bioperl-l > Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > HI, > First of all i am really sorry for sending this mail > here, i am not sure if this is the right forum. I know lot of people > work on similar stuff. > I wrote to NCBI but nobody replied. > > Actually i am looking for all bacterial/microbial gene annotation > nucleotide fasta files. > Does anyone knows where to download these kind of files. > I tried *ffn files but they are not annotated. > Or is there any module in bioperl that i can use ? > I would really appreciate your help. > > Thanks > Shalabh > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics > Specialist) Department of Marine Sciences University of Georgia > Athens, GA 30602-3636 _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. 
Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From miquel.ramia at uab.cat Tue May 21 11:08:18 2013 From: miquel.ramia at uab.cat (=?ISO-8859-1?Q?Miquel_R=E0mia?=) Date: Tue, 21 May 2013 17:08:18 +0200 Subject: [Bioperl-l] Problem compiling Bio::DB::Sam In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu> References: <5194B815.2010401@uab.cat> <"118F034CF4C3EF48A96F86CE585B94BF74E1F C2C"@CHIMBX5.ad.uillinois.edu> <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu> Message-ID: <519B8DE2.2070308@uab.cat> On 17/05/13 06:16, Fields, Christopher J wrote: > For the record, this is now fixed in the latest Bio::Samtools (via Lincoln). > > chris > > Compiled correctly! thank you -- Miquel Ràmia Jesús PhD. candidate (PIF) Evolutionary Bioinformatics Group (Genomics, Bioinformatics and Evolution Group) Lab MRB/014 - 93 586 89 58 MRB - Institut de Biologia i Biomedicina (IBB) Universitat Autònoma de Barcelona (UAB) 08193, Cerdanyola del Vallès Barcelona (Spain) From b.l.cohen_home at btinternet.com Mon May 20 14:49:50 2013 From: b.l.cohen_home at btinternet.com (Bernard Cohen) Date: Mon, 20 May 2013 19:49:50 +0100 (BST) Subject: [Bioperl-l] Phylip format error Message-ID: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Hello! I happen to have checked to see what the PERL webpage says about Phylip format for DNA alignment files and see that it is erroneous. I am not a PERL user and do not want to be bothered to register or otherwise learn how to make an official comment, so forward this for someone to pick up. Phylip format allows up to 10 spaces for taxon names; the data must start in the 11th space.
This can be checked on Joe Felsenstein's site. The PERL page accessed by searching for "Phylip format PERL" allows only 8 spaces for the name. B. L. Cohen From senanu.pearson at gmail.com Wed May 22 16:15:24 2013 From: senanu.pearson at gmail.com (Senanu) Date: Wed, 22 May 2013 13:15:24 -0700 Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment Message-ID: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com> Hi all, I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong. I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus. (I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments. Is this a known problem? Is there another way to generate such a consensus?

    my $in = Bio::AlignIO->new(-file => $files[0],
                               -format => 'XMFA');
    while (my $aln = $in->next_aln()) {
        foreach my $seq ($aln->each_seq) {
            $seq->alphabet('dna');
        }
        my $con = $aln->consensus_iupac();
    }

Thanks in advance. Ngwenyama From cjfields at illinois.edu Wed May 22 19:17:50 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 22 May 2013 23:17:50 +0000 Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment In-Reply-To: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com> References: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E2A01D@CHIMBX5.ad.uillinois.edu> On May 22, 2013, at 3:15 PM, Senanu wrote: > Hi all, > > I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong. Probably the former, but... > I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus.
(I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments. It shouldn't take that long, 7 Mb isn't that large. Or is that 7 Mb for one genome? > Is this a known problem? Is there another way to generate such a consensus? The code isn't really optimized for this, but again this isn't terribly large. Is the bottleneck reading the alignment in, or is it the consensus_iupac() step? Hard to say w/o seeing the alignment data itself. > my $in = Bio::AlignIO->new(-file => $files[0], > -format => 'XMFA'); > while (my $aln = $in->next_aln()) { > foreach my $seq ($aln->each_seq) { > $seq->alphabet('dna'); > } > my $con = $aln->consensus_iupac(); > } > > > Thanks in advance. > Ngwenyama > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l chris From alexeymorozov1991 at gmail.com Thu May 23 03:22:13 2013 From: alexeymorozov1991 at gmail.com (Alexey Morozov) Date: Thu, 23 May 2013 16:22:13 +0900 Subject: [Bioperl-l] Phylip format error In-Reply-To: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: Which is also worsened by the fact that there is relaxed phylip format, which allows up to 250 chars for taxon name. They are separated from a sequence by single space, which creates problems if names were extended to 10 chars in strict Felsenstein's format by whitespaces. On the whole, phylip is as messily defined format as one can make from a plain textfile with information content of fasta. Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed phylip and how does it tell dialects from one another. 
Even if code support is OK, it may be worthwile to explain it somewhere at bioperl.org 2013/5/21 Bernard Cohen > Hello! > > I happen to have checked to see what the PERL webpage says about Phylip > format for DNA alignment files and see that it is erroneous. > > I am not a PERL user and do not want to be bothered to register or > otherwise learn how to make an official comment, so forward this for > someone to pick up. > > Phylip format allows up to 10 spaces for taxon names; the data must start > in the 11th space. This can be checked on Jo Felsenstein's site. > > The PERL page accessed by searching for "Phylip format PERL" allows only 8 > spaces for the name. > > B. L. Cohen > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Alexey Morozov, LIN SB RAS, bioinformatics group. Irkutsk, Russia. From p.j.a.cock at googlemail.com Thu May 23 04:30:21 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 23 May 2013 09:30:21 +0100 Subject: [Bioperl-l] Phylip format error In-Reply-To: References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: On Thu, May 23, 2013 at 8:22 AM, Alexey Morozov wrote: > Which is also worsened by the fact that there is relaxed phylip format, > which allows up to 250 chars for taxon name. They are separated from a > sequence by single space, which creates problems if names were extended to > 10 chars in strict Felsenstein's format by whitespaces. On the whole, > phylip is as messily defined format as one can make from a plain textfile > with information content of fasta. > Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed > phylip and how does it tell dialects from one another. 
Even if code support > is OK, it may be worthwhile to explain it somewhere at bioperl.org Biopython's AlignIO defines both a (strict) "phylip" and "relaxed-phylip" as two separate formats (or variants, like the "fastq" variants). Doing the same in BioPerl would seem sensible since auto-detection is not easy. http://biopython.org/wiki/AlignIO#File_Formats Peter P.S. Where does that 250-character limit for the taxon name come from? The trouble with relaxed phylip is that some tools are more relaxed than others ;) From awitney at sgul.ac.uk Thu May 23 04:43:15 2013 From: awitney at sgul.ac.uk (Adam Witney) Date: Thu, 23 May 2013 09:43:15 +0100 Subject: [Bioperl-l] Phylip format error In-Reply-To: References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: <519DD6A3.8090304@sgul.ac.uk> Not sure if there is an actual question in these messages, but BioPerl can be used to generate valid Phylip format and run, like this:

    ## Build Align object
    my $aln = Bio::SimpleAlign->new(-seqs=>$seqs);

    ## swap the taxa names with 8 characters long unique IDs
    my ($aln_safe, $ref_name) = $aln->set_displayname_safe(8);

    ## Write out phylip format infile
    Bio::AlignIO->new(-file=>'>infile.out', -format=>'phylip', -interleaved => 0)->write_aln($aln);

    ## run PHYLIP's pars program
    my @params = (idlength=>10); #, jumble=>"17,10");
    my $tree_factory = Bio::Tools::Run::Phylo::Phylip::Pars->new(@params);
    $tree_factory->quiet(1); # Suppress pars messages to terminal
    my $tree = $tree_factory->create_tree($aln_safe);

    ## fix the node labels back
    my @nodes = sort { defined $a->id && defined $b->id && $a->id cmp $b->id } $tree->get_nodes();
    foreach my $nd (@nodes) {
        if ( $nd->is_Leaf ) {
            $nd->id($ref_name->{$nd->id_output})
        }
    }

HTH Adam On 23/05/2013 08:22, Alexey Morozov wrote: > Which is also worsened by the fact that there is relaxed phylip format, > which allows up to 250 chars for taxon name.
They are separated from a > sequence by single space, which creates problems if names were extended to > 10 chars in strict Felsenstein's format by whitespaces. On the whole, > phylip is as messily defined format as one can make from a plain textfile > with information content of fasta. > Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed > phylip and how does it tell dialects from one another. Even if code support > is OK, it may be worthwile to explain it somewhere at bioperl.org > > > 2013/5/21 Bernard Cohen > >> Hello! >> >> I happen to have checked to see what the PERL webpage says about Phylip >> format for DNA alignment files and see that it is erroneous. >> >> I am not a PERL user and do not want to be bothered to register or >> otherwise learn how to make an official comment, so forward this for >> someone to pick up. >> >> Phylip format allows up to 10 spaces for taxon names; the data must start >> in the 11th space. This can be checked on Jo Felsenstein's site. >> >> The PERL page accessed by searching for "Phylip format PERL" allows only 8 >> spaces for the name. >> >> B. L. Cohen >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > From cjfields at illinois.edu Thu May 23 09:48:31 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 23 May 2013 13:48:31 +0000 Subject: [Bioperl-l] Phylip format error In-Reply-To: References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E323A4@CITESMBX5.ad.uillinois.edu> Alexey, Just want to point out that 'relaxed phylip' format was introduced long after this parser was created; in fact (as Adam points out) there was an alternative workaround to deal with the lossy names. The content of that page is on a wiki, which anyone is free to edit (just need an OpenID to set up an account). 
chris On May 23, 2013, at 2:22 AM, Alexey Morozov wrote: > Which is also worsened by the fact that there is relaxed phylip format, > which allows up to 250 chars for taxon name. They are separated from a > sequence by single space, which creates problems if names were extended to > 10 chars in strict Felsenstein's format by whitespaces. On the whole, > phylip is as messily defined format as one can make from a plain textfile > with information content of fasta. > Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed > phylip and how does it tell dialects from one another. Even if code support > is OK, it may be worthwile to explain it somewhere at bioperl.org > > > 2013/5/21 Bernard Cohen > >> Hello! >> >> I happen to have checked to see what the PERL webpage says about Phylip >> format for DNA alignment files and see that it is erroneous. >> >> I am not a PERL user and do not want to be bothered to register or >> otherwise learn how to make an official comment, so forward this for >> someone to pick up. >> >> Phylip format allows up to 10 spaces for taxon names; the data must start >> in the 11th space. This can be checked on Jo Felsenstein's site. >> >> The PERL page accessed by searching for "Phylip format PERL" allows only 8 >> spaces for the name. >> >> B. L. Cohen >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > > -- > Alexey Morozov, > LIN SB RAS, bioinformatics group. > Irkutsk, Russia. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu May 23 10:05:32 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 23 May 2013 14:05:32 +0000 Subject: [Bioperl-l] Phylip format error In-Reply-To: References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu> On May 23, 2013, at 3:30 AM, Peter Cock wrote: > On Thu, May 23, 2013 at 8:22 AM, Alexey Morozov > wrote: >> Which is also worsened by the fact that there is relaxed phylip format, >> which allows up to 250 chars for taxon name. They are separated from a >> sequence by single space, which creates problems if names were extended to >> 10 chars in strict Felsenstein's format by whitespaces. On the whole, >> phylip is as messily defined format as one can make from a plain textfile >> with information content of fasta. >> Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed >> phylip and how does it tell dialects from one another. Even if code support >> is OK, it may be worthwile to explain it somewhere at bioperl.org > > Biopython's AlignIO defines both a (strict) "phylip" and "relaxed-phylip" > as two separate formats (or variants, like the "fastq" variants). Doing > the same in BioPerl would seem sensible since auto-detection is not > easy. > > http://biopython.org/wiki/AlignIO#File_Formats > > Peter > > P.S. Where does that 250 characters for the taxon name limit come from? > The trouble with relaxed phylip is that some tools are more relaxed than > others ;) As Adam pointed out, prior to the introduction of 'relaxed phylip' we had an alternative solution that didn't require a modified format but still allowed one to use PHYLIP and other tools requesting the format. 
I think 'relaxed phylip' was introduced by CIPRES a few years back. Frankly, this is the first time I have seen this mentioned on the list; yay, yet another format variation :) The variant format parsing (as implemented for SeqIO::fastq, as you know) deals with variant names like 'fastq-sanger', where the main format name is first, the variant of the format second. The order in this case is reversed (relaxed-phylip), which I'm pretty sure will not work. Not impossible to allow, but we would probably allow support like this initially: my $in = Bio::AlignIO->new(-format => 'phylip', -variant => 'relaxed', ...); chris From cjfields at illinois.edu Thu May 23 09:56:32 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 23 May 2013 13:56:32 +0000 Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment In-Reply-To: <92CD2F9D-7C8E-4941-9354-4BD80D8C2CB6@gmail.com> References: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com> <118F034CF4C3EF48A96F86CE585B94BF74E2A01D@CHIMBX5.ad.uillinois.edu> <92CD2F9D-7C8E-4941-9354-4BD80D8C2CB6@gmail.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E32489@CITESMBX5.ad.uillinois.edu> (keep the list cc'd) On May 22, 2013, at 6:31 PM, Senanu wrote: > On May 22, 2013, at 4:17 PM, Fields, Christopher J wrote: > >> Hi all, >>> >>> I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong. >> >> Probably the former, but... >> >>> I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus. (I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments. >> >> It shouldn't take that long, 7 Mb isn't that large. Or is that 7 Mb for one genome?
> > It is 7Mb per genome, but there are only 2 genomes in the alignment, and the sequences are very similar to one another. > >> >>> Is this a known problem? Is there another way to generate such a consensus? >> >> The code isn't really optimized for this, but again this isn't terribly large. Is the bottleneck reading the alignment in, or is it the consensus_iupac() step? Hard to say w/o seeing the alignment data itself. > > The bottleneck is definitely with the consensus_iupac step. Reading the alignment in takes a few seconds. That's interesting, but again not surprising. One would have to look at the code, but I wouldn't be surprised if the method is terribly inefficient. chris From p.j.a.cock at googlemail.com Thu May 23 10:53:09 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 23 May 2013 15:53:09 +0100 Subject: [Bioperl-l] Phylip format error In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu> References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu> Message-ID: On Thu, May 23, 2013 at 3:05 PM, Fields, Christopher J wrote: > > I think 'relaxed phylip' was introduced by CIPRES a few years back. > Frankly, this is the first time I have seen this mentioned on the list; yay, > yet another format variation :) The relaxed phylip 'format' goes back further than that, e.g. http://lists.open-bio.org/pipermail/biopython-dev/2008-June/003899.html RAxML and PHYML support relaxed phylip - but with their own ID limits. 
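To make the strict-versus-relaxed distinction discussed in this thread concrete, here is a small plain-Perl sketch of the two ways a PHYLIP data line can be split into name and sequence. It assumes only what the thread states (strict: name in the first 10 columns, data from column 11; relaxed: name separated from the sequence by whitespace). The helper names parse_strict_line() and parse_relaxed_line() are hypothetical and this is not the Bio::AlignIO::phylip implementation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Strict PHYLIP: the name occupies exactly the first 10 columns
# (padded with blanks); sequence data starts in column 11.
sub parse_strict_line {
    my ($line) = @_;
    chomp $line;
    return if length($line) <= 10;
    my $name = substr($line, 0, 10);
    my $seq  = substr($line, 10);
    $name =~ s/\s+$//;    # drop the padding after a short name
    $seq  =~ s/\s+//g;    # ignore blanks inside the sequence data
    return ($name, $seq);
}

# Relaxed PHYLIP: the name is the first whitespace-delimited token,
# whatever its length; everything after it is sequence.
sub parse_relaxed_line {
    my ($line) = @_;
    chomp $line;
    my ($name, $seq) = $line =~ /^(\S+)\s+(.*)$/ or return;
    $seq =~ s/\s+//g;
    return ($name, $seq);
}

my @lines = (
    "taxon_A   ACGTACGT",                  # short name, padded to 10 columns
    "Homo sapieACGTACGT",                  # strict name containing a blank
    "Homo_sapiens_isolate_42 ACGTACGT",    # long name, relaxed style
);

for my $line (@lines) {
    my ($sn, $ss) = parse_strict_line($line);
    my ($rn, $rs) = parse_relaxed_line($line);
    printf "%-36s strict=[%s|%s] relaxed=[%s|%s]\n",
        "'$line'", $sn, $ss, $rn, $rs;
}
```

Only the first line gives the same answer under both readings. The second is split wrongly by the relaxed parser and the third by the strict one, which is why treating the two as separately named formats (or variants) is safer than trying to auto-detect.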
Peter

From Ben.Ward at sainsbury-laboratory.ac.uk Fri May 24 15:14:17 2013
From: Ben.Ward at sainsbury-laboratory.ac.uk (Ben Ward (TSL))
Date: Fri, 24 May 2013 19:14:17 +0000
Subject: [Bioperl-l] Building Many trees with Bioperl
Message-ID:

Hi,

I'm new to Bioperl and plan to make a script to automate making trees with many alignment files (themselves generated by automating the process of multiple alignment for many datasets by using clustalw in a bioperl script).
But I need to get it right for one piece of test data before I can do it for all:

What I have produced so far is the below. It's supposed to load in the alignment file as a SimpleAlign. Then use that alignment in phyml. I looked at the documentation and tried to follow examples.

However it gives me many errors like:
--------------------- WARNING ---------------------
MSG: Replacing one sequence [FXCNDTJ02P/1-366]

And then gives me:
Phyml command = /Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Phyml call (/Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output [/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip_phyml_stat.txt]: 11
STACK: Error::throw
STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472
STACK: Bio::Tools::Run::Phylo::Phyml::_run /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851
STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338
-----------------------------------------------------------

Can someone let me know if I'm going about this correctly and what I need to do to get it to work. I've also tried to run phyml by giving the filename in the run() method like:
my $inputfilename = 'outputalignmentfile';
my $tree = $factory->run("$inputfilename");

But I get the same EXCEPTION: Bio::Root::Exception message.

Thanks,
Ben W.

SCRIPT ---

#!/usr/bin/perl
use warnings;
use strict;
BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' }
use Bio::TreeIO;
use Bio::AlignIO;
use Bio::Tools::Run::Phylo::Phyml;

my $alnin = Bio::AlignIO->new(-file => " 'phylip');

my $aln = $alnin->next_aln();


# Make a Phyml factory.
my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose => 2,
                                                 -data_type => 'dna');

# Pass the factory an alignment and run:
# my $inputfilename = 'outputalignmentfile';

my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object.


# Setup tree output stream...
my $treeio = Bio::TreeIO->new(-format => 'newick',
                              -file => 'tree.newick');

$treeio->write_tree($tree);

exit 0;

From bosborne11 at verizon.net Fri May 24 17:25:38 2013
From: bosborne11 at verizon.net (Brian Osborne)
Date: Fri, 24 May 2013 17:25:38 -0400
Subject: [Bioperl-l] Building Many trees with Bioperl
In-Reply-To:
References:
Message-ID: <93D0963E-8399-468C-A32B-875FAA5054D1@verizon.net>

Ben,

What happens when you take the Phyml command itself and run it from the command line?

Also, a minor point: the message "MSG: Replacing one sequence [FXCNDTJ02P/1-366]" is not an error, it is a warning. An error accompanies an exit, warnings are just informative.

Brian O.


On May 24, 2013, at 3:14 PM, Ben Ward (TSL) wrote:

> Hi,
>
> I'm new to Bioperl and plan to make a script to automate making trees with many alignment files (themselves generated by automating the process of multiple alignment for many datasets by using clustalw in a bioperl script).
> But I need to get it right for one piece of test data before I can do it for all:
>
> What I have produced so far is the below. It's supposed to load in the alignment file as a SimpleAlign. Then use that alignment in phyml.
I looked at the documentation and tried to follow examples. > > However it gives me many errors like: > --------------------- WARNING --------------------- > MSG: Replacing one sequence [FXCNDTJ02P/1-366] > > And then gives me: > Phyml command = /Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Phyml call (/Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output [/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip_phyml_stat.txt]: 11 > STACK: Error::throw > STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472 > STACK: Bio::Tools::Run::Phylo::Phyml::_run /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851 > STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338 > ----------------------------------------------------------- > > Can someone let me know if I'm going about this correctly and what I need to do to get it to work. I've also tried to run phyml by giving the filename in the run() method like: > my $inputfilename = 'outputalignmentfile'; > my $tree = $factory->run("$inputfilename"); > > But I get the same EXCEPTION: Bio::Root::Exception message. > > Thanks, > Ben W. > > SCRIPT --- > > #!/usr/bin/perl > use warnings; > use strict; > BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' } > use Bio::TreeIO; > use Bio::AlignIO; > use Bio::Tools::Run::Phylo::Phyml; > > my $alnin = Bio::AlignIO->new(-file => " -format => 'phylip'); > > my $aln = $alnin->next_aln(); > > > # Make a Phyml factory. > my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose => 2, > -data_type => 'dna'); > > # Pass the factory an alignment and run: > # my $inputfilename = 'outputalignmentfile'; > > my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object. 
>
>
> # Setup tree output stream...
> my $treeio = Bio::TreeIO->new(-format => 'newick',
> -file => 'tree.newick');
>
> $treeio->write_tree($tree);
>
> exit 0;
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From Ben.Ward at sainsbury-laboratory.ac.uk Fri May 24 17:46:40 2013
From: Ben.Ward at sainsbury-laboratory.ac.uk (Ben Ward (TSL))
Date: Fri, 24 May 2013 21:46:40 +0000
Subject: [Bioperl-l] Building Many trees with Bioperl
In-Reply-To: <93D0963E-8399-468C-A32B-875FAA5054D1@verizon.net>
Message-ID:

Hi,

If I just run phyml on the command line it seems to run ok - it accepts my file and appears to undergo the tree building process - I haven't actually seen it complete yet, but ML and Bayes always do take a while - and I have many reads that need to be aligned. But PhyML gets to the point where it asks - are you sure you want to proceed - I say yes, then it keeps quiet and is currently working along to itself:

. 766 patterns found (out of a total of 795 sites).
. 58 sites without polymorphism (7.30%).
. Computing pairwise distances...
. Building BioNJ tree...
. WARNING: this analysis requires at least 556 MB of memory space.
. Do you really want to proceed? [Y/n] Y

It appears to be working =/

Best,
Ben.

On 24/05/2013 22:25, "Brian Osborne" wrote:

>Ben,
>
>What happens when you take the Phyml command itself and run it from the
>command line?
>
>Also, a minor point: the message "MSG: Replacing one sequence
>[FXCNDTJ02P/1-366]" is not an error, it is a warning. An error
>accompanies an exit, warnings are just informative.
>
>Brian O.
>
>
>On May 24, 2013, at 3:14 PM, Ben Ward (TSL)
> wrote:
>
>> Hi,
>>
>> I'm new to Bioperl and plan to make a script to automate making trees
>>with many alignment files (themselves generated by automating the
>>process of multiple alignment for many datasets by using clustalw in a
>>bioperl script).
>> But I need to get it right for one pice of test data before I can do it >>for all: >> >> What I have produced so far is the below. It's supposed to load in the >>alignment file as as SimpleAlign. Then use that alignment in phyml. I >>looked at the documentation and tried to follow examples. >> >> However it gives me many errors like: >> --------------------- WARNING --------------------- >> MSG: Replacing one sequence [FXCNDTJ02P/1-366] >> >> And then gives me: >> Phyml command = /Users/wardb/clustalw2/phyml >>/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyli >>p 0 i 1 0 HKY e e 1 e BIONJ y y >> >> ------------- EXCEPTION: Bio::Root::Exception ------------- >> MSG: Phyml call (/Users/wardb/clustalw2/phyml >>/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyli >>p 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output >>[/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyl >>ip_phyml_stat.txt]: 11 >> STACK: Error::throw >> STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472 >> STACK: Bio::Tools::Run::Phylo::Phyml::_run >>/Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851 >> STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338 >> ----------------------------------------------------------- >> >> Can someone let me know if I'm going about this correctly and what I >>need to do to get it to work. I've also tried to run phyml by giving the >>filename in the run() method like: >> my $inputfilename = 'outputalignmentfile'; >> my $tree = $factory->run("$inputfilename"); >> >> But I get the same EXCEPTION: Bio::Root::Exception message. >> >> Thanks, >> Ben W. 
>> >> SCRIPT --- >> >> #!/usr/bin/perl >> use warnings; >> use strict; >> BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' } >> use Bio::TreeIO; >> use Bio::AlignIO; >> use Bio::Tools::Run::Phylo::Phyml; >> >> my $alnin = Bio::AlignIO->new(-file => "> -format => 'phylip'); >> >> my $aln = $alnin->next_aln(); >> >> >> # Make a Phyml factory. >> my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose => 2, >> -data_type => 'dna'); >> >> # Pass the factory an alignment and run: >> # my $inputfilename = 'outputalignmentfile'; >> >> my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object. >> >> >> # Setup tree output stream... >> my $treeio = Bio::TreeIO->new(-format => 'newick', >> -file => 'tree.newick'); >> >> $treeio->write_tree($tree); >> >> exit 0; >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > From himaghna.bhattacharjee at gmail.com Tue May 28 11:59:02 2013 From: himaghna.bhattacharjee at gmail.com (Himaghna Bhattacharjee) Date: Tue, 28 May 2013 21:29:02 +0530 Subject: [Bioperl-l] error in the link to install Kobe repository for windows Message-ID: Hey, the link to install Kobe's repository for per 5.10 <" http://cpan.uwinnipeg.ca/PPMPackages/10xx/ --"> seems to be broken as it shows Error 503 Service Temporarily Unavailable. Could you please suggest an alternative ? Thanks . Himaghna Bhattacharjee 3rd year B.E.(Hons.)Chemical Engineering Birla Institute of Technology and Science,Pilani Rajasthan 333 031 From wgallin at ualberta.ca Tue May 28 13:49:02 2013 From: wgallin at ualberta.ca (Warren Gallin) Date: Tue, 28 May 2013 11:49:02 -0600 Subject: [Bioperl-l] ReplacedBy value in esummary Message-ID: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca> Hi, I just encountered a glitch when I was trying to update some entries in a database by finding updated GENBANK protein entries. 
The original record was gi 118091304 which has been replaced by gi 363734282

I retrieved an ESummary of the record for gi 118091304 as a Bio::Tools::EUtilities::Summary::DocSum object (called $ds).

When I then tried to retrieve the gi number for the replacement by using:

my $replaced = $ds->get_contents_by_name('ReplacedBy');

the returned value was 1, and when I dumped the ESummary record the relevant pair is ReplacedBy :XP_421022.3.

The full Esummary dump is:

UID         :118091304
Caption     :XP_421022
Title       :PREDICTED: similar to Potassium voltage-gated channel, subfamily Q, member
            :1 [Gallus gallus]
Extra       :gi|118091304|ref|XP_421022.2|[118091304]
Gi          :118091304
CreateDate  :2004/07/28
UpdateDate  :2006/11/16
Flags       :512
TaxId       :9031
Length      :643
Status      :replaced
ReplacedBy  :XP_421022.3
Comment     : This record was replaced or removed.


So two questions:

1) Is the ReplacedBy value always supposed to be the new accession number version rather than the new gi number?

2) Why would I be getting a returned value of 1 instead of the accession number that is in the summary record?

Any advice appreciated.

Warren Gallin

From cjfields at illinois.edu Tue May 28 14:31:30 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 28 May 2013 18:31:30 +0000
Subject: [Bioperl-l] ReplacedBy value in esummary
In-Reply-To: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca>
References: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E44377@CHIMBX5.ad.uillinois.edu>

On May 28, 2013, at 12:49 PM, Warren Gallin wrote:

> Hi,
>
> I just encountered a glitch when I was trying to update some entries in a database by finding updated GENBANK protein entries.
>
> The original record was gi 118091304 which has been replaced by gi 363734282
>
> I retrieved an ESummary of the record for gi 118091304 as a Bio::Tools::EUtilities::Summary::DocSum object (called $ds).
>
> When I then tried to retrieve the gi number for the replacement by using:
>
> my $replaced = $ds->get_contents_by_name('ReplacedBy');
>
> the returned value was 1, and when I dumped the ESummary record the relevant pair is ReplacedBy :XP_421022.3.
>
> The full Esummary dump is:
>
> UID         :118091304
> Caption     :XP_421022
> Title       :PREDICTED: similar to Potassium voltage-gated channel, subfamily Q, member
>             :1 [Gallus gallus]
> Extra       :gi|118091304|ref|XP_421022.2|[118091304]
> Gi          :118091304
> CreateDate  :2004/07/28
> UpdateDate  :2006/11/16
> Flags       :512
> TaxId       :9031
> Length      :643
> Status      :replaced
> ReplacedBy  :XP_421022.3
> Comment     : This record was replaced or removed.
>
>
> So two questions:
>
> 1) Is the ReplacedBy value always supposed to be the new accession number version rather than the new gi number?

No idea, the best people to answer that would be NCBI (the idea of these modules was to simplify getting at that data instead of munging the XML, but whatever they report is mainly from NCBI, not bioperl).

> 2) Why would I be getting a returned value of 1 instead of the accession number that is in the summary record?

The text dump above indicates the values do exist. However, you are calling a method that returns a list (note the plural in the name) in scalar context, so you get the number of values. If you always expect a single value, use:

my ($replaced) = $ds->get_contents_by_name('ReplacedBy');

which forces list context. That should fix it.

chris

> Any advice appreciated.
>
> Warren Gallin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From florent.angly at gmail.com Thu May 2 02:16:02 2013
From: florent.angly at gmail.com (Florent Angly)
Date: Thu, 02 May 2013 12:16:02 +1000
Subject: [Bioperl-l] Downloading sequences in batch from Trace Archive
In-Reply-To:
References:
Message-ID: <5181CC62.9000609@gmail.com>

Maybe using EUtilities?
http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook http://www.bioperl.org/wiki/HOWTO:EUtilities_Web_Service Florent On 30/04/13 06:25, shalabh sharma wrote: > Hi All, > Is there any module in Bioperl that can download sequences from > NCBI's trace archive? > > Thanks > Shalabh > From jason.stajich at gmail.com Thu May 2 05:42:55 2013 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 1 May 2013 22:42:55 -0700 Subject: [Bioperl-l] Fwd: doubt References: Message-ID: Begin forwarded message: > From: ARYA DAS > Subject: doubt > Date: May 1, 2013 10:42:21 PM PDT > To: jason at bioperl.org > > sir, > > Am using windows7 n was trying to install bio perl in it..i have > already installed active perl.5.16.3.1603 . n was followeing the > installation procedure mentioned .when i tried GUI installation .. i cant > find bioperl package when i try to search them for installation. > while using command line.. > > ppm> install PPM-Repositories > > shows error like cant find package that provides PPM repositories, > > and when i try manually ,on reaching the > perl Build test > > it says build is recognised as an internal or external file. > > please help if time permits > > regards, > arya Jason Stajich jason.stajich at gmail.com jason at bioperl.org From voldrani at gmail.com Sun May 5 04:03:38 2013 From: voldrani at gmail.com (Chris Maloney) Date: Sun, 5 May 2013 00:03:38 -0400 Subject: [Bioperl-l] Wiki work, Template:Doclink Message-ID: The module pages on the wiki could look a little better, like this one for example: http://www.bioperl.org/wiki/Module:Bio::Tools::Run::RemoteBlast. There used to be a bunch of extra whitespace at the top of the page, which was caused by extra line breaks in Template:Doclink, which I just removed. But, I think there are other improvements that could be made. I would like to turn this into an infobox -- which are the helpful informative tables of info on Wikipedia that appear on many articles on the upper right. 
That would allow us to add more links -- like to metacpan, for example. It is not completely trivial to import infoboxes into a wiki though, I just discovered. I just went through the exercise on my home wiki, and it involves importing a lot of templates from Wikipedia, and fixing up the common.css. You can see the full list of imported templates here: http://chrismaloney.org/wiki/index.php?title=Special:RecentChanges&limit=100. I don't *think* this should cause any problems, but I'm not 100% sure. On the other hand, if it does, it should be easy to roll back -- it's a wiki, after all. Does anybody have a problem if I do this? I'll wait a day for responses, and tackle this tomorrow, if no one objects. -- Chris M. From armendarez77 at hotmail.com Wed May 8 00:32:22 2013 From: armendarez77 at hotmail.com (Veronica A.) Date: Tue, 7 May 2013 17:32:22 -0700 Subject: [Bioperl-l] Bio::SeqIO doesn't write all gbk sequences from Bio::DB::GenBank Message-ID: Hello, I'm currently running Bio::Perl 1.6.1 on Ubuntu 12.04.2 LTS and have noticed a problem with Bio::SeqIO when writing genbank files using the write_seq() function; some of the files do not include an 'ORIGIN' tag or the sequence. I am using GI#s (50 at a time every 2 minutes) to retrieve genbank files via Bio::DB::GenBank. 
----------------------------------------START CODE---------------------------------- my $gb = Bio::DB::GenBank->new(-verbose=>-1);my $seqout = Bio::SeqIO->new(-file=>">$fileName", '-format'=>'Genbank', -alphabet=>'dna', -flush=>0, -verbose=>-1);while(@ids){ my @batchArray = splice(@ids, 0, 50); my $batchArrayRef = \@batchArray; my $streamObj; my $pid = fork(); if($pid == 0){ eval{ $streamObj = $gb->get_Stream_by_id($batchArrayRef); }; if($@){ print "Error: ".$@."\n"; } else{ while(my $seqObj = $streamObj->next_seq()){ unless($seqObj->accession_number() =~ /N[A-Z]\_/){ #print "ID: ".$seqObj->id()."\n"; #print "Seq:\n".$seqObj->seq()."\n"; $seqout->write_seq($seqObj); } } } exit 0; }}waitpid($pid,0);sleep(120); ----------------------------------------END CODE---------------------------------- Most of the Genbank files written to the output file have sequences, but there is a small portion that do not, even though they should. For example, JX287367, in NCBI includes an 'ORIGIN' tag and sequence and when I use the print function before writing to file, the sequence is printed to STDOUT, but the 'ORIGIN' tag and sequence are not written to the output gbk file. The following is found in the final output file: ----------------------------------START GBK----------------------------------------- LOCUS JX287367 588 bp DNA linear BCT 19-DEC-2012DEFINITION Chlamydia trachomatis strain UW-5/CX pyruvoyl-dependent arginine decarboxylase (aaxB) gene, complete cds.ACCESSION JX287367VERSION JX287367.1 GI:404351720KEYWORDS .SOURCE Chlamydia trachomatis ORGANISM Chlamydia trachomatis Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae; Chlamydia/Chlamydophila group; Chlamydia.REFERENCE 1 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Characterization of the activity and expression of arginine decarboxylase in human and animal Chlamydia pathogens JOURNAL FEMS Microbiol. Lett. 
337 (2), 140-146 (2012) PUBMED 23043454REFERENCE 2 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Direct Submission JOURNAL Submitted (06-JUL-2012) Department of Microbiology and Immunology, F. Edward Hebert School of Medicine, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, MD 20814, USAFEATURES Location/Qualifiers source 1..588 /mol_type="genomic DNA" /db_xref="taxon:813" /strain="UW-5/CX" /organism="Chlamydia trachomatis" /serovar="E" gene 1..588 /gene="aaxB" CDS 1..588 /protein_id="AFR60849.1" /gene="aaxB" /transl_table=11 /note="AaxB" /db_xref="GI:404351721" /codon_start=1 /product="pyruvoyl-dependent arginine decarboxylase" /translation="MPYGTRYPTLAFHTGGVGESDDGMPPQPFETFCYDSALLQAKIE NFNIVPYTSVLPKELFGNILPVDQCTKFFKHGAVLEVIMAGRGATVTDGTQAIATGVG ICWGKDKNGELIGGWAAEYVEFFPTWIDDEIAESHAKMWLKKSLQHELDLRSVSKHSE FQYFHNYINIRKKFGFCLTALGFLNFENVAPAVIQ" // ----------------------------------END GBK----------------------------------------- Can anyone tell what I am missing or why this is happening? I don't know if this has happened in earlier BioPerl versions as up until now, I usually downloaded sequences straight from NCBI, but that became too time consuming....but this seems to be as well :S Thank you in advance for any help, Veronica From cjfields at illinois.edu Wed May 8 02:17:43 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 8 May 2013 02:17:43 +0000 Subject: [Bioperl-l] Bio::SeqIO doesn't write all gbk sequences from Bio::DB::GenBank In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E166A0@CHIMBX5.ad.uillinois.edu> Veronica, Your mail may have garbled the script and example file. Can you paste these in a gist? https://gist.github.com/ chris On May 7, 2013, at 7:32 PM, Veronica A. 
wrote: > Hello, > I'm currently running Bio::Perl 1.6.1 on Ubuntu 12.04.2 LTS and have noticed a problem with Bio::SeqIO when writing genbank files using the write_seq() function; some of the files do not include an 'ORIGIN' tag or the sequence. > > I am using GI#s (50 at a time every 2 minutes) to retrieve genbank files via Bio::DB::GenBank. > > ----------------------------------------START CODE---------------------------------- > my $gb = Bio::DB::GenBank->new(-verbose=>-1);my $seqout = Bio::SeqIO->new(-file=>">$fileName", '-format'=>'Genbank', -alphabet=>'dna', -flush=>0, -verbose=>-1);while(@ids){ my @batchArray = splice(@ids, 0, 50); my $batchArrayRef = \@batchArray; > my $streamObj; my $pid = fork(); if($pid == 0){ eval{ $streamObj = $gb->get_Stream_by_id($batchArrayRef); }; if($@){ print "Error: ".$@."\n"; } else{ while(my $seqObj = $streamObj->next_seq()){ unless($seqObj->accession_number() =~ /N[A-Z]\_/){ #print "ID: ".$seqObj->id()."\n"; #print "Seq:\n".$seqObj->seq()."\n"; $seqout->write_seq($seqObj); } } } exit 0; }}waitpid($pid,0);sleep(120); > ----------------------------------------END CODE---------------------------------- > Most of the Genbank files written to the output file have sequences, but there is a small portion that do not, even though they should. For example, JX287367, in NCBI includes an 'ORIGIN' tag and sequence and when I use the print function before writing to file, the sequence is printed to STDOUT, but the 'ORIGIN' tag and sequence are not written to the output gbk file. 
The following is found in the final output file: > ----------------------------------START GBK----------------------------------------- > LOCUS JX287367 588 bp DNA linear BCT 19-DEC-2012DEFINITION Chlamydia trachomatis strain UW-5/CX pyruvoyl-dependent arginine decarboxylase (aaxB) gene, complete cds.ACCESSION JX287367VERSION JX287367.1 GI:404351720KEYWORDS .SOURCE Chlamydia trachomatis ORGANISM Chlamydia trachomatis Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae; Chlamydia/Chlamydophila group; Chlamydia.REFERENCE 1 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Characterization of the activity and expression of arginine decarboxylase in human and animal Chlamydia pathogens JOURNAL FEMS Microbiol. Lett. 337 (2), 140-146 (2012) PUBMED 23043454REFERENCE 2 (bases 1 to 588) AUTHORS Bliven,K.A., Fisher,D.J. and Maurelli,A.T. TITLE Direct Submission JOURNAL Submitted (06-JUL-2012) Department of Microbiology and Immunology, F. Edward Hebert School of Med! > icine, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, MD 20814, USAFEATURES Location/Qualifiers source 1..588 /mol_type="genomic DNA" /db_xref="taxon:813" /strain="UW-5/CX" /organism="Chlamydia trachomatis" /serovar="E" gene 1..588 /gene="aaxB" CDS 1..588 /protein_id="AFR60849.1" /gene="aaxB" /transl_table=11 /note="AaxB" /db_xref="GI:404351721" /codon_start=1 /product="pyruvoyl-dependent arginine decarboxylase" /translation="MPYGTRYPTLAFHTGGVGESDDGMPPQPFETFCYDSALLQAKIE NFNIVPYTSVLPKELFGNILPVDQCTKFFKHGAVLEVIMAGRGATVTDGTQAIATGVG ICWGKDKNGELIGGW! > AAEYVEFFPTWIDDEIAESHAKMWLKKSLQHELDLRSVSKHSE FQYFHN > YINIRKKFGFCLTALGFLNFENVAPAVIQ" > // > ----------------------------------END GBK----------------------------------------- > Can anyone tell what I am missing or why this is happening? 
I don't know if this has happened in earlier BioPerl versions as up until now, I usually downloaded sequences straight from NCBI, but that became too time consuming....but this seems to be as well :S
> Thank you in advance for any help,
> Veronica
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From witch.of.agnessi at gmail.com Wed May 8 19:24:53 2013
From: witch.of.agnessi at gmail.com (WoA)
Date: Wed, 8 May 2013 12:24:53 -0700 (PDT)
Subject: [Bioperl-l] Extracting matching subsequence from pairwise alignment
Message-ID: <1368041092972-16935.post@n3.nabble.com>

Hello All,

I have a pairwise global alignment of two DNA sequences generated by the program NEEDLE of the EMBOSS package. I wish to extract the sub-sequence that matches/aligns to a given region of the other sequence. In this alignment (Pastebin Link) the given region (actually the CDS) falls between base number 24:485 in the original sequence with ID 'XM_001005073.' I wish to extract the sub-sequence in the sequence ID 'Homolog' that aligns with that 24:485 region of the other sequence.

I'm using Bioperl to parse the alignment. I find out the alignment column numbers corresponding to the 24:485 region in the particular sequence, using 'column_from_residue_number'. Then I extract the sub-sequence from the 'aligned' other sequence (containing gaps) using the corresponding column numbers. Finally I remove the gap characters.

Am I doing this thing correctly and are there any pitfalls? Is there any better way to do it by (Bio)Perl/Python?

The code goes here:

use strict;
use warnings;
use Bio::AlignIO;

# read in an alignment generated by the EMBOSS program Needle
my $in = new Bio::AlignIO(-format => 'emboss', -file => 'test_needle.aln');
while( my $aln = $in->next_aln ) {
    #Seqnames: 'XM_001005073.'(CDS:24-485),'Homolog'
    my ($cds_start,$cds_end)=(24,485);#
    my $col_cdsstart = $aln->column_from_residue_number( 'XM_001005073.', $cds_start);
    my $col_cdsend= $aln->column_from_residue_number( 'XM_001005073.', $cds_end);
    foreach my $seq ($aln->each_seq) {
        if($seq->id() eq 'Homolog'){
            my $homolog_cds=$seq->subseq($col_cdsstart,$col_cdsend);
            $homolog_cds=~s/\-//g;
            print $homolog_cds,"\n";
        }
    }
}

--
View this message in context: http://bioperl.996286.n3.nabble.com/Extracting-matching-subsequence-from-pairwise-alignment-tp16935.html
Sent from the Bioperl-L mailing list archive at Nabble.com.

From hlapp at drycafe.net Wed May 15 20:44:07 2013
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Wed, 15 May 2013 16:44:07 -0400
Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences
Message-ID:

FYI, if you haven't seen this yet:

http://wssspe.researchcomputing.org.uk/

It seems to me that the Bio* projects, perhaps led by BioPerl as the oldest and thus longest running (nowadays more fancily called "sustained") of them would have a lot to say about the subject. Anyone interested in a joint submission?

Also, I notice that Biojava's Andreas is on the organizing committee, so maybe he's been conspiring on something already :-)

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From carandraug+dev at gmail.com Thu May 16 01:53:55 2013
From: carandraug+dev at gmail.com (Carnë Draug)
Date: Thu, 16 May 2013 02:53:55 +0100
Subject: [Bioperl-l] sets of sequences - how to read?
Message-ID:

Hi

when accessing Entrez Gene using eutils to get multiple genes, NCBI now returns an Entrezgene-Set[1] rather than a list of EntrezGene. This change must have happened sometime in the last 2 months. Compare:

use Bio::DB::EUtilities;

my %sets = (
    eutil   => 'efetch',
    db      => 'gene',
    retmode => 'text',
    rettype => 'asn1',
    email   => 'bioperl-l at lists.open-bio.org',
);

## this mimics the previous behaviour of the NCBI server but the multiple
## requests will annoy their servers
my @ids = (3014, 85235);
my $response;
foreach (@ids) {
    my $fetcher = Bio::DB::EUtilities->new(%sets, id => $_);
    $response .= $fetcher->get_Response->content;
}
## $fetcher is scoped to the loop above, so print the accumulated text
print $response;

## this used to be the right way to do it, but now returns an Entrezgene-Set
my $fetcher = Bio::DB::EUtilities->new(%sets, id => \@ids);
$response .= $fetcher->get_Response->content;
print $fetcher->get_Response->content;

There is no module to read these Entrezgene-Set in Perl at the moment, since Bio::ASN1::EntrezGene is not able to handle them. I have contacted the module author and sent him a fix[2] and he said he'll try to look into it next week.

However, even with the fix there is another problem. How would one access a set of sequences using the Bio::SeqIO API? There is no method to do that. One could say to ignore the sets, and make next_seq return the next sequence of the set. But then we are losing data. After all, it's perfectly viable to have multiple Entrezgene-Set in one file. What would be the right way to do this?

Carnë
[1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html [2] https://github.com/carandraug/bio-asn1-entrezgene/commit/69d505056d8b7897df6271ffb7a5f39d58873c6b From cjfields at illinois.edu Thu May 16 04:43:22 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 16 May 2013 04:43:22 +0000 Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu> Jason and I have discussed looking into opportunity's like this, I think it makes sense to try a joint submission. chris On May 15, 2013, at 3:44 PM, Hilmar Lapp wrote: > FYI, if you haven't seen this yet: > > http://wssspe.researchcomputing.org.uk/ > > It seems to me that the Bio* projects, perhaps led by BioPerl as the oldest and thus longest running (nowadays more fancily called "sustained") of them would have a lot to say about the subject. Anyone interested in a joint submission? 
> > Also, I notice that Biojava's Andreas is on the organizing committee, so maybe he's been conspiring on something already :-) > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > =========================================================== > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From p.j.a.cock at googlemail.com Thu May 16 09:10:25 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 16 May 2013 10:10:25 +0100 Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu> References: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu> Message-ID: On Thu, May 16, 2013 at 5:43 AM, Fields, Christopher J wrote: > Jason and I have discussed looking into opportunity's like this, I think it makes > sense to try a joint submission. > > chris This sounds like a good idea, although given the time and place I am unlikely to be able to attend in person: First Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE) (to held in conjunction with SC13, Sunday, 17 November 2013, Denver, CO, USA) http://wssspe.researchcomputing.org.uk/ Rather than trying to discuss this over four mailing lists should we switch to the cross project list open-bio-l, or continue off-list? 
http://lists.open-bio.org/mailman/listinfo/open-bio-l

Thanks,

Peter

From miquel.ramia at uab.cat Thu May 16 10:42:29 2013
From: miquel.ramia at uab.cat (Miquel Ràmia)
Date: Thu, 16 May 2013 12:42:29 +0200
Subject: [Bioperl-l] Problem compiling Bio::DB::Sam
Message-ID: <5194B815.2010401@uab.cat>

Hi all,

I get this message when compiling Bio::DB::Sam:

Building Bio-SamTools
gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy':
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join'
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt':
/var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create'
collect2: ld returned 1 exit status
make: *** [bam2bedgraph] Error 1

Is this error related to the module or some dependencies? or maybe a problem with my system?

Any help appreciated, thanks!

--
Miquel Ràmia Jesús
PhD. candidate (PIF)
Evolutionary Bioinformatics Group
(Genomics, Bioinformatics and Evolution Group)
Lab MRB/014 - 93 586 89 58
MRB - Institut de Biologia i Biomedicina (IBB)
Universitat Autònoma de Barcelona (UAB)
08193, Cerdanyola del Vallès
Barcelona (Spain)

From cjfields at illinois.edu Thu May 16 13:12:40 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 16 May 2013 13:12:40 +0000
Subject: [Bioperl-l] Problem compiling Bio::DB::Sam
In-Reply-To: <5194B815.2010401@uab.cat>
References: <5194B815.2010401@uab.cat>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu>

It may be due to the new samtools release (v 0.1.19). I know Heng Li has been working on the code over the last year for threading support (notice the undefined functions). Have you tried v 0.1.18?
chris On May 16, 2013, at 5:42 AM, Miquel R?mia wrote: > Hi all, > > I get this message when compiling Bio::DB::Sam: > > Building Bio-SamTools > > gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': > > /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' > > collect2: ld returned 1 exit status > > make: *** [bam2bedgraph] Error 1 > > > Is this error related to the module or some dependencies? or maybe a problem with my system? > > Any help appreciated, thanks! > > > -- > Miquel R?mia Jes?s > PhD. candidate (PIF) > Evolutionary Bioinformatics Group > (Genomics, Bioinformatics and Evolution Group) > Lab MRB/014 - 93 586 89 58 > MRB - Institut de Biologia i Biomedicina (IBB) > Universitat Aut?noma de Barcelona (UAB) > 08193, Cerdanyola del Vall?s > Barcelona (Spain) > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu May 16 13:09:45 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 16 May 2013 13:09:45 +0000 Subject: [Bioperl-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF74E1F8C8@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E1FBCF@CHIMBX5.ad.uillinois.edu> Yes, though we need to make sure others (e.g. those not subscribed to open-bio-l) are in the loop. November is a possibility for me. 
chris On May 16, 2013, at 4:10 AM, Peter Cock wrote: > On Thu, May 16, 2013 at 5:43 AM, Fields, Christopher J > wrote: >> Jason and I have discussed looking into opportunity's like this, I think it makes >> sense to try a joint submission. >> >> chris > > This sounds like a good idea, although given the time and place I am > unlikely to be able to attend in person: > > First Workshop on Sustainable Software for Science: Practice and > Experiences (WSSSPE) > (to held in conjunction with SC13, Sunday, 17 November 2013, Denver, CO, USA) > http://wssspe.researchcomputing.org.uk/ > > Rather than trying to discuss this over four mailing lists should we switch > to the cross project list open-bio-l, or continue off-list? > http://lists.open-bio.org/mailman/listinfo/open-bio-l > > Thanks, > > Peter From andreas at sdsc.edu Thu May 16 04:31:34 2013 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 15 May 2013 21:31:34 -0700 Subject: [Bioperl-l] [Biojava-l] Workshop on Sustainable Software for Science: Practice and Experiences In-Reply-To: References: Message-ID: Thanks Hilmar, you were faster than me in sending this out.. You are right, it would be very interesting to hear what some of the long running open-bio projects have to say on the topic of sustainability. Let me know if anybody is interested in a submission! Andreas On Wed, May 15, 2013 at 1:44 PM, Hilmar Lapp wrote: > FYI, if you haven't seen this yet: > > http://wssspe.researchcomputing.org.uk/ > > It seems to me that the Bio* projects, perhaps led by BioPerl as the > oldest and thus longest running (nowadays more fancily called "sustained") > of them would have a lot to say about the subject. Anyone interested in a > joint submission? 
> > Also, I notice that Biojava's Andreas is on the organizing committee, so > maybe he's been conspiring on something already :-) > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > =========================================================== > > > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From cjfields at illinois.edu Fri May 17 04:08:04 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 04:08:04 +0000 Subject: [Bioperl-l] sets of sequences - how to read? In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now. My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1. chris On May 15, 2013, at 8:53 PM, Carn? Draug wrote: > Hi > > when accessing entrez gene using eutils to get multiple genes, NCBI > now returns an Entrezgene-Set[1] rather than a list of EntrezGene. > This change must have happened sometime on the last 2 months. 
Compare: > > use Bio::DB::EUtilities; > > my %sets = ( > eutil => 'efetch', > db => 'gene', > retmode => 'text', > rettype => 'asn1', > email => 'bioperl-l at lists.open-bio.org', > ); > > ## this mimics the previous behaviour of the NCBI server but the > ## multiple requests will annoy their servers > my @ids = (3014, 85235); > my $response; > foreach (@ids) { > my $fetcher = Bio::DB::EUtilities->new(%sets, id => $_); > $response .= $fetcher->get_Response->content; > } > print $response; > > ## this used to be the right way to do it, but now returns an Entrezgene-Set > my $fetcher = Bio::DB::EUtilities->new(%sets, id => \@ids); > $response .= $fetcher->get_Response->content; > print $fetcher->get_Response->content; > > There is no module to read these Entrezgene-Sets in Perl at the moment, > since Bio::ASN1::EntrezGene is not able to handle them. I have > contacted the module author and sent him a fix[2] and he said he'll try > to look into it next week. > > However, even with the fix there is another problem. How would one > access a set of sequences using the Bio::SeqIO API? There is no method > to do that. One could ignore the sets and make next_seq return > the next sequence of the set. But then we are losing data. After all, > it's perfectly viable to have multiple Entrezgene-Sets in one file. > What would be the right way to do this? > > Carnë
> > [1] http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/IEB/ToolBox/CPP_DOC/asn_spec/Entrezgene-Set.html > [2] https://github.com/carandraug/bio-asn1-entrezgene/commit/69d505056d8b7897df6271ffb7a5f39d58873c6b > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Fri May 17 04:16:12 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 04:16:12 +0000 Subject: [Bioperl-l] Problem compiling Bio::DB::Sam In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu> References: <5194B815.2010401@uab.cat> <118F034CF4C3EF48A96F86CE585B94BF74E1FC2C@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu> For the record, this is now fixed in the latest Bio::Samtools (via Lincoln). chris On May 16, 2013, at 8:12 AM, "Fields, Christopher J" wrote: > It may be due to the new samtools release (v 0.1.19). I know Heng Li has been working on the code over the last year for threading support (notice the undefined functions). Have you tried v 0.1.18? 
> > chris > > On May 16, 2013, at 5:42 AM, Miquel R?mia wrote: > >> Hi all, >> >> I get this message when compiling Bio::DB::Sam: >> >> Building Bio-SamTools >> >> gcc -g -Wall -O2 -fPIC -o bam2bedgraph bam2bedgraph.o -L/var/lib/gbrowse2/databases/samtools/samtools-0.1.19 -lbam -lm -lz >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': >> >> /var/lib/gbrowse2/databases/samtools/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' >> >> collect2: ld returned 1 exit status >> >> make: *** [bam2bedgraph] Error 1 >> >> >> Is this error related to the module or some dependencies? or maybe a problem with my system? >> >> Any help appreciated, thanks! >> >> >> -- >> Miquel R?mia Jes?s >> PhD. candidate (PIF) >> Evolutionary Bioinformatics Group >> (Genomics, Bioinformatics and Evolution Group) >> Lab MRB/014 - 93 586 89 58 >> MRB - Institut de Biologia i Biomedicina (IBB) >> Universitat Aut?noma de Barcelona (UAB) >> 08193, Cerdanyola del Vall?s >> Barcelona (Spain) >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From carandraug+dev at gmail.com Fri May 17 05:12:24 2013 From: carandraug+dev at gmail.com (=?ISO-8859-1?Q?Carn=EB_Draug?=) Date: Fri, 17 May 2013 06:12:24 +0100 Subject: [Bioperl-l] sets of sequences - how to read? 
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> References: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> Message-ID: On 17 May 2013 05:08, Fields, Christopher J wrote: > This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now. > > My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1. :s I'm not sure I understood your suggestion. I think the problem is just the introduction of a new concept, a "set" of stuff (genes in this case), and how should SeqIO handle multiple sets. Carn? From shalabh.sharma7 at gmail.com Fri May 17 14:54:55 2013 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Fri, 17 May 2013 10:54:55 -0400 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files Message-ID: HI, First of all i am really sorry for sending this mail here, i am not sure if this is the right forum. I know lot of people work on similar stuff. I wrote to NCBI but nobody replied. Actually i am looking for all bacterial/microbial gene annotation nucleotide fasta files. Does anyone knows where to download these kind of files. I tried *ffn files but they are not annotated. Or is there any module in bioperl that i can use ? I would really appreciate your help. 
Thanks Shalabh -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From fossandonc at hotmail.com Fri May 17 15:59:04 2013 From: fossandonc at hotmail.com (=?iso-8859-1?Q?Francisco_J._Ossand=F3n?=) Date: Fri, 17 May 2013 11:59:04 -0400 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: References: Message-ID: Hi, You can get the annotations from here: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ The ".ffn" are the genes nucleotide fasta files but it does not show the product name, on the other hand the ".faa" are the genes aminoacid fasta files and shows the product name, but if you want both product and nucleotide is much better to use the Genbank ".gbk" files that contains the complete data and you can parse it easily using BioPerl to obtain all genes, and then print the /protein_id, /product, and the nucleotide sequences in a new fasta file. Check these to see how to do it: http://www.bioperl.org/wiki/HOWTO:SeqIO http://www.bioperl.org/wiki/HOWTO:Feature-Annotation Cheers, Francisco J. Ossandon -----Mensaje original----- De: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh sharma Enviado el: viernes, 17 de mayo de 2013 10:55 Para: bioperl-l Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files HI, First of all i am really sorry for sending this mail here, i am not sure if this is the right forum. I know lot of people work on similar stuff. I wrote to NCBI but nobody replied. Actually i am looking for all bacterial/microbial gene annotation nucleotide fasta files. Does anyone knows where to download these kind of files. I tried *ffn files but they are not annotated. Or is there any module in bioperl that i can use ? I would really appreciate your help. 
Thanks Shalabh -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From shalabh.sharma7 at gmail.com Fri May 17 16:26:26 2013 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Fri, 17 May 2013 12:26:26 -0400 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: References: Message-ID: Hey Francisco, Thanks a lot. Basically i just wanted gene nucleotide fasta files with GI numbers. I think i will have to parse it from gbk files. -Shalabh On Fri, May 17, 2013 at 11:59 AM, Francisco J. Ossand?n < fossandonc at hotmail.com> wrote: > Hi, > You can get the annotations from here: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > > The ".ffn" are the genes nucleotide fasta files but it does not show the > product name, on the other hand the ".faa" are the genes aminoacid fasta > files and shows the product name, but if you want both product and > nucleotide is much better to use the Genbank ".gbk" files that contains the > complete data and you can parse it easily using BioPerl to obtain all > genes, > and then print the /protein_id, /product, and the nucleotide sequences in a > new fasta file. Check these to see how to do it: > http://www.bioperl.org/wiki/HOWTO:SeqIO > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > Cheers, > > Francisco J. Ossandon > > -----Mensaje original----- > De: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh sharma > Enviado el: viernes, 17 de mayo de 2013 10:55 > Para: bioperl-l > Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > HI, > First of all i am really sorry for sending this mail here, > i am not sure if this is the right forum. 
I know lot of people work on > similar stuff. > I wrote to NCBI but nobody replied. > > Actually i am looking for all bacterial/microbial gene annotation > nucleotide > fasta files. > Does anyone knows where to download these kind of files. > I tried *ffn files but they are not annotated. > Or is there any module in bioperl that i can use ? > I would really appreciate your help. > > Thanks > Shalabh > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences University of Georgia Athens, GA 30602-3636 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From cjfields at illinois.edu Fri May 17 17:37:53 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 17 May 2013 17:37:53 +0000 Subject: [Bioperl-l] sets of sequences - how to read? In-Reply-To: References: <118F034CF4C3EF48A96F86CE585B94BF74E21252@CHIMBX5.ad.uillinois.edu> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E22218@CHIMBX5.ad.uillinois.edu> On May 17, 2013, at 12:12 AM, Carn? Draug wrote: > On 17 May 2013 05:08, Fields, Christopher J wrote: >> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land. I guess that would be... ... now. >> >> My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files maybe using a Bio::Index::*, look ups for specific gene IDs, etc). How hard that would be to implement is another thing, have no idea w/o seeing what the data look like beyond they are in ASN1. > > :s I'm not sure I understood your suggestion. 
I think the problem is > just the introduction of a new concept, a "set" of stuff (genes in > this case), and how should SeqIO handle multiple sets. > > Carnë (note: the critical point in this is whether Bio::ASN1::EntrezGene would allow this; I'm not sure it would. Otherwise this is all really hand-wavy) To me a 'set of stuff', particularly when the 'stuff' is stored sequentially in a flat file, is a simple 'database' or 'store' of similar items, where the class gives one the ability to look up particular members in the set, but also could store higher level information about the set as a whole if needed. If it were me, I would implement a method particular to Bio::SeqIO::entrezgene that specifically creates and returns this ( next_geneset(), for instance ); next_seq() could then be implemented to iterate through the items in that database/store. Two useful things come out of this. First, if the data for the Entrez Gene file/chunk are parsed to store offsets per ID, one would only need to parse out the chunks needed (offset of ID to next offset), then pass that into the parser and create objects on the fly. This would probably be as fast or faster than (for instance) the greedy method of parsing the entire file and storing everything in objects up-front, then iterating through those objects one at a time, which I think is current behavior. Second: if an index is created, the upfront cost is already paid (you could reuse the same index when parsing the same data). An analogous example might be storing all FASTQ data in a sequencing run; I don't want to expend the effort to parse all the FASTQ data, but I may want to run operations on individual items in the set as well as store additional information about the data (barcodes per run, lanes, overall quality stats, etc). Does that make sense? The pieces for this are lying around (Bio::Index::* for instance has methods for indexing flat files, and classes like Bio::DB::Fasta).
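The offset-per-ID idea above can be sketched in plain Perl, without relying on any existing BioPerl class. Everything specific here is an assumption made for illustration: the toy record layout (an `ID <name>` header line, records terminated by `//`) is not the Entrez Gene ASN.1 format, and the helper names `build_offset_index` and `fetch_record` are hypothetical. A real implementation would need its own way of detecting record boundaries.

```perl
use strict;
use warnings;

# Toy flat file: each record starts with "ID <name>" and ends with "//".
# This layout is an assumption for illustration, not the Entrez Gene format.
my $data = join '',
    "ID geneA\nseq ACGT\n//\n",
    "ID geneB\nseq TTGACA\n//\n",
    "ID geneC\nseq GG\n//\n";

# Scan the file once, remembering the byte offset at which each record starts.
sub build_offset_index {
    my ($fh) = @_;
    my %index;
    seek $fh, 0, 0;
    while (1) {
        my $pos  = tell $fh;    # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        $index{$1} = $pos if $line =~ /^ID\s+(\S+)/;
    }
    return \%index;
}

# Jump straight to one record and read only that chunk.
sub fetch_record {
    my ($fh, $index, $id) = @_;
    return unless exists $index->{$id};
    seek $fh, $index->{$id}, 0;
    my $record = '';
    while (defined(my $line = <$fh>)) {
        $record .= $line;
        last if $line =~ m{^//};    # end-of-record marker
    }
    return $record;
}

# An in-memory filehandle stands in for a real file on disk.
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $index = build_offset_index($fh);
print fetch_record($fh, $index, 'geneB');
```

Once built, such an index could be written to disk and reused for later lookups on the same file, so the scanning cost is paid only once; only the requested chunk would then be handed to the (much more expensive) record parser.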
chris From shalabh.sharma7 at gmail.com Sun May 19 19:33:16 2013 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Sun, 19 May 2013 15:33:16 -0400 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz> References: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz> Message-ID: Thanks Russell, Actually i wanted all the Bacterial gene nucleotide files, so i parsed it from *gbk. But yes these files might help me for my other parts of my work. Thanks Shalabh On Sun, May 19, 2013 at 3:26 PM, Smithies, Russell < Russell.Smithies at agresearch.co.nz> wrote: > Another option that I've used before is to download the gene2accession, > gene2refseq, and gene_info files from here > ftp://ftp.ncbi.nih.gov/gene/DATA/ then parse out the data you require. > It might work for you? > > --Russell > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto: > bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma > Sent: Saturday, 18 May 2013 4:26 a.m. > To: Francisco J. Ossand?n > Cc: bioperl-l > Subject: Re: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > Hey Francisco, > Thanks a lot. Basically i just wanted gene nucleotide fasta > files with GI numbers. > I think i will have to parse it from gbk files. > > -Shalabh > > > On Fri, May 17, 2013 at 11:59 AM, Francisco J. 
Ossand?n < > fossandonc at hotmail.com> wrote: > > > Hi, > > You can get the annotations from here: > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > > > > The ".ffn" are the genes nucleotide fasta files but it does not show > > the product name, on the other hand the ".faa" are the genes aminoacid > > fasta files and shows the product name, but if you want both product > > and nucleotide is much better to use the Genbank ".gbk" files that > > contains the complete data and you can parse it easily using BioPerl > > to obtain all genes, and then print the /protein_id, /product, and the > > nucleotide sequences in a new fasta file. Check these to see how to do > > it: > > http://www.bioperl.org/wiki/HOWTO:SeqIO > > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > > > Cheers, > > > > Francisco J. Ossandon > > > > -----Mensaje original----- > > De: bioperl-l-bounces at lists.open-bio.org > > [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh > > sharma Enviado el: viernes, 17 de mayo de 2013 10:55 > > Para: bioperl-l > > Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > > > HI, > > First of all i am really sorry for sending this mail > > here, i am not sure if this is the right forum. I know lot of people > > work on similar stuff. > > I wrote to NCBI but nobody replied. > > > > Actually i am looking for all bacterial/microbial gene annotation > > nucleotide fasta files. > > Does anyone knows where to download these kind of files. > > I tried *ffn files but they are not annotated. > > Or is there any module in bioperl that i can use ? > > I would really appreciate your help. 
> > > > Thanks > > Shalabh > > > > -- > > Shalabh Sharma > > Scientific Computing Professional Associate (Bioinformatics > > Specialist) Department of Marine Sciences University of Georgia > > Athens, GA 30602-3636 _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics Specialist) > Department of Marine Sciences University of Georgia Athens, GA 30602-3636 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. 
> ======================================================================= > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 From Russell.Smithies at agresearch.co.nz Sun May 19 19:26:35 2013 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Mon, 20 May 2013 07:26:35 +1200 Subject: [Bioperl-l] Downloading annotated Gene nucleotide fasta files In-Reply-To: References: Message-ID: <18DF7D20DFEC044098A1062202F5FFF37365BCAEC2@exchsth.agresearch.co.nz> Another option that I've used before is to download the gene2accession, gene2refseq, and gene_info files from here ftp://ftp.ncbi.nih.gov/gene/DATA/ then parse out the data you require. It might work for you? --Russell -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma Sent: Saturday, 18 May 2013 4:26 a.m. To: Francisco J. Ossand?n Cc: bioperl-l Subject: Re: [Bioperl-l] Downloading annotated Gene nucleotide fasta files Hey Francisco, Thanks a lot. Basically i just wanted gene nucleotide fasta files with GI numbers. I think i will have to parse it from gbk files. -Shalabh On Fri, May 17, 2013 at 11:59 AM, Francisco J. Ossand?n < fossandonc at hotmail.com> wrote: > Hi, > You can get the annotations from here: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > > The ".ffn" are the genes nucleotide fasta files but it does not show > the product name, on the other hand the ".faa" are the genes aminoacid > fasta files and shows the product name, but if you want both product > and nucleotide is much better to use the Genbank ".gbk" files that > contains the complete data and you can parse it easily using BioPerl > to obtain all genes, and then print the /protein_id, /product, and the > nucleotide sequences in a new fasta file. 
Check these to see how to do > it: > http://www.bioperl.org/wiki/HOWTO:SeqIO > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > Cheers, > > Francisco J. Ossandon > > -----Mensaje original----- > De: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de shalabh > sharma Enviado el: viernes, 17 de mayo de 2013 10:55 > Para: bioperl-l > Asunto: [Bioperl-l] Downloading annotated Gene nucleotide fasta files > > HI, > First of all i am really sorry for sending this mail > here, i am not sure if this is the right forum. I know lot of people > work on similar stuff. > I wrote to NCBI but nobody replied. > > Actually i am looking for all bacterial/microbial gene annotation > nucleotide fasta files. > Does anyone knows where to download these kind of files. > I tried *ffn files but they are not annotated. > Or is there any module in bioperl that i can use ? > I would really appreciate your help. > > Thanks > Shalabh > > -- > Shalabh Sharma > Scientific Computing Professional Associate (Bioinformatics > Specialist) Department of Marine Sciences University of Georgia > Athens, GA 30602-3636 _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- Shalabh Sharma Scientific Computing Professional Associate (Bioinformatics Specialist) Department of Marine Sciences University of Georgia Athens, GA 30602-3636 _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. 
Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From miquel.ramia at uab.cat Tue May 21 15:08:18 2013 From: miquel.ramia at uab.cat (=?ISO-8859-1?Q?Miquel_R=E0mia?=) Date: Tue, 21 May 2013 17:08:18 +0200 Subject: [Bioperl-l] Problem compiling Bio::DB::Sam In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu> References: <5194B815.2010401@uab.cat> <"118F034CF4C3EF48A96F86CE585B94BF74E1F C2C"@CHIMBX5.ad.uillinois.edu> <118F034CF4C3EF48A96F86CE585B94BF74E212B0@CHIMBX5.ad.uillinois.edu> Message-ID: <519B8DE2.2070308@uab.cat> On 17/05/13 06:16, Fields, Christopher J wrote: > For the record, this is now fixed in the latest Bio::Samtools (via Lincoln). > > chris > > Compiled correctly! thank you -- Miquel Ràmia Jesús PhD. candidate (PIF) Evolutionary Bioinformatics Group (Genomics, Bioinformatics and Evolution Group) Lab MRB/014 - 93 586 89 58 MRB - Institut de Biologia i Biomedicina (IBB) Universitat Autònoma de Barcelona (UAB) 08193, Cerdanyola del Vallès Barcelona (Spain) From b.l.cohen_home at btinternet.com Mon May 20 18:49:50 2013 From: b.l.cohen_home at btinternet.com (Bernard Cohen) Date: Mon, 20 May 2013 19:49:50 +0100 (BST) Subject: [Bioperl-l] Phylip format error Message-ID: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Hello! I happen to have checked to see what the PERL webpage says about Phylip format for DNA alignment files and see that it is erroneous. I am not a PERL user and do not want to be bothered to register or otherwise learn how to make an official comment, so forward this for someone to pick up. Phylip format allows up to 10 spaces for taxon names; the data must start in the 11th space.
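To illustrate the fixed-width rule, here is a small Perl sketch that pads or truncates each name to exactly 10 characters so that the sequence data begins in column 11. The helper name `strict_phylip_line` and the example taxa are made up for this illustration, and only the simple sequential (non-interleaved) layout is covered.

```perl
use strict;
use warnings;

# Format one line of a strict, sequential PHYLIP alignment:
# the name field is exactly 10 characters wide (padded with blanks or
# truncated), so the sequence data starts in column 11.
sub strict_phylip_line {
    my ($name, $seq) = @_;
    return sprintf("%-10.10s%s", $name, $seq);
}

# Hypothetical example alignment: [ name, aligned sequence ]
my @aln = (
    [ 'Homo_sapiens_long_name', 'ACGTACGT' ],
    [ 'Mus',                    'ACGTTCGT' ],
);

# Header line: number of taxa, then number of alignment columns.
printf "%d %d\n", scalar @aln, length $aln[0][1];
print strict_phylip_line(@$_), "\n" for @aln;
```

Note that truncating to 10 characters can make two long names collide, which is one reason for swapping in short unique IDs before writing strict PHYLIP.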
This can be checked on Joe Felsenstein's site. The PERL page accessed by searching for "Phylip format PERL" allows only 8 spaces for the name. B. L. Cohen From senanu.pearson at gmail.com Wed May 22 20:15:24 2013 From: senanu.pearson at gmail.com (Senanu) Date: Wed, 22 May 2013 13:15:24 -0700 Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment Message-ID: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com> Hi all, I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong. I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus. (I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments. Is this a known problem? Is there another way to generate such a consensus? my $in = Bio::AlignIO->new(-file => $files[0], -format => 'XMFA'); while (my $aln = $in->next_aln()) { foreach my $seq ($aln->each_seq) { $seq->alphabet('dna'); } my $con = $aln->consensus_iupac(); } Thanks in advance. Ngwenyama From cjfields at illinois.edu Wed May 22 23:17:50 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 22 May 2013 23:17:50 +0000 Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment In-Reply-To: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com> References: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E2A01D@CHIMBX5.ad.uillinois.edu> On May 22, 2013, at 3:15 PM, Senanu wrote: > Hi all, > > I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong. Probably the former, but... > I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus.
(I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments. It shouldn't take that long, 7 Mb isn't that large. Or is that 7 Mb for one genome? > Is this a known problem? Is there another way to generate such a consensus? The code isn't really optimized for this, but again this isn't terribly large. Is the bottleneck reading the alignment in, or is it the consensus_iupac() step? Hard to say w/o seeing the alignment data itself. > my $in = Bio::AlignIO->new(-file => $files[0], > -format => 'XMFA'); > while (my $aln = $in->next_aln()) { > foreach my $seq ($aln->each_seq) { > $seq->alphabet('dna'); > } > my $con = $aln->consensus_iupac(); > } > > > Thanks in advance. > Ngwenyama > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l chris From alexeymorozov1991 at gmail.com Thu May 23 07:22:13 2013 From: alexeymorozov1991 at gmail.com (Alexey Morozov) Date: Thu, 23 May 2013 16:22:13 +0900 Subject: [Bioperl-l] Phylip format error In-Reply-To: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: Which is also worsened by the fact that there is relaxed phylip format, which allows up to 250 chars for taxon name. They are separated from a sequence by single space, which creates problems if names were extended to 10 chars in strict Felsenstein's format by whitespaces. On the whole, phylip is as messily defined format as one can make from a plain textfile with information content of fasta. Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed phylip and how does it tell dialects from one another. 
Even if code support is OK, it may be worthwile to explain it somewhere at bioperl.org 2013/5/21 Bernard Cohen > Hello! > > I happen to have checked to see what the PERL webpage says about Phylip > format for DNA alignment files and see that it is erroneous. > > I am not a PERL user and do not want to be bothered to register or > otherwise learn how to make an official comment, so forward this for > someone to pick up. > > Phylip format allows up to 10 spaces for taxon names; the data must start > in the 11th space. This can be checked on Jo Felsenstein's site. > > The PERL page accessed by searching for "Phylip format PERL" allows only 8 > spaces for the name. > > B. L. Cohen > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Alexey Morozov, LIN SB RAS, bioinformatics group. Irkutsk, Russia. From p.j.a.cock at googlemail.com Thu May 23 08:30:21 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 23 May 2013 09:30:21 +0100 Subject: [Bioperl-l] Phylip format error In-Reply-To: References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: On Thu, May 23, 2013 at 8:22 AM, Alexey Morozov wrote: > Which is also worsened by the fact that there is relaxed phylip format, > which allows up to 250 chars for taxon name. They are separated from a > sequence by single space, which creates problems if names were extended to > 10 chars in strict Felsenstein's format by whitespaces. On the whole, > phylip is as messily defined format as one can make from a plain textfile > with information content of fasta. > Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed > phylip and how does it tell dialects from one another. 
Even if code support
> is OK, it may be worthwhile to explain it somewhere at bioperl.org

Biopython's AlignIO defines both a (strict) "phylip" and "relaxed-phylip"
as two separate formats (or variants, like the "fastq" variants). Doing
the same in BioPerl would seem sensible since auto-detection is not
easy.

http://biopython.org/wiki/AlignIO#File_Formats

Peter

P.S. Where does that 250 characters for the taxon name limit come from?
The trouble with relaxed phylip is that some tools are more relaxed than
others ;)

From awitney at sgul.ac.uk Thu May 23 08:43:15 2013
From: awitney at sgul.ac.uk (Adam Witney)
Date: Thu, 23 May 2013 09:43:15 +0100
Subject: [Bioperl-l] Phylip format error
In-Reply-To:
References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com>
Message-ID: <519DD6A3.8090304@sgul.ac.uk>

Not sure if there is an actual question in these messages, but BioPerl can be used to generate valid Phylip format and run, like this:

## Build Align object
my $aln = Bio::SimpleAlign->new(-seqs=>$seqs);

## swap the taxa names with 8 characters long unique IDs
my ($aln_safe, $ref_name) = $aln->set_displayname_safe(8);

## Write out phylip format infile
Bio::AlignIO->new(-file=>'>infile.out', -format=>'phylip', -interleaved => 0)->write_aln($aln);

## run PHYLIP's pars program
my @params = (idlength=>10); #, jumble=>"17,10");
my $tree_factory = Bio::Tools::Run::Phylo::Phylip::Pars->new(@params);
$tree_factory->quiet(1); # Suppress pars messages to terminal
my $tree = $tree_factory->create_tree($aln_safe);

## fix the node labels back
my @nodes = sort { defined $a->id && defined $b->id && $a->id cmp $b->id } $tree->get_nodes();
foreach my $nd (@nodes) {
    if ( $nd->is_Leaf ) {
        $nd->id($ref_name->{$nd->id_output})
    }
}

HTH

Adam

On 23/05/2013 08:22, Alexey Morozov wrote:
> Which is also worsened by the fact that there is relaxed phylip format,
> which allows up to 250 chars for taxon name.
They are separated from a > sequence by single space, which creates problems if names were extended to > 10 chars in strict Felsenstein's format by whitespaces. On the whole, > phylip is as messily defined format as one can make from a plain textfile > with information content of fasta. > Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed > phylip and how does it tell dialects from one another. Even if code support > is OK, it may be worthwile to explain it somewhere at bioperl.org > > > 2013/5/21 Bernard Cohen > >> Hello! >> >> I happen to have checked to see what the PERL webpage says about Phylip >> format for DNA alignment files and see that it is erroneous. >> >> I am not a PERL user and do not want to be bothered to register or >> otherwise learn how to make an official comment, so forward this for >> someone to pick up. >> >> Phylip format allows up to 10 spaces for taxon names; the data must start >> in the 11th space. This can be checked on Jo Felsenstein's site. >> >> The PERL page accessed by searching for "Phylip format PERL" allows only 8 >> spaces for the name. >> >> B. L. Cohen >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > From cjfields at illinois.edu Thu May 23 13:48:31 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 23 May 2013 13:48:31 +0000 Subject: [Bioperl-l] Phylip format error In-Reply-To: References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E323A4@CITESMBX5.ad.uillinois.edu> Alexey, Just want to point out that 'relaxed phylip' format was introduced long after this parser was created; in fact (as Adam points out) there was an alternative workaround to deal with the lossy names. The content of that page is on a wiki, which anyone is free to edit (just need an OpenID to set up an account). 
chris On May 23, 2013, at 2:22 AM, Alexey Morozov wrote: > Which is also worsened by the fact that there is relaxed phylip format, > which allows up to 250 chars for taxon name. They are separated from a > sequence by single space, which creates problems if names were extended to > 10 chars in strict Felsenstein's format by whitespaces. On the whole, > phylip is as messily defined format as one can make from a plain textfile > with information content of fasta. > Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed > phylip and how does it tell dialects from one another. Even if code support > is OK, it may be worthwile to explain it somewhere at bioperl.org > > > 2013/5/21 Bernard Cohen > >> Hello! >> >> I happen to have checked to see what the PERL webpage says about Phylip >> format for DNA alignment files and see that it is erroneous. >> >> I am not a PERL user and do not want to be bothered to register or >> otherwise learn how to make an official comment, so forward this for >> someone to pick up. >> >> Phylip format allows up to 10 spaces for taxon names; the data must start >> in the 11th space. This can be checked on Jo Felsenstein's site. >> >> The PERL page accessed by searching for "Phylip format PERL" allows only 8 >> spaces for the name. >> >> B. L. Cohen >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > > -- > Alexey Morozov, > LIN SB RAS, bioinformatics group. > Irkutsk, Russia. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu May 23 14:05:32 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 23 May 2013 14:05:32 +0000 Subject: [Bioperl-l] Phylip format error In-Reply-To: References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu> On May 23, 2013, at 3:30 AM, Peter Cock wrote: > On Thu, May 23, 2013 at 8:22 AM, Alexey Morozov > wrote: >> Which is also worsened by the fact that there is relaxed phylip format, >> which allows up to 250 chars for taxon name. They are separated from a >> sequence by single space, which creates problems if names were extended to >> 10 chars in strict Felsenstein's format by whitespaces. On the whole, >> phylip is as messily defined format as one can make from a plain textfile >> with information content of fasta. >> Bioperl documentation says nothing about whether Bio::SeqIO accepts relaxed >> phylip and how does it tell dialects from one another. Even if code support >> is OK, it may be worthwile to explain it somewhere at bioperl.org > > Biopython's AlignIO defines both a (strict) "phylip" and "relaxed-phylip" > as two separate formats (or variants, like the "fastq" variants). Doing > the same in BioPerl would seem sensible since auto-detection is not > easy. > > http://biopython.org/wiki/AlignIO#File_Formats > > Peter > > P.S. Where does that 250 characters for the taxon name limit come from? > The trouble with relaxed phylip is that some tools are more relaxed than > others ;) As Adam pointed out, prior to the introduction of 'relaxed phylip' we had an alternative solution that didn't require a modified format but still allowed one to use PHYLIP and other tools requesting the format. 
I think 'relaxed phylip' was introduced by CIPRES a few years back. Frankly, this is the first time I have seen this mentioned on the list; yay, yet another format variation :)

The variant format parsing (as implemented for SeqIO::fastq, as you know) deals with variant names like 'fastq-sanger', where the main format name is first, the variant of the format second. The order in this case is reversed (relaxed-phylip), which I'm pretty sure will not work. Not impossible to allow, but we would probably allow support like this initially:

my $in = Bio::AlignIO->new(-format  => 'phylip',
                           -variant => 'relaxed',
                           ...);

chris

From cjfields at illinois.edu Thu May 23 13:56:32 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 23 May 2013 13:56:32 +0000
Subject: [Bioperl-l] Speed issues with making IUPAC consensus from alignment
In-Reply-To: <92CD2F9D-7C8E-4941-9354-4BD80D8C2CB6@gmail.com>
References: <1DE07C6C-0DEA-43EB-8F1A-BACB8534400F@gmail.com> <118F034CF4C3EF48A96F86CE585B94BF74E2A01D@CHIMBX5.ad.uillinois.edu> <92CD2F9D-7C8E-4941-9354-4BD80D8C2CB6@gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E32489@CITESMBX5.ad.uillinois.edu>

(keep the list cc'd)

On May 22, 2013, at 6:31 PM, Senanu wrote:

> On May 22, 2013, at 4:17 PM, Fields, Christopher J wrote:
>
>> Hi all,
>>>
>>> I am wondering if the consensus_iupac method of Bio::Align is known to be extremely slow, or if I'm doing something wrong.
>>
>> Probably the former, but...
>>
>>> I have bacterial whole-genome alignments (~7 Mbases) that I made in progressiveMauve and wish to get an IUPAC consensus. (I know that progressiveMauve uses a non-standard XMFA format, but Bio::AlignIO seems to read them just fine.) The code below takes more than all night to make a consensus. It works fine on tiny test alignments.
>>
>> It shouldn't take that long, 7 Mb isn't that large. Or is that 7 Mb for one genome?
> > It is 7Mb per genome, but there are only 2 genomes in the alignment, and the sequences are very similar to one another. > >> >>> Is this a known problem? Is there another way to generate such a consensus? >> >> The code isn't really optimized for this, but again this isn't terribly large. Is the bottleneck reading the alignment in, or is it the consensus_iupac() step? Hard to say w/o seeing the alignment data itself. > > The bottleneck is definitely with the consensus_iupac step. Reading the alignment in takes a few seconds. That's interesting, but again not surprising. One would have to look at the code, but I wouldn't be surprised if the method is terribly inefficient. chris From p.j.a.cock at googlemail.com Thu May 23 14:53:09 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 23 May 2013 15:53:09 +0100 Subject: [Bioperl-l] Phylip format error In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu> References: <1369075790.65165.YahooMailNeo@web87404.mail.ir2.yahoo.com> <118F034CF4C3EF48A96F86CE585B94BF74E32551@CITESMBX5.ad.uillinois.edu> Message-ID: On Thu, May 23, 2013 at 3:05 PM, Fields, Christopher J wrote: > > I think 'relaxed phylip' was introduced by CIPRES a few years back. > Frankly, this is the first time I have seen this mentioned on the list; yay, > yet another format variation :) The relaxed phylip 'format' goes back further than that, e.g. http://lists.open-bio.org/pipermail/biopython-dev/2008-June/003899.html RAxML and PHYML support relaxed phylip - but with their own ID limits. 
Peter

From Ben.Ward at sainsbury-laboratory.ac.uk Fri May 24 19:14:17 2013
From: Ben.Ward at sainsbury-laboratory.ac.uk (Ben Ward (TSL))
Date: Fri, 24 May 2013 19:14:17 +0000
Subject: [Bioperl-l] Building Many trees with Bioperl
Message-ID:

Hi,

I'm new to Bioperl and plan to make a script to automate making trees with many alignment files (themselves generated by automating the process of multiple alignment for many datasets by using clustalw in a bioperl script).
But I need to get it right for one piece of test data before I can do it for all:

What I have produced so far is the below. It's supposed to load in the alignment file as a SimpleAlign. Then use that alignment in phyml. I looked at the documentation and tried to follow examples.

However it gives me many errors like:
--------------------- WARNING ---------------------
MSG: Replacing one sequence [FXCNDTJ02P/1-366]

And then gives me:
Phyml command = /Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Phyml call (/Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output [/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip_phyml_stat.txt]: 11
STACK: Error::throw
STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472
STACK: Bio::Tools::Run::Phylo::Phyml::_run /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851
STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338
-----------------------------------------------------------

Can someone let me know if I'm going about this correctly and what I need to do to get it to work.
I've also tried to run phyml by giving the filename in the run() method like:
my $inputfilename = 'outputalignmentfile';
my $tree = $factory->run("$inputfilename");

But I get the same EXCEPTION: Bio::Root::Exception message.

Thanks,
Ben W.

SCRIPT ---

#!/usr/bin/perl
use warnings;
use strict;
BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' }
use Bio::TreeIO;
use Bio::AlignIO;
use Bio::Tools::Run::Phylo::Phyml;

my $alnin = Bio::AlignIO->new(-file   => "...",
                              -format => 'phylip');

my $aln = $alnin->next_aln();


# Make a Phyml factory.
my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose   => 2,
                                                 -data_type => 'dna');

# Pass the factory an alignment and run:
# my $inputfilename = 'outputalignmentfile';

my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object.


# Setup tree output stream...
my $treeio = Bio::TreeIO->new(-format => 'newick',
                              -file   => '>tree.newick'); # '>' opens the file for writing

$treeio->write_tree($tree);

exit 0;

From bosborne11 at verizon.net Fri May 24 21:25:38 2013
From: bosborne11 at verizon.net (Brian Osborne)
Date: Fri, 24 May 2013 17:25:38 -0400
Subject: [Bioperl-l] Building Many trees with Bioperl
In-Reply-To:
References:
Message-ID: <93D0963E-8399-468C-A32B-875FAA5054D1@verizon.net>

Ben,

What happens when you take the Phyml command itself and run it from the command line?

Also, a minor point: the message "MSG: Replacing one sequence [FXCNDTJ02P/1-366]" is not an error, it is a warning. An error accompanies an exit, warnings are just informative.

Brian O.


On May 24, 2013, at 3:14 PM, Ben Ward (TSL) wrote:

> Hi,
>
> I'm new to Bioperl and plan to make a script to automate making trees with many alignment files (themselves generated by automating the process of multiple alignment for many datasets by using clustalw in a bioperl script).
> But I need to get it right for one piece of test data before I can do it for all:
>
> What I have produced so far is the below. It's supposed to load in the alignment file as a SimpleAlign. Then use that alignment in phyml.
I looked at the documentation and tried to follow examples. > > However it gives me many errors like: > --------------------- WARNING --------------------- > MSG: Replacing one sequence [FXCNDTJ02P/1-366] > > And then gives me: > Phyml command = /Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Phyml call (/Users/wardb/clustalw2/phyml /var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output [/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phylip_phyml_stat.txt]: 11 > STACK: Error::throw > STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472 > STACK: Bio::Tools::Run::Phylo::Phyml::_run /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851 > STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338 > ----------------------------------------------------------- > > Can someone let me know if I'm going about this correctly and what I need to do to get it to work. I've also tried to run phyml by giving the filename in the run() method like: > my $inputfilename = 'outputalignmentfile'; > my $tree = $factory->run("$inputfilename"); > > But I get the same EXCEPTION: Bio::Root::Exception message. > > Thanks, > Ben W. > > SCRIPT --- > > #!/usr/bin/perl > use warnings; > use strict; > BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' } > use Bio::TreeIO; > use Bio::AlignIO; > use Bio::Tools::Run::Phylo::Phyml; > > my $alnin = Bio::AlignIO->new(-file => " -format => 'phylip'); > > my $aln = $alnin->next_aln(); > > > # Make a Phyml factory. > my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose => 2, > -data_type => 'dna'); > > # Pass the factory an alignment and run: > # my $inputfilename = 'outputalignmentfile'; > > my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object. 
> > > # Setup tree output stream... > my $treeio = Bio::TreeIO->new(-format => 'newick', > -file => 'tree.newick'); > > $treeio->write_tree($tree); > > exit 0; > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Ben.Ward at sainsbury-laboratory.ac.uk Fri May 24 21:46:40 2013 From: Ben.Ward at sainsbury-laboratory.ac.uk (Ben Ward (TSL)) Date: Fri, 24 May 2013 21:46:40 +0000 Subject: [Bioperl-l] Building Many trees with Bioperl In-Reply-To: <93D0963E-8399-468C-A32B-875FAA5054D1@verizon.net> Message-ID: Hi, If I just run phyml on the command line it seems to run ok - it accept's my file and appears to undergo the tree building process - I haven't actually see it complete yet, but ML and Bayes always does take a while - and I have many reads that need to be aligned. But PhyML gets to the point it asks - are you sure you want to proceed - I say yes, then it keeps quiet and is currently working along to itself: . 766 patterns found (out of a total of 795 sites). . 58 sites without polymorphism (7.30%). . Computing pairwise distances... . Building BioNJ tree... . WARNING: this analysis requires at least 556 MB of memory space. . Do you really want to proceed? [Y/n] Y It appears to be working =/ Best, Ben. On 24/05/2013 22:25, "Brian Osborne" wrote: >Ben, > >What happens when you take the Phyml command itself and run it from the >command line? > >Also, a minor point: the message "MSG: Replacing one sequence >[FXCNDTJ02P/1-366]" is not an error, it is a warning. An error >accompanies an exit, warnings are just informative. > >Brian O. > > >On May 24, 2013, at 3:14 PM, Ben Ward (TSL) > wrote: > >> Hi, >> >> I'm new to Bioperl and plan to make a script to automate making trees >>with many alignment files (themselves generated by automating the >>process of multiple alignment for many datasets by using clustalw in a >>bioperl script). 
>> But I need to get it right for one pice of test data before I can do it >>for all: >> >> What I have produced so far is the below. It's supposed to load in the >>alignment file as as SimpleAlign. Then use that alignment in phyml. I >>looked at the documentation and tried to follow examples. >> >> However it gives me many errors like: >> --------------------- WARNING --------------------- >> MSG: Replacing one sequence [FXCNDTJ02P/1-366] >> >> And then gives me: >> Phyml command = /Users/wardb/clustalw2/phyml >>/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyli >>p 0 i 1 0 HKY e e 1 e BIONJ y y >> >> ------------- EXCEPTION: Bio::Root::Exception ------------- >> MSG: Phyml call (/Users/wardb/clustalw2/phyml >>/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyli >>p 0 i 1 0 HKY e e 1 e BIONJ y y) did not give an output >>[/var/folders/kp/clkqvqn9739ffw2755zjwy74_skf_z/T/MTFzfN4jED/aln8058.phyl >>ip_phyml_stat.txt]: 11 >> STACK: Error::throw >> STACK: Bio::Root::Root::throw /Library/Perl/5.12/Bio/Root/Root.pm:472 >> STACK: Bio::Tools::Run::Phylo::Phyml::_run >>/Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:851 >> STACK: /Library/Perl/5.12/Bio/Tools/Run/Phylo/Phyml.pm:338 >> ----------------------------------------------------------- >> >> Can someone let me know if I'm going about this correctly and what I >>need to do to get it to work. I've also tried to run phyml by giving the >>filename in the run() method like: >> my $inputfilename = 'outputalignmentfile'; >> my $tree = $factory->run("$inputfilename"); >> >> But I get the same EXCEPTION: Bio::Root::Exception message. >> >> Thanks, >> Ben W. 
>> >> SCRIPT --- >> >> #!/usr/bin/perl >> use warnings; >> use strict; >> BEGIN { $ENV{PHYMLDIR} = '/Users/wardb/clustalw2' } >> use Bio::TreeIO; >> use Bio::AlignIO; >> use Bio::Tools::Run::Phylo::Phyml; >> >> my $alnin = Bio::AlignIO->new(-file => "> -format => 'phylip'); >> >> my $aln = $alnin->next_aln(); >> >> >> # Make a Phyml factory. >> my $factory = Bio::Tools::Run::Phylo::Phyml->new(-verbose => 2, >> -data_type => 'dna'); >> >> # Pass the factory an alignment and run: >> # my $inputfilename = 'outputalignmentfile'; >> >> my $tree = $factory->run($aln); # $tree is a Bio::Tree::Tree object. >> >> >> # Setup tree output stream... >> my $treeio = Bio::TreeIO->new(-format => 'newick', >> -file => 'tree.newick'); >> >> $treeio->write_tree($tree); >> >> exit 0; >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > From himaghna.bhattacharjee at gmail.com Tue May 28 15:59:02 2013 From: himaghna.bhattacharjee at gmail.com (Himaghna Bhattacharjee) Date: Tue, 28 May 2013 21:29:02 +0530 Subject: [Bioperl-l] error in the link to install Kobe repository for windows Message-ID: Hey, the link to install Kobe's repository for per 5.10 <" http://cpan.uwinnipeg.ca/PPMPackages/10xx/ --"> seems to be broken as it shows Error 503 Service Temporarily Unavailable. Could you please suggest an alternative ? Thanks . Himaghna Bhattacharjee 3rd year B.E.(Hons.)Chemical Engineering Birla Institute of Technology and Science,Pilani Rajasthan 333 031 From wgallin at ualberta.ca Tue May 28 17:49:02 2013 From: wgallin at ualberta.ca (Warren Gallin) Date: Tue, 28 May 2013 11:49:02 -0600 Subject: [Bioperl-l] ReplacedBy value in esummary Message-ID: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca> Hi, I just encountered a glitch when I was trying to update some entries in a database by finding updated GENBANK protein entries. 
The original record was gi 118091304 which has been replaced by gi 363734282

I retrieved an ESummary of the record for gi 118091304 as a Bio::Tools::EUtilities::Summary::DocSum object (called $ds).

When I then tried to retrieve the gi number for the replacement by using:

my $replaced = $ds->get_contents_by_name('ReplacedBy');

the returned value was 1, and when I dumped the ESummary record the relevant pair is ReplacedBy :XP_421022.3.

The full Esummary dump is:

UID :118091304
Caption :XP_421022
Title :PREDICTED: similar to Potassium voltage-gated channel, subfamily Q, member
:1 [Gallus gallus]
Extra :gi|118091304|ref|XP_421022.2|[118091304]
Gi :118091304
CreateDate :2004/07/28
UpdateDate :2006/11/16
Flags :512
TaxId :9031
Length :643
Status :replaced
ReplacedBy :XP_421022.3
Comment : This record was replaced or removed.


So two questions:

1) Is the ReplacedBy value always supposed to be the new accession number version rather than the new gi number?

2) Why would I be getting a returned value of 1 instead of the accession number that is in the summary record?

Any advice appreciated.

Warren Gallin

From cjfields at illinois.edu Tue May 28 18:31:30 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 28 May 2013 18:31:30 +0000
Subject: [Bioperl-l] ReplacedBy value in esummary
In-Reply-To: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca>
References: <81794641-2693-4476-B2FA-6B742190D141@ualberta.ca>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74E44377@CHIMBX5.ad.uillinois.edu>

On May 28, 2013, at 12:49 PM, Warren Gallin wrote:

> Hi,
>
> I just encountered a glitch when I was trying to update some entries in a database by finding updated GENBANK protein entries.
>
> The original record was gi 118091304 which has been replaced by gi 363734282
>
> I retrieved an ESummary of the record for gi 118091304 as a Bio::Tools::EUtilities::Summary::DocSum object (called $ds).
>
> When I then tried to retrieve the gi number for the replacement by using:
>
> my $replaced = $ds->get_contents_by_name('ReplacedBy');
>
> the returned value was 1, and when I dumped the ESummary record the relevant pair is ReplacedBy :XP_421022.3.
>
> The full Esummary dump is:
>
> UID :118091304
> Caption :XP_421022
> Title :PREDICTED: similar to Potassium voltage-gated channel, subfamily Q, member
> :1 [Gallus gallus]
> Extra :gi|118091304|ref|XP_421022.2|[118091304]
> Gi :118091304
> CreateDate :2004/07/28
> UpdateDate :2006/11/16
> Flags :512
> TaxId :9031
> Length :643
> Status :replaced
> ReplacedBy :XP_421022.3
> Comment : This record was replaced or removed.
>
>
> So two questions:
>
> 1) Is the ReplacedBy value always supposed to be the new accession number version rather than the new gi number?

No idea, the best people to answer that would be NCBI (the idea of these modules was to simplify getting at that data instead of munging the XML, but whatever they report is mainly from NCBI, not bioperl).

> 2) Why would I be getting a returned value of 1 instead of the accession number that is in the summary record?

The text dump above indicates the values do exist. However, you are calling a method that returns a list (note the plural in the name) in scalar context, so you get the number of values. If you always expect a single value, use:

my ($replaced) = $ds->get_contents_by_name('ReplacedBy');

which forces list context. That should fix it.

chris

> Any advice appreciated.
>
> Warren Gallin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
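
The scalar-versus-list context behaviour described in the reply above can be reproduced without BioPerl installed. What follows is only a minimal sketch, not the EUtilities implementation: get_contents_by_name here is a hypothetical stand-in sub, and the %summary hash is made-up illustration data that reuses the accession from the dump. It assumes the real method hands back its values as an array, which is what the observed return value of 1 suggests.

```perl
use strict;
use warnings;

# Hypothetical stand-in data; nothing here is fetched from NCBI.
my %summary = (ReplacedBy => ['XP_421022.3']);

# Hypothetical stand-in for a method that returns a list of values.
sub get_contents_by_name {
    my ($name) = @_;
    my @values = @{ $summary{$name} || [] };
    return @values;    # an array yields its element count in scalar context
}

# Scalar context: the assignment target is a plain scalar,
# so the result is the number of values.
my $count = get_contents_by_name('ReplacedBy');

# List context: the parentheses make this a list assignment,
# so the first returned value is stored.
my ($replaced) = get_contents_by_name('ReplacedBy');

print "scalar context: $count\n";
print "list context:   $replaced\n";
```

If a field may carry more than one value, assigning to an array (my @replaced = ...) keeps all of them instead of only the first.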