From jason at bioperl.org Mon Sep 1 03:42:26 2008
From: jason at bioperl.org (Jason Stajich)
Date: Mon, 1 Sep 2008 00:42:26 -0700
Subject: [Bioperl-l] Bio::Tools::dpAlign feature request
In-Reply-To: <868888.19741.qm@web30406.mail.mud.yahoo.com>
References: <868888.19741.qm@web30406.mail.mud.yahoo.com>
Message-ID: <99A57523-94B3-4A56-A617-22675FA10AB8@bioperl.org>

Safe to ignore the tests. Those that are failing aren't even tests
for Bio::Align::dpAlign - but were written to test a bug that has
not been fixed in the EVD module. If I remember correctly, that is why
they are marked in a TODO block, but I can't tell if the Test.pm is
actually skipping these tests or not.

I think we probably need to deprecate some of these modules as there
is no maintainer of Ewan's code in here.

At a minimum we need to modularize the tests for these modules into
separate t dir and fix the need for multiple Makefile.PL in here and
probably move to Build.PL

-jason
On Aug 28, 2008, at 11:00 AM, Yee Man Chan wrote:

>
> Hi Alexie
>
> My understanding is that you can ignore these failures.
>
> I believe test cases 17-20 were added by Jason Stajich before I
> added the feature you requested. I am not sure what he was doing
> there.
>
> I suppose he can give you the definite answer to whether this is
> something important or not.
>
> By the way, did you try out the new feature? Does it work?
>
> Thanks
> Yee Man
>
> --- On Thu, 8/28/08, Alexie Papanicolaou wrote:
>
>> From: Alexie Papanicolaou
>> Subject: Re: Bio::Tools::dpAlign feature request
>> To: "Yee Man Chan"
>> Date: Thursday, August 28, 2008, 6:15 AM
>> hi
>>
>> is the version you emailed me newer or older than the
>> subversion one?
>>
>> i'm testing the subversion version for bioperl-ext and
>>
>> not ok 17 # TODO evalues vary based on platform, needs fixing
>> # Failed (TODO) test at test.pl line 156.
>> # got: '2027805538'
>> # expected: '1764904'
>> not ok 18 # TODO evalues vary based on platform, needs fixing
>> # Failed (TODO) test at test.pl line 157.
>> # got: '-1375488148'
>> # expected: '1764872'
>> not ok 19 # TODO evalues vary based on platform, needs fixing
>> # Failed (TODO) test at test.pl line 158.
>> # got: '-808808307'
>> # expected: '1764872'
>> not ok 20 # TODO evalues vary based on platform, needs fixing
>> # Failed (TODO) test at test.pl line 159.
>> # got: '-2118162890'
>> # expected: '1764872'
>>
>> you think these are ok to ignore?
>>
>>
>> On Fri, 2008-08-01 at 21:03 -0700, Yee Man Chan wrote:
>>
>>> Hi Alexie
>>>
>>> Attached are the files that contains the feature you requested.
>>> linspc.c is the one that does the work and test.pl has a test
>>> case for it. The scoring scheme is as described before. Please
>>> let me know if it works.
>>>
>>> Yee Man
>>>
>>> --- On Wed, 7/30/08, Alexie Papanicolaou wrote:
>>>
>>>> From: Alexie Papanicolaou
>>>> Subject: Re: Bio::Tools::dpAlign feature request
>>>> To: ymc at yahoo.com
>>>> Date: Wednesday, July 30, 2008, 2:44 PM
>>>> Oh sorry
>>>>
>>>> Say match=3 and mismatch=-1, gopen= -10, gext=-5
>>>> for aligning
>>>> seq1: ATG
>>>> seq2: ATT
>>>> match: 3,3,-1
>>>>
>>>> seq1: AT-G
>>>> seq2: ATTG
>>>> match: 3,3,-1,-10,3
>>>>
>>>> is that possible? or am I missing something? I was only
>>>> today wondering if it is even possible...
>>>>
>>>> a
>>>>
>>>>
>>>> Yee Man Chan wrote:
>>>>> Sorry, I don't quite get it. Can you give me an
>>>>> example of the output you want?
>>>>>
>>>>> Yee Man
>>>>>
>>>>> --- On Wed, 7/30/08, Alexie Papanicolaou wrote:
>>>>>
>>>>>> From: Alexie Papanicolaou
>>>>>> Subject: Re: Bio::Tools::dpAlign feature request
>>>>>> To: ymc at yahoo.com
>>>>>> Date: Wednesday, July 30, 2008, 9:50 AM
>>>>>> Dear Yee Man,
>>>>>>
>>>>>> Do you think it is possible to code a method for creating a
>>>>>> delimited (space or comma) "score-line"?
>>>>>>
>>>>>> I'd like to parse it into an array and have the individual
>>>>>> score for each alignment position. Is it easy to do?
>>>>>>
>>>>>> a
>>>>>>
>>>>>> Yee Man Chan wrote:
>>>>>>
>>>>>>> Hi Alexie
>>>>>>>
>>>>>>> How about I implement the simple case?
>>>>>>>
>>>>>>> So for match = +3, mismatch = -1,
>>>>>>>
>>>>>>> A and R = +3
>>>>>>> A and Y = -1
>>>>>>> A and B = -1
>>>>>>> A and D = +3
>>>>>>> A and N = +3
>>>>>>> A and X = -1
>>>>>>>
>>>>>>> What do you think?
>>>>>>> Yee Man
>>>>>>>
>>>>>>>
>>>>>>> --- On Tue, 7/29/08, Alexie Papanicolaou wrote:
>>>>>>>
>>>>>>>> From: Alexie Papanicolaou
>>>>>>>> Subject: Re: Bio::Tools::dpAlign feature request
>>>>>>>> To: ymc at yahoo.com
>>>>>>>> Date: Tuesday, July 29, 2008, 10:58 AM
>>>>>>>> Dear Yee Man,
>>>>>>>> hello, I was wondering how is this progressing and if you
>>>>>>>> need help?
>>>>>>>>
>>>>>>>> many thanks
>>>>>>>> alexie
>>>>>>>>
>>>>>>>> Yee Man Chan wrote:
>>>>>>>>
>>>>>>>>> Hi Alexie
>>>>>>>>>
>>>>>>>>> There are two ways to compute the score for each aligned
>>>>>>>>> basepair in dpAlign module. One is match/mismatch if you
>>>>>>>>> specify your sequence as DNA and the other is a scoring
>>>>>>>>> matrix if you specify your sequence as protein.
>>>>>>>>> Obviously, the latter can completely dominate the former.
>>>>>>>>> If you take the time to type the scoring matrix file, then
>>>>>>>>> you can handle those IUPAC code by specifying the sequence
>>>>>>>>> as proteins.
>>>>>>>>>
>>>>>>>>> If you think this is too troublesome, then I might be able
>>>>>>>>> to extend the match/mismatch route to handle IUPAC codes.
>>>>>>>>> But the problem here is, how should I score a match of A
>>>>>>>>> and W when match is +3 and mismatch is -1? Should it have a
>>>>>>>>> score of +3/3 = +1 for match or +3/3-1*2/3 = +1/3? Do you
>>>>>>>>> know what the convention is? If not, maybe you can tell me
>>>>>>>>> what you think the score will be?
>>>>>>>>>
>>>>>>>>> Yee Man
>>>>>>>>>
>>>>>>>>> --- On Thu, 6/26/08, Alexie Papanicolaou wrote:
>>>>>>>>>
>>>>>>>>>> From: Alexie Papanicolaou
>>>>>>>>>> Subject: Bio::Tools::dpAlign feature request
>>>>>>>>>> To: ymc at yahoo.com
>>>>>>>>>> Date: Thursday, June 26, 2008, 4:15 AM
>>>>>>>>>> Dear Yee Man Chan,
>>>>>>>>>>
>>>>>>>>>> Many thank you for this module. I like it very much. I was
>>>>>>>>>> wondering if it would be possible for you to allow for
>>>>>>>>>> IUPAC DNA codes.
>>>>>>>>>> I see it is in your TODO list and I hoped to inspire you
>>>>>>>>>> :-)
>>>>>>>>>>
>>>>>>>>>> Even a simple measure with the degenerate base containing
>>>>>>>>>> the aligned base count as a (perfect) match would be very
>>>>>>>>>> useful to me (i'm sorry, i'm not a good coder to do it
>>>>>>>>>> myself).
>>>>>>>>>>
>>>>>>>>>> many thanks for your work so far.
>>>>>>>>>> alexie
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> --
>>>>>>>>>> "Eppur si evolve" ("And yet it evolves")
>>>>>>>>>> -Galileo Jr (ca 21st century)
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Alexie Papanicolaou
>>>>>>>>>> Entomology
>>>>>>>>>> Max Planck Institute for Chemical Ecology
>>>>>>>>>> Hans Knoell Str 8
>>>>>>>>>> Jena 07745
>>>>>>>>>> Germany
>>>>>>>>>> Email apapanicolaou at ice.mpg.de
>>>>>>>>>> Tel +493641571561
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> --
>>>>>>>> "Eppur si evolve" ("And yet it evolves")
>>>>>>>> -Galileo Jr (ca 21st century)
>>>>>>>>
>>>>>>>> "One Galileo in two thousand years is enough."
>>>>>>>> -Pope Pius XII
>>>>>>>> --
>>>>>>>> Alexie Papanicolaou
>>>>>>>> Entomology
>>>>>>>> Max Planck Institute for Chemical Ecology
>>>>>>>> Hans Knoell Str 8
>>>>>>>> Jena 07745
>>>>>>>> Germany
>>>>>>>> Email apapanicolaou at ice.mpg.de
>>>>>>>> Tel +493641571561
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> --
>>>>>> "Eppur si evolve" ("And yet it evolves")
>>>>>> -Galileo Jr (ca 21st century)
>>>>>>
>>>>>> "One Galileo in two thousand years is enough."
>>>>>> -Pope Pius XII
>>>>>> --
>>>>>> Alexie Papanicolaou
>>>>>> Entomology
>>>>>> Max Planck Institute for Chemical Ecology
>>>>>> Hans Knoell Str 8
>>>>>> Jena 07745
>>>>>> Germany
>>>>>> Email apapanicolaou at ice.mpg.de
>>>>>> Tel +493641571561
>>>>>>
>>>>
>>>> --
>>>> --
>>>> "Eppur si evolve" ("And yet it evolves")
>>>> -Galileo Jr (ca 21st century)
>>>>
>>>> "One Galileo in two thousand years is enough."
>>>> -Pope Pius XII
>>>> --
>>>> Alexie Papanicolaou
>>>> Entomology
>>>> Max Planck Institute for Chemical Ecology
>>>> Hans Knoell Str 8
>>>> Jena 07745
>>>> Germany
>>>> Email apapanicolaou at ice.mpg.de
>>>> Tel +493641571561
>>>
>>
>> --
>> --
>> "Eppur si evolve" ("And yet it evolves")
>> -Galileo Jr (ca 21st century)
>>
>> "One Galileo in two thousand years is enough."
>> -Pope Pius XII
>> --
>> Alexie Papanicolaou
>> Entomology
>> Max Planck Institute for Chemical Ecology
>> Hans Knoell Str 8
>> Jena 07745
>> Germany
>> Email apapanicolaou at ice.mpg.de
>> Tel +493641571561
>
>
>

Jason Stajich
jason at bioperl.org

From cjfields at illinois.edu Mon Sep 1 13:49:56 2008
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 1 Sep 2008 12:49:56 -0500
Subject: [Bioperl-l] Bio::Tools::dpAlign feature request
In-Reply-To: <99A57523-94B3-4A56-A617-22675FA10AB8@bioperl.org>
References: <868888.19741.qm@web30406.mail.mud.yahoo.com>
	<99A57523-94B3-4A56-A617-22675FA10AB8@bioperl.org>
Message-ID: <44D25826-BE42-4539-8567-468FE74571B4@illinois.edu>

On pSW: I agree, I don't think it is worth maintaining it considering
there are actively supported C/C++-based toolkits with similar
functionality (SeqAn) and Petr's BioLib initiative will likely be a
more maintainable effort.

chris

On Sep 1, 2008, at 2:42 AM, Jason Stajich wrote:

> Safe to ignore the tests.
> Those that are failing aren't even test
> for Bio::Align::dpAlign - but were written to test a bug that has
> not been fixed in the EVD module if I remember correctly that is why
> they are marked in a TODO block, but I can't tell if the Test.pm is
> actually skipping these tests or not.
>
> I think we probably need to deprecate some of these modules as there
> is no maintainer of Ewan's code in here.
>
> At a minimum we need to modularlize the tests for these modules into
> separate t dir and fix the need for multiple Makefile.PL in here and
> probably move to Build.PL
>
> -jason
>
> Jason Stajich
> jason at bioperl.org
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign

From lmanchon at univ-montp2.fr Wed Sep 3 09:04:14 2008
From: lmanchon at univ-montp2.fr (Laurent Manchon)
Date: Wed, 03 Sep 2008 15:04:14 +0200
Subject: [Bioperl-l] parsing result of CAP3 (ACE file)
Message-ID: <5.0.2.1.2.20080903150203.00c0db18@pop.univ-montp2.fr>

-- Hi,

Is somebody have a piece of code to parse result of CAP3 assembly program
which format is ACE ?
I need to retrieve the alignment from this file.

thank you,
Laurent --

+---------------------------------------------+
Laurent Manchon
Email: lmanchon at univ-montp2.fr
+---------------------------------------------+

From osborne6 at gmail.com Wed Sep 3 10:33:15 2008
From: osborne6 at gmail.com (John Osborne)
Date: Wed, 3 Sep 2008 09:33:15 -0500
Subject: [Bioperl-l] interpro parsing enhancement?
Message-ID: <324fccf0809030733y5c5e8592t5f0617d0ff4d2203@mail.gmail.com>

Hi - I'm wondering if anyone is working on adding functionality to
Bio::SeqIO::interpro to grab the Gene Ontology/GO classifications out of
the interpro xml output? I've started working on that myself, but wanted
to check if anyone else is doing the same.

Thanks!

--
John Osborne
osborne6 at ieee.org/osborne6 at gmail.com/jro at freeshell.org

From gundalav at gmail.com Wed Sep 3 10:48:07 2008
From: gundalav at gmail.com (Gundala Viswanath)
Date: Wed, 3 Sep 2008 23:48:07 +0900
Subject: [Bioperl-l] Fitch's Parsimony Algorithm with Perl
Message-ID: <73f827b50809030748o725d3772m681af9da3c0c26c0@mail.gmail.com>

Hi,

What's a correct way to implement Fitch's parsimony algorithm?
Especially to compute minimum substitution rate per column in the
aligned sequence. Is there a Bioperl module to do it?

For example

CGGCGGAAAACTGTCCTCCGTGC mouse
CGACGGAACATTCTCCTCCGCGC rat
CGACGGAATATTCCCCTCCGTGC human
CGACGGAAGACTCTCCTCCGTGC chimp
00100000302011000000100 -> number of subst per site (max parsimony)

My code below doesn't seem to do the job.
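
For readers following the question above, here is a minimal sketch of a
tree-aware count. It is an illustration under stated assumptions, not a
BioPerl module: the nested-array tree, the %seq hash and the names
fitch_node and parsimony_per_site are all made up for this example.
Because the posted tree has three branches at the root, the sketch uses
the usual generalization of Fitch's bottom-up pass to multifurcating
nodes (keep the states seen in the most children, charge one
substitution for every child lacking them).

```perl
use strict;
use warnings;

# Assumed representation of (mouse,rat,(human,chimp)):
# internal nodes are array refs, leaves are taxon names.
my $tree = [ 'mouse', 'rat', [ 'human', 'chimp' ] ];

my %seq = (
    mouse => 'CGGCGGAAAACTGTCCTCCGTGC',
    rat   => 'CGACGGAACATTCTCCTCCGCGC',
    human => 'CGACGGAATATTCCCCTCCGTGC',
    chimp => 'CGACGGAAGACTCTCCTCCGTGC',
);

# For one alignment column, return (state set as hash ref, subtree cost).
sub fitch_node {
    my ( $node, $pos ) = @_;

    # Leaf: the only candidate state is the observed base, at no cost.
    if ( !ref $node ) {
        my %leaf = ( substr( $seq{$node}, $pos, 1 ) => 1 );
        return ( \%leaf, 0 );
    }

    # Internal node: add up child costs and count, for every state,
    # how many children list it as a candidate.
    my %count;
    my $cost = 0;
    for my $child (@$node) {
        my ( $set, $child_cost ) = fitch_node( $child, $pos );
        $cost += $child_cost;
        $count{$_}++ for keys %$set;
    }

    # Keep the states supported by the most children; each child that
    # does not support them needs one substitution on its branch.
    my ($best) = sort { $b <=> $a } values %count;
    my %set = map { $_ => 1 } grep { $count{$_} == $best } keys %count;
    return ( \%set, $cost + scalar(@$node) - $best );
}

# Minimum number of substitutions for every column of the alignment.
sub parsimony_per_site {
    my $len = length $seq{mouse};
    return map { ( fitch_node( $tree, $_ ) )[1] } 0 .. $len - 1;
}

print join( '', parsimony_per_site() ), "\n";
# prints 00100000302011000000100
```

For a node with exactly two children this reduces to the classic Fitch
rule: a non-empty intersection costs nothing, an empty one costs one
substitution and keeps the union.
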
__BEGIN__
use Data::Dumper;
use List::MoreUtils qw(uniq);

# The related phylogenetic tree in Newick format is:
my $tree = ' (mouse,rat,(human,chimp))';

my $sites = [
    'CGGCGGAAAACTGTCCTCCGTGC',    # mouse
    'CGACGGAACATTCTCCTCCGCGC',    # rat
    'CGACGGAATATTCCCCTCCGTGC',    # human
    'CGACGGAAGACTCTCCTCCGTGC',    # chimp
];

my @val = my_parsimony($sites);
print Dumper \@val;

sub my_parsimony {
    my $tfbs    = shift;
    my $mlen    = length( $tfbs->[0] );
    my $sum_min = 0;
    my @mincol;

    foreach my $pos ( 0 .. $mlen - 1 ) {
        my @colbp = ();
        foreach my $site ( @{$tfbs} ) {
            my $bp = substr( $site, $pos, 1 );
            push @colbp, $bp;
        }

        # this heuristic seems to be faulty
        # Column 11 it predicts 1 instead of 2
        # Not sure how can I make use of the tree
        my $min_mm = scalar( uniq(@colbp) ) - 1;
        push @mincol, $min_mm;
    }
    return @mincol;
}
__END__

- Gundala Viswanath
Jakarta - Indonesia

From raulmendez at cbm.uam.es Wed Sep 3 10:33:46 2008
From: raulmendez at cbm.uam.es (Raul Mendez Giraldez)
Date: Wed, 03 Sep 2008 16:33:46 +0200
Subject: [Bioperl-l] SeqHound
In-Reply-To: <111DD141-75F8-4437-9EAD-E049BBADB515@uiuc.edu>
Message-ID: <1220452426.31595.92.camel@pepa.cbm.uam.es>

Hi Chris,

I'm trying to set up and run bioperl Seqhound downloaded from:

http://bond.unleashedinformatics.com/downloads/api//seqhound-bioperl-4.0.tar.gz

and I always get connection error messages. Do you know which version of
SeqHound I should use and how I can configure it to make it work?

I've tried several possibilities for server1 at .shoundremrc as

[remote]
server1 = bond.unleashedinformatics.com
CGI = /cgi-bin/seqrem
port=8080

Also, I would like to get all the possible protein-protein interactions
for a set of protein sequences. Would this be possible using SeqHound?

Thanks,

Raúl
--
Raúl Méndez Giráldez, Ph.D.
Bioinformatics Unit
Centre for Molecular Biology "Severo Ochoa"
Universidad Autónoma de Madrid
C/ Nicolás Cabrera, 1
Cantoblanco
28049, Madrid
SPAIN
Phone: +34 91 196 4633

From jaudall at gmail.com Wed Sep 3 11:38:08 2008
From: jaudall at gmail.com (Joshua Udall)
Date: Wed, 3 Sep 2008 09:38:08 -0600
Subject: [Bioperl-l] parsing result of CAP3 (ACE file)
In-Reply-To: <5.0.2.1.2.20080903150203.00c0db18@pop.univ-montp2.fr>
References: <5.0.2.1.2.20080903150203.00c0db18@pop.univ-montp2.fr>
Message-ID: <52cea20c0809030838k6fb1498btc15b8e76d98f9d70@mail.gmail.com>

Laurent -

I have modified modules that will do it as I recently ran into problems
with the DB_FILE module in Assembly::IO. In addition, the current version
of cap3 seems to put a contig length where a pad length is expected (based
on the Ace format description). The modules I have will parse the ace file
contig-by-contig rather than having the entire assembly slurped into
memory (or a tied hash) all at once.

You are welcome to them if you are interested and I'd like to get them in
Bioperl at some point. Basically, there are three files - a modified
Contig.pm, ContigIO.pm, and a modified ace.pm (in a ContigIO directory).

Josh

On Wed, Sep 3, 2008 at 7:04 AM, Laurent Manchon wrote:

> -- Hi,
>
> Is somebody have a piece of code to parse result of CAP3 assembly program
> which
> format is ACE ?
> I need to retrieve the alignment from this file.
>
> thank you,
> Laurent --
>
>
>
>
> +---------------------------------------------+
> Laurent Manchon
> Email: lmanchon at univ-montp2.fr
> +---------------------------------------------+
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
Joshua Udall
Assistant Professor
295 WIDB
Plant and Wildlife Science Dept.
Brigham Young University
Provo, UT 84602
801-422-9307
Fax: 801-422-0008
USA

From hartzell at alerce.com Wed Sep 3 19:19:45 2008
From: hartzell at alerce.com (George Hartzell)
Date: Wed, 3 Sep 2008 16:19:45 -0700
Subject: [Bioperl-l] What's up with line 248 of Bio::Coordinate::Pair?
Message-ID: <18623.7057.95449.99461@almost.alerce.com>

Ok, confess. None of you know what's up with line 248 of
Bio::Coordinate::Pair, do you? You probably don't even know what's
*on* that line. Wonder how many will go look.

Now that I either have your attention or have pissed you off (or
both...), I think that creating a new Bio::Location::Split object in
Bio::Coordinate::Pair::map() is a leftover or something, but I'm not
quite sure enough to excise it and commit the change.

Anyone up for it?

g.

From cjfields at illinois.edu Wed Sep 3 21:29:49 2008
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 3 Sep 2008 20:29:49 -0500
Subject: [Bioperl-l] What's up with line 248 of Bio::Coordinate::Pair?
In-Reply-To: <18623.7057.95449.99461@almost.alerce.com>
References: <18623.7057.95449.99461@almost.alerce.com>
Message-ID:

Well, it doesn't look like the SplitLocation is even used, so I think
it is safe to remove.

chris

On Sep 3, 2008, at 6:19 PM, George Hartzell wrote:

>
> Ok, confess. None of you know what's up with line 248 of
> Bio::Coordinate::Pair, do you? You probably don't even know what's
> *on* that line. Wonder how many will go look.
>
> Now that I either have your attention or have pissed you off (or
> both...), I think that creating a new Bio::Location::Split object in
> Bio::Coordinate::Pair::map() is a leftover or something, but I'm not
> quite sure enough to excise it and commit the change.
>
> Anyone up for it?
>
> g.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign

From jason at bioperl.org Thu Sep 4 00:40:38 2008
From: jason at bioperl.org (Jason Stajich)
Date: Wed, 3 Sep 2008 21:40:38 -0700
Subject: [Bioperl-l] What's up with line 248 of Bio::Coordinate::Pair?
In-Reply-To:
References: <18623.7057.95449.99461@almost.alerce.com>
Message-ID: <222DB5D8-BCCB-448E-BDEE-068A4A432660@bioperl.org>

Agreed - I don't know if that was something that was changed mid-stream,
but removing it should cause no pain...

-j

On Sep 3, 2008, at 6:29 PM, Chris Fields wrote:

> Well, it doesn't look like the SplitLocation is even used, so I
> think it is safe to remove.
>
> chris
>
> On Sep 3, 2008, at 6:19 PM, George Hartzell wrote:
>
>>
>> Ok, confess. None of you know what's up with line 248 of
>> Bio::Coordinate::Pair, do you? You probably don't even know what's
>> *on* that line. Wonder how many will go look.
>>
>> Now that I either have your attention or have pissed you off (or
>> both...), I think that creating a new Bio::Location::Split object in
>> Bio::Coordinate::Pair::map() is a leftover or something, but I'm not
>> quite sure enough to excise it and commit the change.
>>
>> Anyone up for it?
>>
>> g.
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason at bioperl.org

From heikki at sanbi.ac.za Thu Sep 4 02:17:31 2008
From: heikki at sanbi.ac.za (Heikki Lehvaslaiho)
Date: Thu, 4 Sep 2008 08:17:31 +0200
Subject: [Bioperl-l] What's up with line 248 of Bio::Coordinate::Pair?
In-Reply-To: <222DB5D8-BCCB-448E-BDEE-068A4A432660@bioperl.org>
References: <18623.7057.95449.99461@almost.alerce.com>
	<222DB5D8-BCCB-448E-BDEE-068A4A432660@bioperl.org>
Message-ID: <200809040817.31916.heikki@sanbi.ac.za>

Guilty. So I removed the line.

George, please do not try to piss us off. You can get all the attention
you want from us. :)

What are you planning to do to Bio::Coordinate classes?

	-Heikki

On Thursday 04 September 2008 06:40:38 Jason Stajich wrote:
> Agreed - I don't know if that was something was changed mid-stream,
> but removing it should cause no pain...
> -j
>
> On Sep 3, 2008, at 6:29 PM, Chris Fields wrote:
> > Well, it doesn't look like the SplitLocation is even used, so I
> > think it is safe to remove.
> >
> > chris
> >
> > On Sep 3, 2008, at 6:19 PM, George Hartzell wrote:
> >> Ok, confess. None of you know what's up with line 248 of
> >> Bio::Coordinate::Pair, do you? You probably don't even know what's
> >> *on* that line. Wonder how many will go look.
> >>
> >> Now that I either have your attention or have pissed you off (or
> >> both...), I think that creating a new Bio::Location::Split object in
> >> Bio::Coordinate::Pair::map() is a leftover or something, but I'm not
> >> quite sure enough to excise it and commit the change.
> >>
> >> Anyone up for it?
> >>
> >> g.
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Christopher Fields
> > Postdoctoral Researcher
> > Lab of Dr.
Marie-Claude Hofmann > > College of Veterinary Medicine > > University of Illinois Urbana-Champaign > > > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Jason Stajich > jason at bioperl.org > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- ______ _/ _/_____________________________________________________ _/ _/ _/ _/ _/ Heikki Lehvaslaiho heikki at_sanbi _ac _za _/_/_/_/_/ Senior Scientist skype: heikki_lehvaslaiho _/ _/ _/ SANBI, South African National Bioinformatics Institute _/ _/ _/ University of Western Cape, South Africa _/ Phone: +27 21 959 2096 FAX: +27 21 959 2512 ___ _/_/_/_/_/________________________________________________________ From hartzell at alerce.com Thu Sep 4 12:00:21 2008 From: hartzell at alerce.com (George Hartzell) Date: Thu, 4 Sep 2008 09:00:21 -0700 Subject: [Bioperl-l] What's up with line 248 of Bio::Coordinate::Pair? In-Reply-To: <200809040817.31916.heikki@sanbi.ac.za> References: <18623.7057.95449.99461@almost.alerce.com> <222DB5D8-BCCB-448E-BDEE-068A4A432660@bioperl.org> <200809040817.31916.heikki@sanbi.ac.za> Message-ID: <18624.1557.758033.258065@almost.alerce.com> Heikki Lehvaslaiho writes: > Quilty. So I removed the line. > > George, please do not try to piss us off. You can get all the attention you > want from us. :) Try not. Do... or do not. There is no try. > What are you planning to do to Bio::Coordinate classes? It's a project for a paying customer (yikes...). Pretty much exactly what GeneMapper does, though I'll probably end up dangling a couple more named coordinate spaces off of it. Nothing earth shattering. The classes look *great*. g. 
From hartzell at alerce.com Fri Sep 5 15:01:35 2008 From: hartzell at alerce.com (George Hartzell) Date: Fri, 05 Sep 2008 12:01:35 -0700 Subject: [Bioperl-l] Bio::Coordinate::Collection could DoWhatIMean better (w/ patch) Message-ID: Hi all, Bio::Coordinate::Collection surprised me a bit. At first I thought there was a bug, but it's clearly doing what it's supposed to. Now I'm wondering if what it's supposed to be doing makes sense in some context, or if what I expected would be better functionality. t/CoordinateMapper.t sets up the following scenario: # # Collection # # 1 5 6 10 # |---| |---| #-----|----------------------- # 1 5 9 15 19 # pair1 pair2 Then goes on to do the following query: # match more than two $pos = Bio::Location::Simple->new (-start => 5, -end => 19 ); ok $res = $transcribe->map($pos); is $res->each_gap, 2; is $res->each_match, 2; I was surprised to see that there were two gaps, one gene:10-19 and one from gene:5-14. Looking at the code, what's really happening is that, for the exon1 mapper there's match with gene:5-9 and a gap with gene:10-19 and for the exon2 mapper there's a gap with gene:5-14 and a match with gene:15-19. All four Result's just get tossed into the return value. The result my intuition wants is that there are two matches (gene:5-9 with exon1 and gene:15-19 with exon2) and a gap (gene:10-14). Yes, I guess that I could just synthesize these myself from the result in my app. It still seems that the current result is a bug though, since there's no way of knowing when you're walking through $res->each_Location that the first "gap" is with respect to the exon1 mapper and that the second "gap" is with respect to the exon2 mapper. The gaps are meaningless. I "fixed" it to work the way I think it should (two matches, one gap). I actually extended the test case a bit so that there's a multi-base gap, a match, another multibase-gap, another match, then a single base gap (just to make sure I got that right...). 
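The result described above (two matches and one gap for the gene:5-19 query) can be sketched in plain Perl without BioPerl. This is only an illustration under stated assumptions: the helper name split_matches_and_gaps and the [start, end] array-ref representation are made up here, the mapper ranges are assumed not to overlap, and this is not the code in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl or of the patch discussed here.
# Takes a query range and the "in" ranges of the individual pair mappers
# (assumed non-overlapping), and returns the matched pieces plus the
# stretches of the query that no mapper covers.
sub split_matches_and_gaps {
    my ($q_start, $q_end, @ranges) = @_;
    my (@matches, @gaps);
    my $cursor = $q_start;    # first query position not yet accounted for

    for my $range (sort { $a->[0] <=> $b->[0] } @ranges) {
        my ($start, $end) = @$range;
        next if $end < $cursor;     # mapper lies entirely before the cursor
        last if $start > $q_end;    # mapper lies entirely after the query

        # Anything between the cursor and this mapper is a gap.
        push @gaps, [$cursor, $start - 1] if $start > $cursor;

        # Clip the mapper's range to the part of the query still unassigned.
        my $m_start = $start > $cursor ? $start : $cursor;
        my $m_end   = $end < $q_end    ? $end   : $q_end;
        push @matches, [$m_start, $m_end];
        $cursor = $m_end + 1;
    }

    # Whatever is left after the last mapper is a trailing gap.
    push @gaps, [$cursor, $q_end] if $cursor <= $q_end;
    return (\@matches, \@gaps);
}

# The query from t/CoordinateMapper.t: gene:5-19 against exon1 (gene:5-9)
# and exon2 (gene:15-19).
my ($matches, $gaps) = split_matches_and_gaps(5, 19, [5, 9], [15, 19]);
print "matches: ", join(", ", map { "$_->[0]-$_->[1]" } @$matches), "\n";
print "gaps:    ", join(", ", map { "$_->[0]-$_->[1]" } @$gaps), "\n";
# prints:
# matches: 5-9, 15-19
# gaps:    10-14
```

Widening the query to 1-20 with the same two ranges gives the extended case mentioned above: a multi-base gap (1-4), a match, another multi-base gap (10-14), another match, then a single-base gap (20-20).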
I had to touch up the test file a bit to account for my new test. The gaps that I return have a strand of 'undef', which seems to be The Right Thing. There's also a bit of funny business where I hang onto the seq_id of the gapped sequence. It assumes that the "in" sequence is the same for all of the mappers. This seems safe since otherwise the entire query is kind of weird.... There's a patch to todays svn head at: http://shrimp.alerce.com/bioperl/collection-diffs.txt The patch changes Build.PL to include a dependency on Set::IntSpan, CoordinateMapper.t to update the tests, and Bio/CoordinateMapper/Collection.pm for the new code. Who's code would this break. If anyone's relying on the current behaviour re: gaps, what's the situation in which you find it useful? Thanks! g. From ajmackey at gmail.com Fri Sep 5 16:54:56 2008 From: ajmackey at gmail.com (Aaron Mackey) Date: Fri, 5 Sep 2008 16:54:56 -0400 Subject: [Bioperl-l] Bio::Coordinate::Collection could DoWhatIMean better (w/ patch) In-Reply-To: References: Message-ID: <24c96eca0809051354h5b7218edtaa720140901d023f@mail.gmail.com> There are two uses for Collection: 1) all the "in" seq_id's are the same, and George's patch makes sense to me (i.e. agrees with my intuition) 2) all the "in" seq_id's are *not* the same (i.e. the collection is just a hash of indivual pairs), in which case my query would only match the subset of pairs having identical seq_id's to that specified by the query ... and then you're back to case #1 So overall, it looks like this was a bug, but I'd of course want to hear Heikki's opinion. Thanks for raising this, -Aaron On Fri, Sep 5, 2008 at 3:01 PM, George Hartzell wrote: > > Hi all, > > Bio::Coordinate::Collection surprised me a bit. At first I thought > there was a bug, but it's clearly doing what it's supposed to. Now > I'm wondering if what it's supposed to be doing makes sense in some > context, or if what I expected would be better functionality. 
> > t/CoordinateMapper.t sets up the following scenario: > > # > # Collection > # > # 1 5 6 10 > # |---| |---| > #-----|----------------------- > # 1 5 9 15 19 > # pair1 pair2 > > Then goes on to do the following query: > > # match more than two > $pos = Bio::Location::Simple->new (-start => 5, -end => 19 ); > ok $res = $transcribe->map($pos); > is $res->each_gap, 2; > is $res->each_match, 2; > > I was surprised to see that there were two gaps, one gene:10-19 and > one from gene:5-14. Looking at the code, what's really happening is > that, for the exon1 mapper there's match with gene:5-9 and a gap with > gene:10-19 and for the exon2 mapper there's a gap with gene:5-14 and a > match with gene:15-19. All four Result's just get tossed into the > return value. > > The result my intuition wants is that there are two matches (gene:5-9 > with exon1 and gene:15-19 with exon2) and a gap (gene:10-14). > > Yes, I guess that I could just synthesize these myself from the result > in my app. > > It still seems that the current result is a bug though, since there's > no way of knowing when you're walking through $res->each_Location that > the first "gap" is with respect to the exon1 mapper and that the > second "gap" is with respect to the exon2 mapper. The gaps are > meaningless. > > I "fixed" it to work the way I think it should (two matches, one > gap). I actually extended the test case a bit so that there's a > multi-base gap, a match, another multibase-gap, another match, then a > single base gap (just to make sure I got that right...). I had to > touch up the test file a bit to account for my new test. > > The gaps that I return have a strand of 'undef', which seems to be The > Right Thing. There's also a bit of funny business where I hang onto > the seq_id of the gapped sequence. It assumes that the "in" sequence > is the same for all of the mappers. This seems safe since otherwise > the entire query is kind of weird.... 
> > There's a patch to todays svn head at: > > http://shrimp.alerce.com/bioperl/collection-diffs.txt > > The patch changes Build.PL to include a dependency on Set::IntSpan, > CoordinateMapper.t to update the tests, and > Bio/CoordinateMapper/Collection.pm for the new code. > > Who's code would this break. > > If anyone's relying on the current behaviour re: gaps, what's the > situation in which you find it useful? > > Thanks! > > g. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From jimhu at tamu.edu Mon Sep 8 14:44:22 2008 From: jimhu at tamu.edu (Jim Hu) Date: Mon, 8 Sep 2008 13:44:22 -0500 Subject: [Bioperl-l] Circular genomes in Chado/BioPerl Message-ID: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> In discussions with GMOD about Gbrowse, we've come up with a proposal for handling circular genomes and features that cross the origin in such genomes. This applies to lots of prokaryotic and viral genomes, and might be valuable for some ways of representing terminally redundant linear genomes. 1) Keep the requirement that start < end 2) allow end > parent feature length 3) parent feature gets an is_circular boolean 4) use modular arithmetic to calculate the real position of end on the parent feature. We'd like to do this in a way that will be consistent with Chado and BioPerl representation of features as much as possible (realizing that there is the usual interbase or not coordinate issue). What do people think? Lincoln is on board for modifying the GFF3 spec. Thanks! Jim Hu ===================================== Jim Hu Associate Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. 
College Station, TX 77843-2128 979-862-4054 From ajmackey at gmail.com Mon Sep 8 15:57:50 2008 From: ajmackey at gmail.com (Aaron Mackey) Date: Mon, 8 Sep 2008 15:57:50 -0400 Subject: [Bioperl-l] [Gmod-schema] Circular genomes in Chado/BioPerl In-Reply-To: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> References: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> Message-ID: <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com> How can you handle features that may cross the origin more than once? The modulus, though simple, seems to be only half the solution. It also makes it difficult to place features in the genome "by eye" (having to do the modulus subtraction in my head), or in sorting/filtering operations. I have an alternative that I wondered if you considered: allow the start/end to have an additional "circular revolution" prefix: a typical range tuple like: 100 200 - is thus shorthand for: 0:100 0:200 - (i.e. both the 100 and 200 are in the same "revolution" around the genome) and is then distinguishable from an "around the genome + 100" feature of: 1:100 0:200 - Just an alternative to consider (if you haven't already). I'm not wedded to the syntax, but I wouldn't want to see new columns in GFF just for this. Essentially, what you want is some form of compound polar coordinates, it seems. -Aaron On Mon, Sep 8, 2008 at 2:44 PM, Jim Hu wrote: > In discussions with GMOD about Gbrowse, we've come up with a proposal for > handling circular genomes and features that cross the origin in such > genomes. This applies to lots of prokaryotic and viral genomes, and might > be valuable for some ways of representing terminally redundant linear > genomes. > 1) Keep the requirement that start < end > 2) allow end > parent feature length > 3) parent feature gets an is_circular boolean > 4) use modular arithmetic to calculate the real position of end on the > parent feature. 
> We'd like to do this in a way that will be consistent with Chado and BioPerl > representation of features as much as possible (realizing that there is the > usual interbase or not coordinate issue). What do people think? Lincoln is > on board for modifying the GFF3 spec. > Thanks! > Jim Hu > > ===================================== > > Jim Hu > > Associate Professor > > Dept. of Biochemistry and Biophysics > > 2128 TAMU > > Texas A&M Univ. > > College Station, TX 77843-2128 > > 979-862-4054 > > > ------------------------------------------------------------------------- > This SF.Net email is sponsored by the Moblin Your Move Developer's challenge > Build the coolest Linux based applications with Moblin SDK & win great > prizes > Grand prize is a trip for two to an Open Source event anywhere in the world > http://moblin-contest.org/redirect.php?banner_id=100&url=/ > _______________________________________________ > Gmod-schema mailing list > Gmod-schema at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/gmod-schema > > From js5 at sanger.ac.uk Mon Sep 8 16:13:12 2008 From: js5 at sanger.ac.uk (James Smith) Date: Mon, 8 Sep 2008 21:13:12 +0100 (BST) Subject: [Bioperl-l] Circular genomes in Chado/BioPerl In-Reply-To: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> References: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> Message-ID: On Mon, 8 Sep 2008, Jim Hu wrote: > In discussions with GMOD about Gbrowse, we've come up with a proposal for > handling circular genomes and features that cross the origin in such genomes. > This applies to lots of prokaryotic and viral genomes, and might be valuable > for some ways of representing terminally redundant linear genomes. > > 1) Keep the requirement that start < end > 2) allow end > parent feature length > 3) parent feature gets an is_circular boolean > 4) use modular arithmetic to calculate the real position of end on the parent > feature. 
This is how we are considering handling features in Ensembl as well (the Ensembl genomes project will be setting up websites for bacterial and viral genomes) > > We'd like to do this in a way that will be consistent with Chado and BioPerl > representation of features as much as possible (realizing that there is the > usual interbase or not coordinate issue). What do people think? Lincoln is > on board for modifying the GFF3 spec. > > Thanks! > > Jim Hu > > ===================================== > Jim Hu > Associate Professor > Dept. of Biochemistry and Biophysics > 2128 TAMU > Texas A&M Univ. > College Station, TX 77843-2128 > 979-862-4054 > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From heikki at sanbi.ac.za Tue Sep 9 03:50:11 2008 From: heikki at sanbi.ac.za (Heikki Lehvaslaiho) Date: Tue, 9 Sep 2008 09:50:11 +0200 Subject: [Bioperl-l] Bio::Coordinate::Collection could DoWhatIMean better (w/ patch) In-Reply-To: References: Message-ID: <200809090950.11821.heikki@sanbi.ac.za> George, This is an error from my side. Great that you have a fix already. My only worry is the number of external dependencies in BioPerl. To limit these we have recoded number of functionalities into BioPerl-specific modules. Before you commit the fix, could you see if Bio::RangeI could be used or easily extended to be used instead of Set::IntSpan? Thanks, -Heikki On Friday 05 September 2008 21:01:35 George Hartzell wrote: > Hi all, > > Bio::Coordinate::Collection surprised me a bit. At first I thought > there was a bug, but it's clearly doing what it's supposed to. 
Now > I'm wondering if what it's supposed to be doing makes sense in some > context, or if what I expected would be better functionality. > > t/CoordinateMapper.t sets up the following scenario: > > # > # Collection > # > # 1 5 6 10 > # |---| |---| > #-----|----------------------- > # 1 5 9 15 19 > # pair1 pair2 > > Then goes on to do the following query: > > # match more than two > $pos = Bio::Location::Simple->new (-start => 5, -end => 19 ); > ok $res = $transcribe->map($pos); > is $res->each_gap, 2; > is $res->each_match, 2; > > I was surprised to see that there were two gaps, one gene:10-19 and > one from gene:5-14. Looking at the code, what's really happening is > that, for the exon1 mapper there's match with gene:5-9 and a gap with > gene:10-19 and for the exon2 mapper there's a gap with gene:5-14 and a > match with gene:15-19. All four Result's just get tossed into the > return value. > > The result my intuition wants is that there are two matches (gene:5-9 > with exon1 and gene:15-19 with exon2) and a gap (gene:10-14). > > Yes, I guess that I could just synthesize these myself from the result > in my app. > > It still seems that the current result is a bug though, since there's > no way of knowing when you're walking through $res->each_Location that > the first "gap" is with respect to the exon1 mapper and that the > second "gap" is with respect to the exon2 mapper. The gaps are > meaningless. > > I "fixed" it to work the way I think it should (two matches, one > gap). I actually extended the test case a bit so that there's a > multi-base gap, a match, another multibase-gap, another match, then a > single base gap (just to make sure I got that right...). I had to > touch up the test file a bit to account for my new test. > > The gaps that I return have a strand of 'undef', which seems to be The > Right Thing. There's also a bit of funny business where I hang onto > the seq_id of the gapped sequence. 
It assumes that the "in" sequence > is the same for all of the mappers. This seems safe since otherwise > the entire query is kind of weird.... > > There's a patch to todays svn head at: > > http://shrimp.alerce.com/bioperl/collection-diffs.txt > > The patch changes Build.PL to include a dependency on Set::IntSpan, > CoordinateMapper.t to update the tests, and > Bio/CoordinateMapper/Collection.pm for the new code. > > Who's code would this break. > > If anyone's relying on the current behaviour re: gaps, what's the > situation in which you find it useful? > > Thanks! > > g. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- ______ _/ _/_____________________________________________________ _/ _/ _/ _/ _/ Heikki Lehvaslaiho heikki at_sanbi _ac _za _/_/_/_/_/ Senior Scientist skype: heikki_lehvaslaiho _/ _/ _/ SANBI, South African National Bioinformatics Institute _/ _/ _/ University of Western Cape, South Africa _/ Phone: +27 21 959 2096 FAX: +27 21 959 2512 ___ _/_/_/_/_/________________________________________________________
From Frigerio at pierroton.inra.fr Tue Sep 9 04:45:19 2008 From: Frigerio at pierroton.inra.fr (Jean-Marc FRIGERIO) Date: Tue, 9 Sep 2008 10:45:19 +0200 Subject: [Bioperl-l] parsing result of CAP3 (ACE file) Message-ID: <200809091045.19249.Frigerio@pierroton.inra.fr> > > -- Hi, > > > > Is somebody have a piece of code to parse result of CAP3 assembly program > > which > > format is ACE ? > > I need to retrieve the alignment from this file.
> > > > thank you, > > Laurent -- > > > > > > > > > > +---------------------------------------------+ > > Laurent Manchon > > Email: lmanchon at univ-montp2.fr > > +---------------------------------------------+ > > Laurent - > > I have modified modules that will do it as I recently ran into problems > with the DB_FILE module in Assembly::IO. In addition, the current version > of cap3 seems to put a contig length where a pad length is expected (based > on the Ace format description). The modules I have will parse the ace file > contig-by-contig rather than having the entire assembly slurped into memory > (or a tied hash) all at once. You are welcome to them if you are > interested and I'd like to get them in Bioperl at some point. Bascially, > there are three files - a modified Contig.pm, ContigIO.pm, and a modified > ace.pm (in a ContigIO directory). > > Josh Hi, Here are a 2 pieces of code running on an ace file (output of phrap is that the same as cap3 ?) ----------------------------- 1 ----------------------------------------- my $assembly = Bio::Assembly::IO->new('-file' => $file, '-format' => 'ace')->next_assembly; for my $contig ($assembly->all_contigs) { my $ct_seq = $contig->get_consensus_sequence; (my $ref_seq = uc $ct_seq->seq) =~ s/-//g; my $debut = $pos - 100 > 0 ? $pos - 100 : 1; my $fin = $pos + 100 <= length $ref_seq ? 
$pos + 100 : length $ref_seq; my $coll = $contig->get_features_collection; my @coll = $coll->features_in_range('-start' => $debut, '-end' => $fin); for my $tag (@coll) { next unless $tag->primary_tag eq 'comment'; #print "TAG: ",$tag->start,"\n"; my $tag_pos = $contig->change_coord('gapped consensus','ungapped consensus',$tag->start); #print "TAG POS: $tag_pos\n"; next if $pos == $tag_pos; substr($ref_seq,$tag_pos-1,1,'N'); } } ------------------------------------ 2 ------------------- my $assembly = Bio::Assembly::IO->new( '-file' => $file, '-format' => 'ace')->next_assembly; for my $contig ($assembly->all_contigs) { for my $seq ($contig->each_seq) { my $id = $seq->id; my $s = $seq->seq; my ($start,$end) = ($contig->change_coord("aligned $id","ungapped consensus", $seq->start), $contig->change_coord("aligned $id","ungapped consensus",$seq->end)); my $dir = $seq->strand < 0 ? 'R' : 'F'; ...... } -- Jean-Marc From zheboyang at gmail.com Tue Sep 9 07:05:15 2008 From: zheboyang at gmail.com (boyang zhe) Date: Tue, 9 Sep 2008 19:05:15 +0800 Subject: [Bioperl-l] help:HMM parsing error Message-ID: <127e75f60809090405h644e51eftcab073e8bf179720@mail.gmail.com> I write a script to parse the HMMER report ,it is as follows: #!/usr/bin/perl -w #TODO: Parse the HMMER report use strict; use Bio::SearchIO; my $directory="./HMM/"; opendir(HMMDIR, $directory), or die "Can't open the directory!"; my @filelist=readdir(HMMDIR); foreach my $filename(@filelist) { if ($filename !~/^\./) { my $infile="$directory"."$filename"; my $outfile="$infile"."HMMParse"; my $in = new Bio::SearchIO(-format => 'hmmer',-file =>"$infile"); - Ignored: while (my $result= $in->next_result ) { # get a Bio::Search::Result::HMMERResult object # get hits numbers my $hitnumber=$result->num_hits; if ($hitnumber != 0) { open(OUT, ">$outfile"), or die "can't open the output file!!!!"; while (my $hits= $result->next_hit ) { my $value=$hits->significance; if ($value <=0.01) { print OUT 
$hits->name,"\t",$hits->description,"\t",$hits->significance,"\n"; } } close OUT; } } } } closedir(HMMDIR); ############################################################## When it run, you will see that: -------------------- WARNING --------------------- MSG: unrecognized line: +E +L i T eek+ e+ ++ +l++H Y+ I+ + --------------------------------------------------- why? I hope to get your help, hanks very much! - Done. From bix at sendu.me.uk Tue Sep 9 07:46:39 2008 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 09 Sep 2008 12:46:39 +0100 Subject: [Bioperl-l] help:HMM parsing error In-Reply-To: <127e75f60809090405h644e51eftcab073e8bf179720@mail.gmail.com> References: <127e75f60809090405h644e51eftcab073e8bf179720@mail.gmail.com> Message-ID: <48C6621F.8040304@sendu.me.uk> boyang zhe wrote: > I write a script to parse the HMMER report ,it is as follows: [...] > my $in = new Bio::SearchIO(-format => 'hmmer',-file =>"$infile"); [...] > -------------------- WARNING --------------------- > MSG: unrecognized line: +E +L i T eek+ e+ ++ +l++H > Y+ I+ + > > --------------------------------------------------- > > why? I hope to get your help, hanks very much! I didn't check your code, but the easiest thing to try would be to use -format => 'hmmer_pull' to use an alternate parser that may be able to recognise that line. You might need to install the latest Bioperl from SVN (or at least 1.5.2) to get access to the hmmer_pull parser. From bosborne11 at verizon.net Tue Sep 9 10:50:38 2008 From: bosborne11 at verizon.net (Brian Osborne) Date: Tue, 9 Sep 2008 10:50:38 -0400 Subject: [Bioperl-l] SeqHound In-Reply-To: <1220452426.31595.92.camel@pepa.cbm.uam.es> References: <1220452426.31595.92.camel@pepa.cbm.uam.es> Message-ID: <55655D67-352A-4E0D-B402-7FC30628C1B1@verizon.net> Raul, After spending a few minutes at bond.unleashedinformatics.com I have to admit that it's not clear how one accesses their free version of BOND. There are no examples that I can see in their packages. 
If you are interested in looking at protein-protein networks in the Bioperl context you can also check out the bioperl-network package:

http://www.bioperl.org/wiki/Network_package

If you don't care what language you're using then you should consider Cytoscape; it's probably the package with the most capability.

Brian O.

On Sep 3, 2008, at 10:33 AM, Raul Mendez Giraldez wrote:

> Hi Chris,
>
> I'm trying to set up and run bioperl Seqhound downloaded from:
>
> http://bond.unleashedinformatics.com/downloads/api//seqhound-bioperl-4.0.tar.gz
>
> and I always get connection error messages. Do you know which version of
> SeqHound I should use and how I can configure it to make it work? I've
> tried several possibilities for server1 in .shoundremrc, such as
>
> [remote]
> server1 = bond.unleashedinformatics.com
> CGI = /cgi-bin/seqrem
> port=8080
>
> Also, I would like to get all the possible protein-protein interactions
> for a set of protein sequences. Would this be possible using SeqHound?
>
> Thanks,
> Raúl
>
> --
> Raúl Méndez Giráldez, Ph.D.
> Bioinformatics Unit
> Centre for Molecular Biology "Severo Ochoa"
> Universidad Autónoma de Madrid
> C/ Nicolás Cabrera, 1
> Cantoblanco 28049, Madrid
> SPAIN
>
> Phone: +34 91 196 4633
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From jimhu at tamu.edu Tue Sep 9 12:05:59 2008
From: jimhu at tamu.edu (Jim Hu)
Date: Tue, 9 Sep 2008 11:05:59 -0500
Subject: [Bioperl-l] [Gmod-schema] Circular genomes in Chado/BioPerl
In-Reply-To: <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com>
References: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com>
Message-ID:

Hi Aaron,

I was thinking this would be handled by making the end = parent feature length x 2 + end coord. Then end / parent length = number of times the feature crosses the origin.
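The arithmetic in this proposal can be sketched in a few lines of Perl. This is an illustration only: the helper names real_position and origin_crossings are hypothetical (not part of BioPerl, Chado or the GFF3 spec), and the sketch assumes 1-based inclusive coordinates with start lying within the parent, which is why it subtracts 1 before dividing.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of BioPerl, Chado or GFF3.
# Assumes 1-based, inclusive coordinates and 1 <= start <= parent length.

# Map a coordinate that may run past the parent's end back onto the circle.
sub real_position {
    my ($pos, $parent_len) = @_;
    return (($pos - 1) % $parent_len) + 1;
}

# Number of times a feature ending at $end has wrapped past the origin.
sub origin_crossings {
    my ($end, $parent_len) = @_;
    return int(($end - 1) / $parent_len);
}

my $parent_len = 1000;    # length of the circular parent feature
for my $feature ([100, 200], [900, 1100], [900, 2100]) {
    my ($start, $end) = @$feature;
    printf "%d..%d ends at %d, crosses the origin %d time(s)\n",
        $start, $end,
        real_position($end, $parent_len),
        origin_crossings($end, $parent_len);
}
# prints:
# 100..200 ends at 200, crosses the origin 0 time(s)
# 900..1100 ends at 100, crosses the origin 1 time(s)
# 900..2100 ends at 100, crosses the origin 2 time(s)
```

The last case is the "parent feature length x 2 + end coord" situation: a feature that starts at 900 on a 1000-base circle, goes round twice and stops at base 100 is stored with end = 2100.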
Jim On Sep 8, 2008, at 2:57 PM, Aaron Mackey wrote: > How can you handle features that may cross the origin more than once? > The modulus, though simple, seems to be only half the solution. It > also makes it difficult to place features in the genome "by eye" > (having to do the modulus subtraction in my head), or in > sorting/filtering operations. > > I have an alternative that I wondered if you considered: allow the > start/end to have an additional "circular revolution" prefix: > > a typical range tuple like: 100 200 - > is thus shorthand for: 0:100 0:200 - > (i.e. both the 100 and 200 are in the same "revolution" around the > genome) > > and is then distinguishable from an "around the genome + 100" > feature of: > 1:100 0:200 - > > Just an alternative to consider (if you haven't already). I'm not > wedded to the syntax, but I wouldn't want to see new columns in GFF > just for this. Essentially, what you want is some form of compound > polar coordinates, it seems. > > -Aaron > > On Mon, Sep 8, 2008 at 2:44 PM, Jim Hu wrote: >> In discussions with GMOD about Gbrowse, we've come up with a >> proposal for >> handling circular genomes and features that cross the origin in such >> genomes. This applies to lots of prokaryotic and viral genomes, >> and might >> be valuable for some ways of representing terminally redundant linear >> genomes. >> 1) Keep the requirement that start < end >> 2) allow end > parent feature length >> 3) parent feature gets an is_circular boolean >> 4) use modular arithmetic to calculate the real position of end on >> the >> parent feature. >> We'd like to do this in a way that will be consistent with Chado >> and BioPerl >> representation of features as much as possible (realizing that >> there is the >> usual interbase or not coordinate issue). What do people think? >> Lincoln is >> on board for modifying the GFF3 spec. >> Thanks! >> Jim Hu >> >> ===================================== >> >> Jim Hu >> >> Associate Professor >> >> Dept. 
of Biochemistry and Biophysics >> >> 2128 TAMU >> >> Texas A&M Univ. >> >> College Station, TX 77843-2128 >> >> 979-862-4054 >> >> >> ------------------------------------------------------------------------- >> This SF.Net email is sponsored by the Moblin Your Move Developer's >> challenge >> Build the coolest Linux based applications with Moblin SDK & win >> great >> prizes >> Grand prize is a trip for two to an Open Source event anywhere in >> the world >> http://moblin-contest.org/redirect.php?banner_id=100&url=/ >> _______________________________________________ >> Gmod-schema mailing list >> Gmod-schema at lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/gmod-schema >> >> ===================================== Jim Hu Associate Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054 From jason at bioperl.org Tue Sep 9 12:07:45 2008 From: jason at bioperl.org (Jason Stajich) Date: Tue, 9 Sep 2008 09:07:45 -0700 Subject: [Bioperl-l] help:HMM parsing error In-Reply-To: <48C6621F.8040304@sendu.me.uk> References: <127e75f60809090405h644e51eftcab073e8bf179720@mail.gmail.com> <48C6621F.8040304@sendu.me.uk> Message-ID: Although it would be good to fix the parser as well -- the best solution is to submit that report as a bug to bugzilla http://bugzilla.open-bio.org/ -jason On Sep 9, 2008, at 4:46 AM, Sendu Bala wrote: > boyang zhe wrote: >> I wrote a script to parse the HMMER report; it is as follows: > [...] >> my $in = Bio::SearchIO->new(-format => 'hmmer', -file => $infile); > [...] >> -------------------- WARNING --------------------- >> MSG: unrecognized line: +E +L i T eek+ e+ + >> + +l++H >> Y+ I+ + >> --------------------------------------------------- >> Why? I hope to get your help, thanks very much! > > I didn't check your code, but the easiest thing to try would be to > use -format => 'hmmer_pull' to use an alternate parser that may be > able to recognise that line.
You might need to install the latest > Bioperl from SVN (or at least 1.5.2) to get access to the hmmer_pull > parser. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From cain.cshl at gmail.com Tue Sep 9 13:33:12 2008 From: cain.cshl at gmail.com (Scott Cain) Date: Tue, 9 Sep 2008 13:33:12 -0400 Subject: [Bioperl-l] [Gmod-schema] Circular genomes in Chado/BioPerl In-Reply-To: References: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com> Message-ID: <536f21b00809091033v1e412f4ft54d8e139c347a20a@mail.gmail.com> Hi Jim and All, While I agree with Aaron's point that it is not easy to place features by visual inspection, this seems like a fairly minor point. The vast majority of GFF3 manipulation will be done in software, so as long as the API handles everything correctly, life is good. If we discount that objection, there doesn't seem to be much advantage of using Aaron's suggested method over Jim's. (As a side note--I have the same complaint about anything in XML--it is awful for a human to read. I still live with XML when I have to though :-) Additionally, the fact that Ensembl is using the same method as what Jim describes is a fairly powerful argument for doing the same. Hopefully there can be some code reuse. Scott On Tue, Sep 9, 2008 at 12:05 PM, Jim Hu wrote: > Hi Aaron, > I was thinking this would be handled by making the end=parent feature length > x 2 + end coord. end/parent length = number of times crosses origin. > Jim > On Sep 8, 2008, at 2:57 PM, Aaron Mackey wrote: > > How can you handle features that may cross the origin more than once? > The modulus, though simple, seems to be only half the solution. 
It > also makes it difficult to place features in the genome "by eye" > (having to do the modulus subtraction in my head), or in > sorting/filtering operations. > > I have an alternative that I wondered if you considered: allow the > start/end to have an additional "circular revolution" prefix: > > a typical range tuple like: 100 200 - > is thus shorthand for: 0:100 0:200 - > (i.e. both the 100 and 200 are in the same "revolution" around the genome) > > and is then distinguishable from an "around the genome + 100" feature of: > 1:100 0:200 - > > Just an alternative to consider (if you haven't already). I'm not > wedded to the syntax, but I wouldn't want to see new columns in GFF > just for this. Essentially, what you want is some form of compound > polar coordinates, it seems. > > -Aaron > > On Mon, Sep 8, 2008 at 2:44 PM, Jim Hu wrote: > > In discussions with GMOD about Gbrowse, we've come up with a proposal for > > handling circular genomes and features that cross the origin in such > > genomes. This applies to lots of prokaryotic and viral genomes, and might > > be valuable for some ways of representing terminally redundant linear > > genomes. > > 1) Keep the requirement that start < end > > 2) allow end > parent feature length > > 3) parent feature gets an is_circular boolean > > 4) use modular arithmetic to calculate the real position of end on the > > parent feature. > > We'd like to do this in a way that will be consistent with Chado and BioPerl > > representation of features as much as possible (realizing that there is the > > usual interbase or not coordinate issue). What do people think? Lincoln is > > on board for modifying the GFF3 spec. > > Thanks! > > Jim Hu > > ===================================== > > Jim Hu > > Associate Professor > > Dept. of Biochemistry and Biophysics > > 2128 TAMU > > Texas A&M Univ. 
> > College Station, TX 77843-2128 > > 979-862-4054 > > > > ===================================== > > Jim Hu > > Associate Professor > > Dept. of Biochemistry and Biophysics > > 2128 TAMU > > Texas A&M Univ. > > College Station, TX 77843-2128 > > 979-862-4054 > > -- ------------------------------------------------------------------------ Scott Cain, Ph. D.
cain.cshl at gmail.com GMOD Coordinator (http://gmod.org/) 216-392-3087 Cold Spring Harbor Laboratory From lincoln.stein at gmail.com Tue Sep 9 13:52:36 2008 From: lincoln.stein at gmail.com (Lincoln Stein) Date: Tue, 9 Sep 2008 13:52:36 -0400 Subject: [Bioperl-l] [Gmod-schema] Circular genomes in Chado/BioPerl In-Reply-To: <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com> References: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com> Message-ID: <6dce9a0b0809091052g1c398a84tfe8f89d1bf9132c8@mail.gmail.com> It seems to me that the proposed modulus syntax handles multiple revolutions. Consider a 100 bp genome (to make it simple) and a feature that starts at 50, goes around twice, and ends at position 60: start = 50 end = 260 length = end - start + 1 revolutions = int (length/genome) stop position = length % genome + 1 Lincoln On Mon, Sep 8, 2008 at 3:57 PM, Aaron Mackey wrote: > How can you handle features that may cross the origin more than once? > The modulus, though simple, seems to be only half the solution. It > also makes it difficult to place features in the genome "by eye" > (having to do the modulus subtraction in my head), or in > sorting/filtering operations. > > I have an alternative that I wondered if you considered: allow the > start/end to have an additional "circular revolution" prefix: > > a typical range tuple like: 100 200 - > is thus shorthand for: 0:100 0:200 - > (i.e. both the 100 and 200 are in the same "revolution" around the genome) > > and is then distinguishable from an "around the genome + 100" feature of: > 1:100 0:200 - > > Just an alternative to consider (if you haven't already). I'm not > wedded to the syntax, but I wouldn't want to see new columns in GFF > just for this. Essentially, what you want is some form of compound > polar coordinates, it seems. 
> > -Aaron > > On Mon, Sep 8, 2008 at 2:44 PM, Jim Hu wrote: > > In discussions with GMOD about Gbrowse, we've come up with a proposal for > > handling circular genomes and features that cross the origin in such > > genomes. This applies to lots of prokaryotic and viral genomes, and > might > > be valuable for some ways of representing terminally redundant linear > > genomes. > > 1) Keep the requirement that start < end > > 2) allow end > parent feature length > > 3) parent feature gets an is_circular boolean > > 4) use modular arithmetic to calculate the real position of end on the > > parent feature. > > We'd like to do this in a way that will be consistent with Chado and > BioPerl > > representation of features as much as possible (realizing that there is > the > > usual interbase or not coordinate issue). What do people think? Lincoln > is > > on board for modifying the GFF3 spec. > > Thanks! > > Jim Hu > > > > ===================================== > > > > Jim Hu > > > > Associate Professor > > > > Dept. of Biochemistry and Biophysics > > > > 2128 TAMU > > > > Texas A&M Univ. > > > > College Station, TX 77843-2128 > > > > 979-862-4054 > > > -- Lincoln D.
Stein Ontario Institute for Cancer Research 101 College St., Suite 800 Toronto, ON, Canada M5G0A3 416 673-8514 Assistant: Stacey Quinn Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 USA (516) 367-8380 Assistant: Sandra Michelsen From cjfields at illinois.edu Tue Sep 9 14:24:49 2008 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 9 Sep 2008 13:24:49 -0500 Subject: [Bioperl-l] [Gmod-schema] Circular genomes in Chado/BioPerl In-Reply-To: References: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com> Message-ID: Is there any particular reason we don't treat this similarly to the way BioPerl does, which is to simply treat the origin-overlapping feature as a split location? GenBank treats this similarly. For an faux example, the bug I just fixed for bugzilla has one: http://bugzilla.open-bio.org/show_bug.cgi?id=2579 An actual GenBank case is the Sulfolobus solfataricus genome (NC_002754), and I'm sure Jim could come up with more. The only caveat is whether we should represent this As for multiple revolutions, I'm not sure the hand-wringing about specifics is worth it unless we have explicit workable examples to test against (preferably examples which would potentially pop up), but Lincoln's proposal sounds fine. chris On Sep 9, 2008, at 11:05 AM, Jim Hu wrote: > Hi Aaron, > > I was thinking this would be handled by making the end=parent > feature length x 2 + end coord. end/parent length = number of times > crosses origin. > > Jim > > On Sep 8, 2008, at 2:57 PM, Aaron Mackey wrote: > >> How can you handle features that may cross the origin more than once? >> The modulus, though simple, seems to be only half the solution. It >> also makes it difficult to place features in the genome "by eye" >> (having to do the modulus subtraction in my head), or in >> sorting/filtering operations. 
>> >> I have an alternative that I wondered if you considered: allow the >> start/end to have an additional "circular revolution" prefix: >> >> a typical range tuple like: 100 200 - >> is thus shorthand for: 0:100 0:200 - >> (i.e. both the 100 and 200 are in the same "revolution" around the >> genome) >> >> and is then distinguishable from an "around the genome + 100" >> feature of: >> 1:100 0:200 - >> >> Just an alternative to consider (if you haven't already). I'm not >> wedded to the syntax, but I wouldn't want to see new columns in GFF >> just for this. Essentially, what you want is some form of compound >> polar coordinates, it seems. >> >> -Aaron >> >> On Mon, Sep 8, 2008 at 2:44 PM, Jim Hu wrote: >>> In discussions with GMOD about Gbrowse, we've come up with a >>> proposal for >>> handling circular genomes and features that cross the origin in such >>> genomes. This applies to lots of prokaryotic and viral genomes, >>> and might >>> be valuable for some ways of representing terminally redundant >>> linear >>> genomes. >>> 1) Keep the requirement that start < end >>> 2) allow end > parent feature length >>> 3) parent feature gets an is_circular boolean >>> 4) use modular arithmetic to calculate the real position of end on >>> the >>> parent feature. >>> We'd like to do this in a way that will be consistent with Chado >>> and BioPerl >>> representation of features as much as possible (realizing that >>> there is the >>> usual interbase or not coordinate issue). What do people think? >>> Lincoln is >>> on board for modifying the GFF3 spec. >>> Thanks! >>> Jim Hu >>> >>> ===================================== >>> >>> Jim Hu >>> >>> Associate Professor >>> >>> Dept. of Biochemistry and Biophysics >>> >>> 2128 TAMU >>> >>> Texas A&M Univ. 
>>> >>> College Station, TX 77843-2128 >>> >>> 979-862-4054 >>> >>> > > ===================================== > Jim Hu > Associate Professor > Dept. of Biochemistry and Biophysics > 2128 TAMU > Texas A&M Univ. > College Station, TX 77843-2128 > 979-862-4054 Christopher Fields Postdoctoral Researcher Lab of Dr. Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign From ajmackey at gmail.com Tue Sep 9 14:48:12 2008 From: ajmackey at gmail.com (Aaron Mackey) Date: Tue, 9 Sep 2008 14:48:12 -0400 Subject: [Bioperl-l] [Gmod-schema] Circular genomes in Chado/BioPerl In-Reply-To: <6dce9a0b0809091052g1c398a84tfe8f89d1bf9132c8@mail.gmail.com> References: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com> <6dce9a0b0809091052g1c398a84tfe8f89d1bf9132c8@mail.gmail.com> Message-ID: <24c96eca0809091148l738604a7q13fba54ac05de01c@mail.gmail.com> Right, the modulus calculation continues to work, but for instance, what'll happen when I now ask Gbrowse (or Ensembl) to show me positions 50..260? Will it show me 50 .. 60, 1:100, or "unroll" the genome twice from 50..260 (that'd be a pretty cute trick, by the way!)
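For what it's worth, the "unroll" reading can be sketched in a few lines of Python. This is only an illustration under assumed 1-based inclusive coordinates, with a hypothetical `unroll` helper; it says nothing about how Gbrowse or Ensembl actually behave.

```python
def unroll(start, end, genome_length):
    """Split a packed interval into one (start, end) segment per pass.

    Hypothetical helper: assumes 1-based inclusive coordinates, a start
    within 1..genome_length, and an end that may exceed genome_length
    on a circular parent.
    """
    if not 1 <= start <= genome_length or end < start:
        raise ValueError("expected 1 <= start <= genome_length and end >= start")
    segments = []
    seg_start = start
    remaining_end = end
    # Each time the feature runs past the last base, close the current
    # segment at the origin and restart at position 1.
    while remaining_end > genome_length:
        segments.append((seg_start, genome_length))
        seg_start = 1
        remaining_end -= genome_length
    segments.append((seg_start, remaining_end))
    return segments
```

On the 100 bp example, `unroll(50, 260, 100)` gives (50, 100), (1, 100), (1, 60), which together cover the 211 bp implied by end - start + 1. Note that the final position comes from the end coordinate, (260 - 1) % 100 + 1 = 60; applying the modulus to the length instead (211 % 100 + 1) would give 12.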
You're (re)using simple arithmetic to compress a compound coordinate into a single-valued coordinate (which I realize can be trivially packed and unpacked by software), but I worry about the downstream consequences of software having to always remember that the coordinates given may have to be unpacked or not, and not being able to immediately identify whether "260" is a real or compound coordinate. To say it another way, I'm happy (that is, don't care much) whether Chado or any other underlying data storage uses such compound coordinates, because only Chado-reliant tools will need to care; but I do worry about GFF3 as a (relatively) simple exchange format having that kind of silent bug-causing complexity. I'd much rather see GFF be syntactically explicit, and not quite so cleverly implicit. Just one GFF user's two cents, thanks for listening, -Aaron On Tue, Sep 9, 2008 at 1:52 PM, Lincoln Stein wrote: > It seems to me that the proposed modulus syntax handles multiple > revolutions. Consider a 100 bp genome (to make it simple) and a feature that > starts at 50, goes around twice, and ends at position 60: > > start = 50 > end = 260 > > length = end - start + 1 > revolutions = int (length/genome) > stop position = length % genome + 1 > > Lincoln > > On Mon, Sep 8, 2008 at 3:57 PM, Aaron Mackey wrote: >> >> How can you handle features that may cross the origin more than once? >> The modulus, though simple, seems to be only half the solution. It >> also makes it difficult to place features in the genome "by eye" >> (having to do the modulus subtraction in my head), or in >> sorting/filtering operations. >> >> I have an alternative that I wondered if you considered: allow the >> start/end to have an additional "circular revolution" prefix: >> >> a typical range tuple like: 100 200 - >> is thus shorthand for: 0:100 0:200 - >> (i.e. 
both the 100 and 200 are in the same "revolution" around the genome) >> >> and is then distinguishable from an "around the genome + 100" feature of: >> 1:100 0:200 - >> >> Just an alternative to consider (if you haven't already). I'm not >> wedded to the syntax, but I wouldn't want to see new columns in GFF >> just for this. Essentially, what you want is some form of compound >> polar coordinates, it seems. >> >> -Aaron >> >> On Mon, Sep 8, 2008 at 2:44 PM, Jim Hu wrote: >> > In discussions with GMOD about Gbrowse, we've come up with a proposal >> > for >> > handling circular genomes and features that cross the origin in such >> > genomes. This applies to lots of prokaryotic and viral genomes, and >> > might >> > be valuable for some ways of representing terminally redundant linear >> > genomes. >> > 1) Keep the requirement that start < end >> > 2) allow end > parent feature length >> > 3) parent feature gets an is_circular boolean >> > 4) use modular arithmetic to calculate the real position of end on the >> > parent feature. >> > We'd like to do this in a way that will be consistent with Chado and >> > BioPerl >> > representation of features as much as possible (realizing that there is >> > the >> > usual interbase or not coordinate issue). What do people think? >> > Lincoln is >> > on board for modifying the GFF3 spec. >> > Thanks! >> > Jim Hu >> > >> > ===================================== >> > >> > Jim Hu >> > >> > Associate Professor >> > >> > Dept. of Biochemistry and Biophysics >> > >> > 2128 TAMU >> > >> > Texas A&M Univ. 
>> > >> > College Station, TX 77843-2128 >> > >> > 979-862-4054 >> > >> > > > > > -- > Lincoln D. Stein > > Ontario Institute for Cancer Research > 101 College St., Suite 800 > Toronto, ON, Canada M5G0A3 > 416 673-8514 > Assistant: Stacey Quinn > > Cold Spring Harbor Laboratory > 1 Bungtown Road > Cold Spring Harbor, NY 11724 USA > (516) 367-8380 > Assistant: Sandra Michelsen > From cjfields at illinois.edu Tue Sep 9 14:49:13 2008 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 9 Sep 2008 13:49:13 -0500 Subject: [Bioperl-l] [Gmod-schema] Circular genomes in Chado/BioPerl In-Reply-To: References: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com> Message-ID: <8C6E5EBC-25B1-41E3-BB89-FD3C228A6B70@illinois.edu> Sent just a bit too early! On Sep 9, 2008, at 1:24 PM, Chris Fields wrote: > Is there any particular reason we don't treat this similarly to the > way BioPerl does, which is to simply treat the origin-overlapping > feature as a split location? GenBank treats this similarly.
For a > faux example, the bug I just fixed for bugzilla has one: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2579 > > An actual GenBank case is the Sulfolobus solfataricus genome > (NC_002754), and I'm sure Jim could come up with more. The only > caveat is whether we should represent this ... as a 'special case' for features overlapping the origin in a circular sequence. > As for multiple revolutions, I'm not sure the hand-wringing about > specifics is worth it unless we have explicit workable examples to > test against (preferably examples which would potentially pop up), > but Lincoln's proposal sounds fine. > > chris From Russell.Smithies at agresearch.co.nz Tue Sep 9 16:46:26 2008 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Wed, 10 Sep 2008 08:46:26 +1200 Subject: [Bioperl-l] [Gmod-schema] Circular genomes in Chado/BioPerl In-Reply-To: <6dce9a0b0809091052g1c398a84tfe8f89d1bf9132c8@mail.gmail.com> References: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com> <6dce9a0b0809091052g1c398a84tfe8f89d1bf9132c8@mail.gmail.com> Message-ID: Excuse my ignorance (I'm not a biologist), but is it biologically possible or likely for a gene or feature to wrap more than once around a genome? Has anyone got an example? Russell Smithies Bioinformatics Applications Developer Invermay Research Centre Puddle Alley, Mosgiel, New Zealand > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Lincoln Stein > Sent: Wednesday, 10 September 2008 5:53 a.m. > To: Aaron Mackey > Cc: GMOD Schema List; Jim Hu; Roy Welch; bioperl-l at bioperl.org; Mike Gribskov > Subject: Re: [Bioperl-l] [Gmod-schema] Circular genomes in Chado/BioPerl > > It seems to me that the proposed modulus syntax handles multiple > revolutions.
Consider a 100 bp genome (to make it simple) and a feature that > starts at 50, goes around twice, and ends at position 60: > > start = 50 > end = 260 > > length = end - start + 1 > revolutions = int (length/genome) > stop position = length % genome + 1 > > Lincoln > > On Mon, Sep 8, 2008 at 3:57 PM, Aaron Mackey wrote: > > > How can you handle features that may cross the origin more than once? > > The modulus, though simple, seems to be only half the solution. It > > also makes it difficult to place features in the genome "by eye" > > (having to do the modulus subtraction in my head), or in > > sorting/filtering operations. > > > > I have an alternative that I wondered if you considered: allow the > > start/end to have an additional "circular revolution" prefix: > > > > a typical range tuple like: 100 200 - > > is thus shorthand for: 0:100 0:200 - > > (i.e. both the 100 and 200 are in the same "revolution" around the genome) > > > > and is then distinguishable from an "around the genome + 100" feature of: > > 1:100 0:200 - > > > > Just an alternative to consider (if you haven't already). I'm not > > wedded to the syntax, but I wouldn't want to see new columns in GFF > > just for this. Essentially, what you want is some form of compound > > polar coordinates, it seems. > > > > -Aaron > > > > On Mon, Sep 8, 2008 at 2:44 PM, Jim Hu wrote: > > > In discussions with GMOD about Gbrowse, we've come up with a proposal for > > > handling circular genomes and features that cross the origin in such > > > genomes. This applies to lots of prokaryotic and viral genomes, and > > might > > > be valuable for some ways of representing terminally redundant linear > > > genomes. > > > 1) Keep the requirement that start < end > > > 2) allow end > parent feature length > > > 3) parent feature gets an is_circular boolean > > > 4) use modular arithmetic to calculate the real position of end on the > > > parent feature. 
> > > We'd like to do this in a way that will be consistent with Chado and > > BioPerl > > > representation of features as much as possible (realizing that there is > > the > > > usual interbase or not coordinate issue). What do people think? Lincoln > > is > > > on board for modifying the GFF3 spec. > > > Thanks! > > > Jim Hu > > > > > > ===================================== > > > > > > Jim Hu > > > > > > Associate Professor > > > > > > Dept. of Biochemistry and Biophysics > > > > > > 2128 TAMU > > > > > > Texas A&M Univ. > > > > > > College Station, TX 77843-2128 > > > > > > 979-862-4054 > > > > > > -- > Lincoln D.
Stein > > Ontario Institute for Cancer Research > 101 College St., Suite 800 > Toronto, ON, Canada M5G0A3 > 416 673-8514 > Assistant: Stacey Quinn > > Cold Spring Harbor Laboratory > 1 Bungtown Road > Cold Spring Harbor, NY 11724 USA > (516) 367-8380 > Assistant: Sandra Michelsen From cjm at berkeleybop.org Tue Sep 9 18:56:55 2008 From: cjm at berkeleybop.org (Chris Mungall) Date: Tue, 9 Sep 2008 15:56:55 -0700 Subject: [Bioperl-l] [Gmod-schema] Circular genomes in Chado/BioPerl In-Reply-To: <6dce9a0b0809091052g1c398a84tfe8f89d1bf9132c8@mail.gmail.com> References: <87B87004-A586-44A0-BB10-D4AA3FE9E669@tamu.edu> <24c96eca0809081257l16461b23uaefe8154ed038bea@mail.gmail.com> <6dce9a0b0809091052g1c398a84tfe8f89d1bf9132c8@mail.gmail.com> Message-ID: <18A0C3BE-ED04-4494-9231-B945158D6CE4@berkeleybop.org> I think I am happy with the modulo approach. I believe, though, that we first of all need a formal specification of genome interval semantics that is independent of any particular syntax or implementation.
This can be a fairly short specification - along the lines of what Lincoln has written below (although I would naturally prefer the normative version to be interbase - this doesn't preclude derived axioms in GFF coordinates). This spec should also define and standardize the terminology used: Lincoln draws a distinction between 'stop' and 'end'. I'm relatively happy with these terms - however, the choices we make need to become enshrined, otherwise we'll end up with confusion and mismatches between software and specification. One clarification: > revolutions = int (length/genome) This axiom is presumably conditional on the genome being circular, which will have to be indicated using a new flag, as Jim suggests, yep? So the context-independent axiom would be: > revolutions = IF src_is_circular THEN int (length/genome) ELSE 0 On Sep 9, 2008, at 10:52 AM, Lincoln Stein wrote: > It seems to me that the proposed modulus syntax handles multiple > revolutions. Consider a 100 bp genome (to make it simple) and a > feature that > starts at 50, goes around twice, and ends at position 60: > > start = 50 > end = 260 > > length = end - start + 1 > revolutions = int (length/genome) > stop position = length % genome + 1 > > Lincoln > > On Mon, Sep 8, 2008 at 3:57 PM, Aaron Mackey > wrote: > >> How can you handle features that may cross the origin more than once? >> The modulus, though simple, seems to be only half the solution. It >> also makes it difficult to place features in the genome "by eye" >> (having to do the modulus subtraction in my head), or in >> sorting/filtering operations. >> >> I have an alternative that I wondered if you considered: allow the >> start/end to have an additional "circular revolution" prefix: >> >> a typical range tuple like: 100 200 - >> is thus shorthand for: 0:100 0:200 - >> (i.e.
both the 100 and 200 are in the same "revolution" around the >> genome) >> >> and is then distinguishable from an "around the genome + 100" >> feature of: >> 1:100 0:200 - >> >> Just an alternative to consider (if you haven't already). I'm not >> wedded to the syntax, but I wouldn't want to see new columns in GFF >> just for this. Essentially, what you want is some form of compound >> polar coordinates, it seems. >> >> -Aaron >> >> On Mon, Sep 8, 2008 at 2:44 PM, Jim Hu wrote: >>> In discussions with GMOD about Gbrowse, we've come up with a >>> proposal for >>> handling circular genomes and features that cross the origin in such >>> genomes. This applies to lots of prokaryotic and viral genomes, and >> might >>> be valuable for some ways of representing terminally redundant >>> linear >>> genomes. >>> 1) Keep the requirement that start < end >>> 2) allow end > parent feature length >>> 3) parent feature gets an is_circular boolean >>> 4) use modular arithmetic to calculate the real position of end on >>> the >>> parent feature. >>> We'd like to do this in a way that will be consistent with Chado and >> BioPerl >>> representation of features as much as possible (realizing that >>> there is >> the >>> usual interbase or not coordinate issue). What do people think? >>> Lincoln >> is >>> on board for modifying the GFF3 spec. >>> Thanks! >>> Jim Hu >>> >>> ===================================== >>> >>> Jim Hu >>> >>> Associate Professor >>> >>> Dept. of Biochemistry and Biophysics >>> >>> 2128 TAMU >>> >>> Texas A&M Univ. 
>>> >>> College Station, TX 77843-2128 >>> >>> 979-862-4054 >>> >>> >>> ------------------------------------------------------------------------- >>> This SF.Net email is sponsored by the Moblin Your Move Developer's >> challenge >>> Build the coolest Linux based applications with Moblin SDK & win >>> great >>> prizes >>> Grand prize is a trip for two to an Open Source event anywhere in >>> the >> world >>> http://moblin-contest.org/redirect.php?banner_id=100&url=/ >>> _______________________________________________ >>> Gmod-schema mailing list >>> Gmod-schema at lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/gmod-schema >>> >>> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > > -- > Lincoln D. Stein > > Ontario Institute for Cancer Research > 101 College St., Suite 800 > Toronto, ON, Canada M5G0A3 > 416 673-8514 > Assistant: Stacey Quinn > > Cold Spring Harbor Laboratory > 1 Bungtown Road > Cold Spring Harbor, NY 11724 USA > (516) 367-8380 > Assistant: Sandra Michelsen > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From heikki at sanbi.ac.za Wed Sep 10 07:32:06 2008 From: heikki at sanbi.ac.za (Heikki Lehvaslaiho) Date: Wed, 10 Sep 2008 13:32:06 +0200 Subject: [Bioperl-l] phylogeny-trait association methods into BioPerl Message-ID: <200809101332.07137.heikki@sanbi.ac.za> FYI, I've been recently writing code to analyse phylogeny-trait associations. These traits are typically geographical location of the sequence but they can be any phenotypic characters associated with the sequences. This involves trees, i.e. Bio::Tree::Tree and Bio::Tree::Node objects and strings describing the traits. I've been using tags to store trait values within nodes. 
The tag methods are:

Bio::Tree::Node::add_tag_value
Bio::Tree::Node::get_all_tags
Bio::Tree::Node::get_tag_values
Bio::Tree::Node::has_tag
Bio::Tree::Node::remove_all_tags
Bio::Tree::Node::remove_tag

Question: Is there any particular reason why there is no
set_tag_value(scalar|@array) method?

I am getting tired of writing:
    $node->remove_tag($key);
    map {$node->add_tag_value($key, $_)} @values;
so I am going to implement that unless there are strong objections.

Otherwise it has been smooth sailing. I am going to add
Bio::Tree::TreeFunctions::is_binary() and start populating
Bio::Tree::Statistics soon with these methods:

ps() - Parsimony Score (PS) from Fitch 1971
ai() - Association index (AI) of Whang et al. 2001
mc() - Monophyletic Clade (MC) size statistics by Salemi et al. 2005
cherries() - number of leaf node pairs

If you have any comments, please feel free to post them here.

-Heikki

--
______ _/ _/_____________________________________________________
_/ _/
_/ _/ _/ Heikki Lehvaslaiho heikki at_sanbi _ac _za
_/_/_/_/_/ Senior Scientist skype: heikki_lehvaslaiho
_/ _/ _/ SANBI, South African National Bioinformatics Institute
_/ _/ _/ University of Western Cape, South Africa
_/ Phone: +27 21 959 2096 FAX: +27 21 959 2512
___ _/_/_/_/_/________________________________________________________

From hlapp at gmx.net Wed Sep 10 09:44:27 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 10 Sep 2008 07:44:27 -0600
Subject: [Bioperl-l] phylogeny-trait association methods into BioPerl
In-Reply-To: <200809101332.07137.heikki@sanbi.ac.za>
References: <200809101332.07137.heikki@sanbi.ac.za>
Message-ID: <0FA6C4A1-9D83-4850-BC0A-4A55ED528C65@gmx.net>

Sounds great Heikki!

Just FYI, there is a considerable amount of code and packages for
comparative analysis in R. You might want to look into the resources
linked to from the http://r-phylo.org page. There is also an R-SIG-Phylo
special interest group mailing list (should be linked from the
aforementioned site).
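
For readers following the set_tag_value question raised earlier in this thread, here is a minimal sketch of what such a method might look like. It is only an illustration: TinyNode is a hypothetical stand-in class that mimics add_tag_value, remove_tag and get_tag_values so the example runs without BioPerl. It is not Bio::Tree::Node, and the set_tag_value shown is an assumed shape, not the method as it may eventually be implemented.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a tagged node. It only mimics three of the
# tag methods named in the thread and is NOT Bio::Tree::Node itself.
package TinyNode;

sub new {
    my $class = shift;
    return bless { tags => {} }, $class;
}

# Append one value to the list stored under a tag.
sub add_tag_value {
    my ($self, $tag, $value) = @_;
    push @{ $self->{tags}{$tag} }, $value;
    return scalar @{ $self->{tags}{$tag} };
}

# Drop a tag and all of its values.
sub remove_tag {
    my ($self, $tag) = @_;
    delete $self->{tags}{$tag};
    return;
}

# Return all values stored under a tag (empty list if none).
sub get_tag_values {
    my ($self, $tag) = @_;
    return @{ $self->{tags}{$tag} || [] };
}

# Assumed convenience method: replace every value of a tag in one call,
# i.e. the remove_tag + add_tag_value pair wrapped up.
sub set_tag_value {
    my ($self, $tag, @values) = @_;
    $self->remove_tag($tag);
    $self->add_tag_value($tag, $_) for @values;
    return scalar @values;
}

package main;

my $node = TinyNode->new;
$node->add_tag_value(location => 'Finland');
$node->set_tag_value(location => 'South Africa', 'Kenya');
print join(',', $node->get_tag_values('location')), "\n";   # South Africa,Kenya
```

Accepting a list rather than a single scalar keeps the scalar and array cases from the question in one signature.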
-hilmar On Sep 10, 2008, at 5:32 AM, Heikki Lehvaslaiho wrote: > FYI, > > I've been recently writing code to analyse phylogeny-trait > associations. These > traits are typically geographical location of the sequence but they > can be any > phenotypic characters associated with the sequences. > > This involves trees, i.e. Bio::Tree::Tree and Bio::Tree::Node > objects and > strings describing the traits. I've been using tags to store trait > values > within nodes. The tag methods are: > > Bio::Tree::Node::add_tag_value > Bio::Tree::Node::get_all_tags > Bio::Tree::Node::get_tag_values > Bio::Tree::Node::has_tag > Bio::Tree::Node::remove_all_tags > Bio::Tree::Node::remove_tag > > Question: Is there any particular reason why there is no > set_tag_value(scalar|@array) method? > > I am getting tired of writing: > $node->remove_tag($key); > map {$node->add_tag_value($key)} @values ; > so I am going to implement that unless there is are strong objections. > > Otherwise it has been smooth sailing. I am going to add > Bio::Tree::TreeFunctions::is_binary() and start populating > Bio::Tree::Statistics soon with these methods: > > ps() - Parsimony Score (PS) from Fitch 1971 > ai() - Association index (AI) of Whang et al. 2001 > mc() - Monophyletic Clade (MC) size statistics by Salemi at al. 2005 > cherries() - number of leaf node pairs > > If you have any comments, please feel free to post them here. 
> > -Heikki > > -- > ______ _/ _/_____________________________________________________ > _/ _/ > _/ _/ _/ Heikki Lehvaslaiho heikki at_sanbi _ac _za > _/_/_/_/_/ Senior Scientist skype: heikki_lehvaslaiho > _/ _/ _/ SANBI, South African National Bioinformatics Institute > _/ _/ _/ University of Western Cape, South Africa > _/ Phone: +27 21 959 2096 FAX: +27 21 959 2512 > ___ _/_/_/_/_/________________________________________________________ > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From Caroline.Johnston at iop.kcl.ac.uk Wed Sep 10 11:43:26 2008 From: Caroline.Johnston at iop.kcl.ac.uk (Johnston, Caroline) Date: Wed, 10 Sep 2008 16:43:26 +0100 Subject: [Bioperl-l] seqs, seqfeatures, locations etc Message-ID: <5626ED9CB91C814197079FB9940312E506445F81@MAIL.bc.iop.kcl.ac.uk> Hello, I'm trying to get my head around the various classes for storing sequences, features and locations and was hoping someone could give me some implementation advice: I've got a Bio::EnsEMBL::Slice and I want to turn it into a Bio::Seq or SeqFeature object, with Bio::SeqFeature;:Gene::GeneStructure/Transcript/Exon info attached. I can create a Bio::Seq fine, but I also need to keep track of the chromosomal co-ordinates (chr, start, end, species, strand, genome release, database name) and I can't figure out how to store this in Bioperl. I was thinking that what I needed was some extension of a standard Bio::Seq to have genome-coordinate data attached and associated methods to translate the SeqFeature positions (relative to the Bio::Seq) to genome positions. 
I guess it's probably already possible to store this type of info in some collection of Bioperl objects, but between Bioperl and the EnsEMBL API I'm getting lost in perl modules. Can someone point me in the right direction? Thanks, Cass From bosborne11 at verizon.net Wed Sep 10 13:34:35 2008 From: bosborne11 at verizon.net (Brian Osborne) Date: Wed, 10 Sep 2008 13:34:35 -0400 Subject: [Bioperl-l] seqs, seqfeatures, locations etc In-Reply-To: <5626ED9CB91C814197079FB9940312E506445F81@MAIL.bc.iop.kcl.ac.uk> References: <5626ED9CB91C814197079FB9940312E506445F81@MAIL.bc.iop.kcl.ac.uk> Message-ID: Cass, There is a HOWTO about these Bioperl objects: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation I think it addresses your questions. Brian O. On Sep 10, 2008, at 11:43 AM, Johnston, Caroline wrote: > Hello, > > I'm trying to get my head around the various classes for storing > sequences, features and locations and was hoping someone could give > me some implementation advice: > > I've got a Bio::EnsEMBL::Slice and I want to turn it into a Bio::Seq > or SeqFeature object, with Bio::SeqFeature;:Gene::GeneStructure/ > Transcript/Exon info attached. I can create a Bio::Seq fine, but I > also need to keep track of the chromosomal co-ordinates (chr, start, > end, species, strand, genome release, database name) and I can't > figure out how to store this in Bioperl. I was thinking that what I > needed was some extension of a standard Bio::Seq to have genome- > coordinate data attached and associated methods to translate the > SeqFeature positions (relative to the Bio::Seq) to genome positions. > I guess it's probably already possible to store this type of info in > some collection of Bioperl objects, but between Bioperl and the > EnsEMBL API I'm getting lost in perl modules. Can someone point me > in the right direction? 
> > Thanks, > Cass > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jason at bioperl.org Wed Sep 10 13:44:25 2008 From: jason at bioperl.org (Jason Stajich) Date: Wed, 10 Sep 2008 10:44:25 -0700 Subject: [Bioperl-l] phylogeny-trait association methods into BioPerl In-Reply-To: <200809101332.07137.heikki@sanbi.ac.za> References: <200809101332.07137.heikki@sanbi.ac.za> Message-ID: <24F25525-0394-45BE-859F-A5735DE610DF@bioperl.org> Those are just take from Bio::SeqFeature::Generic tag manipulation methods so please feel free to add a better one - maybe can propagate that new method to Bio::SeqFeature::Generic as well? Would be nice to see those other methods add as well so am glad to see them. -jason On Sep 10, 2008, at 4:32 AM, Heikki Lehvaslaiho wrote: > FYI, > > I've been recently writing code to analyse phylogeny-trait > associations. These > traits are typically geographical location of the sequence but they > can be any > phenotypic characters associated with the sequences. > > This involves trees, i.e. Bio::Tree::Tree and Bio::Tree::Node > objects and > strings describing the traits. I've been using tags to store trait > values > within nodes. The tag methods are: > > Bio::Tree::Node::add_tag_value > Bio::Tree::Node::get_all_tags > Bio::Tree::Node::get_tag_values > Bio::Tree::Node::has_tag > Bio::Tree::Node::remove_all_tags > Bio::Tree::Node::remove_tag > > Question: Is there any particular reason why there is no > set_tag_value(scalar|@array) method? > > I am getting tired of writing: > $node->remove_tag($key); > map {$node->add_tag_value($key)} @values ; > so I am going to implement that unless there is are strong objections. > > Otherwise it has been smooth sailing. 
I am going to add > Bio::Tree::TreeFunctions::is_binary() and start populating > Bio::Tree::Statistics soon with these methods: > > ps() - Parsimony Score (PS) from Fitch 1971 > ai() - Association index (AI) of Whang et al. 2001 > mc() - Monophyletic Clade (MC) size statistics by Salemi at al. 2005 > cherries() - number of leaf node pairs > > If you have any comments, please feel free to post them here. > > -Heikki > > -- > ______ _/ _/_____________________________________________________ > _/ _/ > _/ _/ _/ Heikki Lehvaslaiho heikki at_sanbi _ac _za > _/_/_/_/_/ Senior Scientist skype: heikki_lehvaslaiho > _/ _/ _/ SANBI, South African National Bioinformatics Institute > _/ _/ _/ University of Western Cape, South Africa > _/ Phone: +27 21 959 2096 FAX: +27 21 959 2512 > ___ _/_/_/_/_/________________________________________________________ > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From philsf79 at gmail.com Thu Sep 11 04:40:26 2008 From: philsf79 at gmail.com (Felipe Figueiredo) Date: Thu, 11 Sep 2008 05:40:26 -0300 Subject: [Bioperl-l] difference in opening file from @ARGV and STDIN? 
Message-ID: <1221122426.8059.49.camel@localhost>

I'm not sure if this is related to bioperl (Bio::AlignIO) or if it's a
general perl error on my part, but I find it strange that the following
code gives different results depending on how I input the alignment:

--- test.pl ---
#!/usr/bin/perl

use warnings;
use strict;
use Bio::AlignIO;

my $file;

if (@ARGV) {
    $file = shift @ARGV;
} else {
    $file = "-";
}

my $align = Bio::AlignIO->new(-file=>$file)->next_aln;

printf "Sequences: %s\n",$align->no_sequences;
--- test.pl ---

If I run this using a file containing 4 sequences, the following
happens:

--- run tests ---
$ ./test.pl exemplo-alinhamento.fasta
Sequences: 4
$ ./test.pl < exemplo-alinhamento.fasta
Sequences: 3
$ cat exemplo-alinhamento.fasta | ./test.pl
Sequences: 3
--- run tests ---

The missing sequence is always the first one.

Am I missing something, is my code for reading stdin mistaken, or is it
a bug in Bio::AlignIO?

I'm using bioperl 1.5.2.102-1ubuntu1, in Ubuntu 8.04 Hardy.

best regards
FF

From David.Messina at sbc.su.se Thu Sep 11 07:21:51 2008
From: David.Messina at sbc.su.se (Dave Messina)
Date: Thu, 11 Sep 2008 13:21:51 +0200
Subject: [Bioperl-l] difference in opening file from @ARGV and STDIN?
In-Reply-To: <1221122426.8059.49.camel@localhost>
References: <1221122426.8059.49.camel@localhost>
Message-ID: <628aabb70809110421x3668606dx4a64a203ab0ff9d2@mail.gmail.com>

Hi Felipe,

Specifying STDIN via a '-' argument to the -file parameter is not valid.
While that is a convention with some UNIX tools, it's not, as far as I
know, something you should be able to count on.

In BioPerl, one can specify STDIN by passing the \*STDIN filehandle glob
to the -fh parameter (NOT to -file). In other words,

    my $align = Bio::AlignIO->new(-fh => \*STDIN)->next_aln;

That is a convention in BioPerl, so the -file and -fh parameters should
work the same way in AlignIO, SearchIO, SeqIO, etc.

Take a look at the beginners' HOWTO for some examples.
http://www.bioperl.org/wiki/HOWTO:Beginners Dave From bosborne11 at verizon.net Thu Sep 11 11:01:32 2008 From: bosborne11 at verizon.net (Brian Osborne) Date: Thu, 11 Sep 2008 11:01:32 -0400 Subject: [Bioperl-l] SeqHound Message-ID: <0E295726-33FE-4083-98A1-9035E37455BD@verizon.net> Raul, Good question about BIND, I don't know if the public version is up-to- date. For the latest protein-protein interaction data I look at public databases like IntAct. If you use bioperl-network you should be able to read IntAct data into a graph, then find the interactions that you're interested in in that graph. There are a few qualifications to this statement though, like do you have the "right" identifiers or names. So you're right, bioperl- network constructs graphs from the XML files but the interactions you want are in those graphs. Something like: my $graphio = Bio::Network::IO->new(-file => 'human.xml', -format => 'psi25'); my $graph = $graphio->next_network(); my $node = $graph->get_nodes_by_id('UniProt:P12345'); my @neighbors = $graph->neighbors($node); Brian O. On Sep 11, 2008, at 7:32 AM, Raul Mendez Giraldez wrote: > Hi Brian, > > Actually I realized later that SeqHound is a part of Bioperl itself, > and > that regarding BIND (at least the public database) is reachable trough > BIND SOAP protocol, that can be implemented in perl through the module > SOAP::Lite. I still don't know whether the public BIND database is out > of date, or which part of the BOND database it covers. > > Regarding the Bio::Network packages, at the Bioperl suite, I guess it > rather for representing protein - protein interaction graphs, isn't > it? > That could be interesting to me, but in a second step. I am more > concerned now in getting this protein - protein interaction data, > for a > set of proteins some biologists gave me. I don't know anything about > Cytoscape, normally I'm trying to exploit perl data management > capabilities. > > Thanks for the info. 
>
> Cheers,
>
> Raul
>

From MEC at stowers-institute.org Thu Sep 11 14:01:01 2008
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Thu, 11 Sep 2008 13:01:01 -0500
Subject: [Bioperl-l] difference in opening file from @ARGV and STDIN?
In-Reply-To: <628aabb70809110421x3668606dx4a64a203ab0ff9d2@mail.gmail.com>
References: <1221122426.8059.49.camel@localhost> <628aabb70809110421x3668606dx4a64a203ab0ff9d2@mail.gmail.com>
Message-ID:

Felipe and Dave,

I find that the following works generically for SeqIO and AlignIO (at
least)...

    #after processing all options using GetOpt,
    #any remaining options should name files to process...
    @ARGV = ('-') unless @ARGV;    # Default to standard input
    my %inopt;
    $inopt{-fh} ||= \*ARGV;
    my $AlignIO = Bio::AlignIO->new( %inopt )
        or die "calling Bio::AlignIO->new on %inopt" ;

--Malcolm

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Dave Messina
Sent: Thursday, September 11, 2008 6:22 AM
To: Felipe Figueiredo
Cc: bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] difference in opening file from @ARGV and STDIN?

Hi Felipe,

Specifying STDIN via a '-' argument to the -file parameter is not valid.
While that is a convention with some UNIX tools, it's not, as far as I
know, something you should be able to count on.

In BioPerl, one can specify STDIN by passing the \*STDIN filehandle glob
to the -fh parameter (NOT to -file). In other words,

    my $align = Bio::AlignIO->new(-fh => \*STDIN)->next_aln;

That is a convention in BioPerl, so the -file and -fh parameters should
work the same way in AlignIO, SearchIO, SeqIO, etc.

Take a look at the beginners' HOWTO for some examples.
http://www.bioperl.org/wiki/HOWTO:Beginners

Dave
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l

From David.Messina at sbc.su.se Thu Sep 11 14:47:04 2008
From: David.Messina at sbc.su.se (Dave Messina)
Date: Thu, 11 Sep 2008 20:47:04 +0200
Subject: [Bioperl-l] difference in opening file from @ARGV and STDIN?
In-Reply-To:
References: <1221122426.8059.49.camel@localhost> <628aabb70809110421x3668606dx4a64a203ab0ff9d2@mail.gmail.com>
Message-ID: <628aabb70809111147y7d99e1bdh414a4ab1037c0990@mail.gmail.com>

Thanks, Malcolm. So then, '-' as STDIN does work?

D

From MEC at stowers-institute.org Thu Sep 11 16:19:25 2008
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Thu, 11 Sep 2008 15:19:25 -0500
Subject: [Bioperl-l] difference in opening file from @ARGV and STDIN?
In-Reply-To: <628aabb70809111147y7d99e1bdh414a4ab1037c0990@mail.gmail.com>
References: <1221122426.8059.49.camel@localhost> <628aabb70809110421x3668606dx4a64a203ab0ff9d2@mail.gmail.com> <628aabb70809111147y7d99e1bdh414a4ab1037c0990@mail.gmail.com>
Message-ID:

Not exactly the way I would put it.

Look at the difference between the first command and the second in the
following transcript:

> echo -e ">asdf\natgc\n" | perl -MBio::SeqIO -e 'my $s = Bio::SeqIO->new(-format => qw{fasta}, -fh => \*ARGV); print $ARGV[0] . qq{ has } . $s->next_seq()->seq . qq{\n}' -- '-'
- has atgc

> echo -e ">asdf\natgc\n" | perl -MBio::SeqIO -e 'my $s = Bio::SeqIO->new(-format => qw{fasta}, -fh => \*ARGV); print $ARGV[0] . qq{ has } . $s->next_seq()->seq . qq{\n}' -- 'NoSuchFile'
Can't open NoSuchFile: No such file or directory at /home/mec/cvs/bioperl-live/Bio/Root/IO.pm line 458.
Can't call method "seq" on an undefined value at -e line 1.

The only difference is that @ARGV is the singleton list composed of '-'
in the first call, and is the singleton list composed of 'NoSuchFile' in
the second.
If you passed in a list of multiple files that actually do exist, it
should work fine.

It is really a matter of ARGV processing magic.

from http://perldoc.perl.org/perlop.html

    The null filehandle <> is special: it can be used to emulate the
    behavior of sed and awk. Input from <> comes either from standard
    input, or from each file listed on the command line. Here's how it
    works: the first time <> is evaluated, the @ARGV array is checked,
    and if it is empty, $ARGV[0] is set to "-", which when opened gives
    you standard input. The @ARGV array is then processed as a list of
    filenames. The loop

        while (<>) {
            ...    # code for each line
        }

    is equivalent to the following Perl-like pseudo code:

        unshift(@ARGV, '-') unless @ARGV;
        while ($ARGV = shift) {
            open(ARGV, $ARGV);
            while (<ARGV>) {
                ...    # code for each line
            }
        }

Malcolm Cook
Database Applications Manager - Bioinformatics
Stowers Institute for Medical Research - Kansas City, Missouri

________________________________

From: dave at davemessina.com [mailto:dave at davemessina.com] On Behalf Of Dave Messina
Sent: Thursday, September 11, 2008 1:47 PM
To: Cook, Malcolm
Cc: Felipe Figueiredo; bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] difference in opening file from @ARGV and STDIN?

Thanks, Malcolm. So then, '-' as STDIN does work?

D

From David.Messina at sbc.su.se Thu Sep 11 17:15:44 2008
From: David.Messina at sbc.su.se (Dave Messina)
Date: Thu, 11 Sep 2008 23:15:44 +0200
Subject: [Bioperl-l] difference in opening file from @ARGV and STDIN?
In-Reply-To:
References: <1221122426.8059.49.camel@localhost> <628aabb70809110421x3668606dx4a64a203ab0ff9d2@mail.gmail.com> <628aabb70809111147y7d99e1bdh414a4ab1037c0990@mail.gmail.com>
Message-ID: <628aabb70809111415w27044d57jd3121adb40900e45@mail.gmail.com>

Cool, thanks for the explanation Malcolm!
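
The ARGV behaviour discussed in this thread can be tried without BioPerl. The following is a minimal sketch, not code from any of the posters: it writes a two-line FASTA record to a temporary file (the temporary file and the @lines array are assumptions made purely for illustration), names that file in @ARGV, and reads it back through the magic ARGV filehandle.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny FASTA record to a temporary file so the sketch is
# self-contained (the file stands in for a real alignment file).
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} ">asdf\natgc\n";
close $fh;

# Pretend the script was invoked as: script.pl <path>.
# Had @ARGV been left empty, reading ARGV would fall back to standard
# input instead.
@ARGV = ($path);

my @lines;
while (my $line = <ARGV>) {
    chomp $line;
    # $ARGV holds the name of the file currently being read.
    push @lines, join(': ', $ARGV, $line);
}

print scalar(@lines), " lines read\n";   # 2 lines read
```

Because the file names are taken from @ARGV at read time, the same loop handles one file, several files, or piped input without any branching in the script.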
At the risk of belaboring this point and your patience, one thing still
confuses me, though:

> and if [@ARGV] is empty, $ARGV[0] is set to "-"

If $ARGV[0] is set (by Perl's ARGV processing magic) to '-', then why in
your earlier example do you manually set $ARGV[0] to '-' instead of
simply leaving @ARGV empty?

    @ARGV = ('-') unless @ARGV;

If I run your example and omit '-' as an argument, it still works:

> echo -e ">asdf\natgc\n" | perl -MBio::SeqIO -e 'my $s = Bio::SeqIO->new(-format => qw{fasta}, -fh => \*ARGV); print $ARGV[0] . qq{ has } . $s->next_seq()->seq . qq{\n}'
has atgc

Dave

From acouperthwaite at gmail.com Fri Sep 12 17:17:53 2008
From: acouperthwaite at gmail.com (Andrew Couperthwaite)
Date: Fri, 12 Sep 2008 15:17:53 -0600
Subject: [Bioperl-l] Bio::DB::Query::GenBank question
Message-ID:

Hi,

I'm having difficulty using the Bio::DB::Query::GenBank module.
The sample script on the page
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Query/GenBank.html
doesn't seem to work.

I'm trying to use this and the Bio::DB::GenBank module to find and
download a set of sequences from GenBank...
I'm rather new to bioperl, can anyone point me in the right direction?

Thanks,
-Andrew

From jason at bioperl.org Sat Sep 13 02:42:28 2008
From: jason at bioperl.org (Jason Stajich)
Date: Fri, 12 Sep 2008 23:42:28 -0700
Subject: [Bioperl-l] Bio::DB::Query::GenBank question
In-Reply-To:
References:
Message-ID: <2CCFFD09-B7B0-4172-A10F-224875A6E968@bioperl.org>

Hi Andrew -

a) what is the exact script code you are trying, what are the error
messages?
b) what version of bioperl? The first thing we'll suggest is: did you
get the latest code from SVN yet or a nightly build?

-jason

On Sep 12, 2008, at 2:17 PM, Andrew Couperthwaite wrote:

> Hi,
>
> I'm having difficulty using the Bio::DB::Query::GenBank module.
> The sample script on the page http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Query/GenBank.html
> doesn't seem to work.
> > I'm trying to use this and the Bio::DB::GenBank module to find and > download a set of sequences from GenBank... > I'm rather new to bioperl, can anyone point me in the right direction? > > Thanks, > -Andrew > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From cjfields at illinois.edu Mon Sep 15 00:13:57 2008 From: cjfields at illinois.edu (Chris Fields) Date: Sun, 14 Sep 2008 23:13:57 -0500 Subject: [Bioperl-l] significant bug with Bio::LocatableSeq Message-ID: <98938314-4AF1-44C2-8867-B383A44E542E@illinois.edu> While debugging some tests in bioperl, I noticed a fairly significant issue with Bio::LocatableSeq which is probably due to some inconsistencies with start/end coordinates. For some reason this started popping up with error messages recently when running AlignIO tests on bioperl-live (i.e. something changed which exposed the bug, maybe the verbosity level): 1..295 ok 1 - use Bio::AlignIO; ok 2 - The object isa Bio::AlignIO ok 3 - The object isa Bio::Align::AlignI ok 4 ok 5 ok 6 - The object isa Bio::AlignIO --------------------- WARNING --------------------- MSG: In sequence 02 residue count gives end value 399. Overriding value [355] with value 399 for Bio::LocatableSeq::end(). STACK Bio::LocatableSeq::end /Users/cjfields/bioperl/bioperl-live/blib/ lib/Bio/LocatableSeq.pm:150 STACK Bio::LocatableSeq::new /Users/cjfields/bioperl/bioperl-live/blib/ lib/Bio/LocatableSeq.pm:103 STACK Bio::AlignIO::arp::next_aln /Users/cjfields/bioperl/bioperl-live/ blib/lib/Bio/AlignIO/arp.pm:106 STACK toplevel t/AlignIO.t:34 --------------------------------------------------- .... followed by tons of similar errors. The problem is, no change is ever made. 
This is demonstrated by the following:

-----------------------------
#!/usr/bin/perl -w

use strict;
use warnings;
use Bio::LocatableSeq;

my $seq = Bio::LocatableSeq->new(
    -id     => 'foo',
    -seq    => 'A----TGCGCTTCCTCGCTTCCG',
    -start  => 10,
    -end    => 100,    # intentionally bad
    -strand => -1);
print $seq->end."\n";
-----------------------------

Results:

--------------------- WARNING ---------------------
MSG: In sequence foo residue count gives end value 28.
Overriding value [100] with value 28 for Bio::LocatableSeq::end().
STACK Bio::LocatableSeq::end /Users/cjfields/bioperl/bioperl-live/Bio/LocatableSeq.pm:150
STACK Bio::LocatableSeq::new /Users/cjfields/bioperl/bioperl-live/Bio/LocatableSeq.pm:103
STACK toplevel seq.pl:7
---------------------------------------------------
100

The warning pops up when -end is passed to LocatableSeq::new and
indicates that the passed coordinate doesn't match up with the one
calculated from the sequence (minus gaps). I've isolated the bug down to
the end() method and am working on fixing it. Note that this affects
LocatableSeq::length as well.

This appears to affect arp, nexus, stockholm, and a few other AlignIO
parsers as well.
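
For reference, the end value of 28 reported in the warning follows from the start coordinate and the number of non-gap residues. The sketch below only reproduces that arithmetic; expected_end is a hypothetical helper written for illustration and is not part of Bio::LocatableSeq, and treating '-' and '.' as the gap characters is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::LocatableSeq): derive the end
# coordinate implied by a start position and a gapped sequence string.
sub expected_end {
    my ($start, $gapped_seq) = @_;
    # Count residues only; '-' and '.' are assumed to be gap characters.
    (my $residues = $gapped_seq) =~ s/[-.]//g;
    return $start + length($residues) - 1;
}

# Same values as in the example script: 19 residues starting at 10.
my $end = expected_end(10, 'A----TGCGCTTCCTCGCTTCCG');
print "expected end: $end\n";   # expected end: 28
```

The sequence has 23 columns, 4 of them gaps, so 19 residues starting at position 10 end at 10 + 19 - 1 = 28, which is the value the warning says should replace the supplied 100.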
chris

From acouperthwaite at gmail.com Mon Sep 15 15:05:13 2008
From: acouperthwaite at gmail.com (Andrew Couperthwaite)
Date: Mon, 15 Sep 2008 13:05:13 -0600
Subject: [Bioperl-l] Bio::DB::Query::GenBank question
In-Reply-To: <2CCFFD09-B7B0-4172-A10F-224875A6E968@bioperl.org>
References: <2CCFFD09-B7B0-4172-A10F-224875A6E968@bioperl.org>
Message-ID: <2BA47AC9-5A04-460D-BA56-64B152AFFF16@gmail.com>

The code I'm starting with is this:

=====
use Bio::DB::Query::GenBank;
use Bio::DB::GenBank;

my $query_string = 'Oryza[Organism] AND EST[Keyword]';
my $query = Bio::DB::Query::GenBank->new(-query => 'Oryza[Organism] AND EST[Keyword]',
                                         -db=>'nucleotide');
my $count = $query->count;
my @ids = $query->ids;

# get a genbank database handle
my $gb = Bio::DB::GenBank->new();
my $stream = $gb->get_Stream_by_query($query);
while (my $seq = $stream->next_seq) {
    # do something with the sequence object
    print "hello";
}
=====

It doesn't produce any error messages, it simply doesn't enter the while
loop. It seems as though it isn't getting any