From chapmanb at 50mail.com Fri May 1 08:11:25 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 1 May 2009 08:11:25 -0400 Subject: [Biopython-dev] MUMmer In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> Message-ID: <20090501121125.GD50777@sobchak.mgh.harvard.edu> Marcin; > I guess I should start with a nice 'hi' to everybody, now that I am > sending my first message to this group. So: Hi, Everybody! Welcome. We are happy to have you. > Now, that we have the formality out of the way, I will get to the point. > Recently, I have written some Python code for parsing and processing the > output of MUMmer tool (http://mummer.sourceforge.net/). More > specifically, the code I have manages invocations and handles outputs of > the nucmer pipeline (alignment of multiple closely related nucleotide > sequences) and of mummer itself (short exact matches). Obviously, the > results are ultimately rendered as pairs of biopython's Seq objects. This is great -- we don't have support for MUMmer alignments so this is very welcome. > I use this stuff only myself, in work on bacterial genomes, but I would > be more than willing to contribute it to the project. It may be rough > around the edges at the moment, but I think I could easily give it the > necessary polish if there is interest in having it included. As Bartek mentioned, the first step is to organize the code you have and start it as a branch on GitHub. Being able to see the code will help us make specific suggestions. Generally, based on what you've written it sounds like this will fit into the alignment interfaces. Peter and Cymon have been working on organizing this. 
Support for command lines and running programs lives in:

http://github.com/biopython/biopython/tree/master/Bio/Align/Applications

Parsing output and returning alignment objects is organized in the
AlignIO module:

http://github.com/biopython/biopython/tree/master/Bio/AlignIO
http://www.biopython.org/wiki/AlignIO

Tests are an important part of the submission process and many
examples are found here:

http://github.com/biopython/biopython/tree/master/Tests

test_Clustalw.py is an example of a print and compare style test,
and test_Mafft_tool.py is a unittest style test. We are more
concerned with good testing coverage than how exactly the tests
get written.

We can definitely help with more specific feedback but hopefully
this gives you a general idea to get started. Looking forward to
seeing the code,

Brad

From chapmanb at 50mail.com Fri May 1 08:28:06 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 1 May 2009 08:28:06 -0400
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com>
References: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com>
Message-ID: <20090501122806.GE50777@sobchak.mgh.harvard.edu>

Eric;
Thanks for summarizing the issues. I know Peter is taking a few well
deserved days off but I suspect he will have some thoughts when he
returns. We'd love to hear the experience of others who have used
different python XML parsers.

My lean is towards ElementTree for reasons of code clarity. SAX
parsers require a lot of boilerplate style code. They also can be
tricky with nested elements; I always find myself using a lot of "if
in_tag; else if in_tag" style code. ElementTree eliminates a lot of
these issues which should result in easier to maintain code.

Brad

> I'm writing a parser for the PhyloXML format for Google Summer of Code this
> year, and as the name would imply, it requires parsing some large XML files.
> The existing modules in Biopython for parsing XML formats seem to use
> xml.sax in the standard library. In Python 2.5, a faster and more Pythonic
> parser was added to the standard lib: ElementTree (xml.etree), in
> pure-Python and C-enhanced flavors. How do you feel about each of these
> libraries as the basis for a new Biopython module?
>
> Here are some interesting benchmarks:
> http://effbot.org/zone/celementtree.htm#benchmarks
>
> The ElementTree library is also available as a standalone package,
> compatible back to Python 2.1, and the lxml package also offers an
> independent implementation. So maintaining compatibility with Python 2.4
> would require the availability of one of these third-party packages, and my
> code would try each of these imports in order:
>
> from xml.etree import cElementTree as ElementTree
> from xml.etree import ElementTree
> # Separate lxml package
> from lxml.etree import ElementTree
> # Standalone elementtree package
> import cElementTree as ElementTree
> from elementtree import ElementTree
>
> Then one day, when Python 2.4 is no longer supported, only the first two
> lines would be needed. (The second line is for sites that disable C
> extensions, like Google App Engine, or alternate Python implementations like
> Jython.)
>
> Another option is xml.parsers.expat, but just Googling around, it appears
> that the Python zeitgeist is strongly in favor of xml.etree for new code.
>
> Thoughts?
> > Thanks, > Eric > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From marcin.swiatek at mail.mcgill.ca Fri May 1 14:17:14 2009 From: marcin.swiatek at mail.mcgill.ca (Marcin Swiatek) Date: Fri, 1 May 2009 14:17:14 -0400 Subject: [Biopython-dev] MUMmer In-Reply-To: <8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com> References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> <8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com> Message-ID: <176A06E658ED0745965C072C5F2C116A037F084C@EXCHANGE2VS2.campus.mcgill.ca> Bartek, Brad, Thank you for the suggestions. I will set myself up as proposed and see what I can do to align my code with local customs and traditions. If questions arise, I will post again. As for the use of alignment object, I have actually chosen to represent 'candidate' matches by my own simplistic class. Nucmer, the way I use it, generates lots of spurious matches, which I always need to somehow filter. Thus, it seemed perfectly reasonable at the time to create the proper representation of alignment later on, in a separate function call. Following your suggestion I will probably change it to return an alignment object, rather than a pair of sequences. But details are best discussed once the code is available, so I think we will return to this matter later. Regards, Marcin -----Original Message----- From: barwil at gmail.com [mailto:barwil at gmail.com] On Behalf Of Bartek Wilczynski Sent: Thursday, April 30, 2009 12:51 PM To: Marcin Swiatek Cc: biopython-dev at biopython.org Subject: Re: [Biopython-dev] MUMmer Hi Marcin, On Thu, Apr 30, 2009 at 5:23 PM, Marcin Swiatek wrote: > Hello, > > > > I use this stuff only myself, in work on bacterial genomes, but I would > be more than willing to contribute it to the project. 
It may be rough
> around the edges at the moment, but I think I could easily give it the
> necessary polish if there is interest in having it included.
>

Contributions are always welcome

>
>
> Should that be the case, could one of the project leads point me in the
> right direction, please? How should I go about the submission?
>
>

I don't think I qualify as a lead, but nonetheless I think I can help here.

I think that the best way to submit your code currently is to create a
branch (fork) of biopython on github and submit your changes there and
then notify people on biopython-dev that there is new code to review.
You can also submit an enhancement bug to bugzilla.

There are a couple of wiki pages which might be of interest to you:
- http://biopython.org/wiki/Contributing
- http://biopython.org/wiki/GitUsage

If you have any questions or problems during the process, ask on the list.

As for the code, I'm not sure, but maybe instead of returning a pair
of sequences, an alignment object might be a better choice? You might
want to also check out some recent code on application wrappers:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html

cheers
Bartek

From bugzilla-daemon at portal.open-bio.org Fri May 1 14:16:57 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 1 May 2009 14:16:57 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200905011816.n41IGvXO012709@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820

------- Comment #8 from eric.talevich at gmail.com 2009-05-01 14:16 EST -------
(In reply to comment #7)
> (In reply to comment #2)
> > Python 2.6 includes a context manager that makes all these problems
> > *completely* go away, by catching all of the warnings raised within a
> > context and optionally storing them as a list of warning objects that
> > can be inspected.
>
> That sounds much better :)
>
> > Would you be interested in having a unit test that does a more thorough
> > check of the warnings system, but only runs on Py2.6? I'm guessing no,
> > but hey, worth a shot.
>
> Yes - other than using the old print-and-compare test, this seems worth doing
> in order to actually test the warnings we expect are being issued. It could be
> a whole new file, test_PDB_warnings.py which required Python 2.6+, but as it's
> just one or two tests, maybe just use conditional method(s) within the
> test_PDB_unit.py file.
>
> Peter
>

I have something that works on both Py2.5 and Py2.6 now:
http://github.com/etal/biopython/tree/pdbtidy

I added a new file called _PDB_extra.py which test_PDB_unit.py imports if an
attribute called 'catch_warnings' is available in the current warnings module.
If so, the method test_warnings is added to the class, otherwise nothing
happens. So Py2.6 runs 9 tests in test_PDB_unit.py, while Py2.5 only runs 8.

This seemed easier than creating a whole separate unittest suite for one tricky
test, but I defer to you on the organization and naming. I think I'll need to
do a similar separation of tests for PhyloXML, so I'd like to have a consistent
pattern to follow here.

Also, apparently tests are run in alphabetical order, and Exposure was jumping
ahead of PDBExceptionTest. I renamed PDBExceptionTest to ExceptionTest to
restore the natural order of things and stop setting off the warnings
prematurely. Maybe test suites with multiple TestCase classes should be
arranged alphabetically in the code to avoid confusion in the future.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
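The conditional-test pattern described in the comment above can be sketched roughly as follows. This is a minimal illustration and not the code on the pdbtidy branch: `_noisy_parse` and its warning text are hypothetical stand-ins for a PDB parsing call, and the extra method is attached directly here rather than imported from a helper module such as _PDB_extra.py.

```python
import unittest
import warnings


def _noisy_parse():
    """Hypothetical stand-in for a parser call that issues a warning."""
    warnings.warn("illustrative parser warning", UserWarning)
    return "structure"


class ExceptionTest(unittest.TestCase):
    """Tests that can run on every supported Python version."""

    def test_result(self):
        # The warning itself is not inspected in this test.
        self.assertEqual(_noisy_parse(), "structure")


def _test_warnings(self):
    """Check that the expected warning is issued (needs catch_warnings)."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure no existing filter hides the warning.
        warnings.simplefilter("always")
        _noisy_parse()
    self.assertEqual(len(caught), 1)
    self.assertTrue(issubclass(caught[0].category, UserWarning))


# Attach the extra test only when the running Python provides the
# context manager (it was added in Python 2.6).
if hasattr(warnings, "catch_warnings"):
    ExceptionTest.test_warnings = _test_warnings
```

Loading `ExceptionTest` with the usual unittest loader then picks up one or two test methods depending on the interpreter, which matches the "9 tests versus 8" behaviour described in the comment.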
From bugzilla-daemon at portal.open-bio.org Mon May 4 06:57:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 06:57:33 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041057.n44AvXil006684@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1288 is|0 |1 obsolete| | ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 06:57 EST ------- Created an attachment (id=1289) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1289&action=view) Patch to add keyword arguments and properties to command line wrappers Brad likes the idea, and as the Bio.Application module owner that's good :) http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005963.html This patch makes a very slight difference to reduce the changes needed to old code (i.e. in the __init__ method use self.parameters = [...] as before) with the bonus that the base class and subclasses have the same __init__ signature (argument list). This patch also now covers Bio.Align.Applications, Bio.Motif.Applications and Bio.AlignAce.Applications as well as Bio.Emboss.Applications (i.e. all affected files). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From p.j.a.cock at googlemail.com Mon May 4 08:02:59 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 May 2009 13:02:59 +0100
Subject: [Biopython-dev] MUMmer
In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
Message-ID: <320fb6e00905040502y4785a0f9t4475ab0868a791c@mail.gmail.com>

On Thu, Apr 30, 2009 at 4:23 PM, Marcin Swiatek wrote:
> Hello,
>
> I guess I should start with a nice 'hi' to everybody, now that I am
> sending my first message to this group. So: Hi, Everybody!

Hi!

> Now, that we have the formality out of the way, I will get to the point.
> Recently, I have written some Python code for parsing and processing the
> output of MUMmer tool (http://mummer.sourceforge.net/). More
> specifically, the code I have manages invocations and handles outputs of
> the nucmer pipeline (alignment of multiple closely related nucleotide
> sequences) and of mummer itself (short exact matches). Obviously, the
> results are ultimately rendered as pairs of biopython's Seq objects.
>
> I use this stuff only myself, in work on bacterial genomes, but I would
> be more than willing to contribute it to the project. It may be rough
> around the edges at the moment, but I think I could easily give it the
> necessary polish if there is interest in having it included.

Great! I assume you're OK with our licence, and there are no problems
from your employer/University with a contribution like this?

> Should that be the case, could one of the project leads point me in the
> right direction, please? How should I go about the submission?

In terms of showing us the code, how do you feel about trying out
github (see Bartek's email)? Alternatively file an enhancement bug on
our bugzilla and upload your current python file (or a zip file if this is
split up into several modules).
From your description above it sounds like you have two main lumps of
code: a pairwise alignment parser, and some command line tool wrappers.

Brad and Bartek have already mentioned returning Alignment objects,
that would let us integrate MUMmer as an input format for Bio.AlignIO,
http://biopython.org/wiki/AlignIO

It may be helpful to have a look at how we parse FASTA output into
pairwise alignments, and also the EMBOSS "pairs" files from needle
and water.

Although (as Brad mentioned), this is currently undergoing a little flux,
for the command line wrappers I'd like this to use our Bio.Application
framework to represent the command line object, giving a string the
user can then invoke as they prefer. Having the MUMmer wrapper under
Bio.Align.Applications seems sensible at this point.

If you have been lurking on the dev mailing list for a while, these
topics may be familiar already. If not, have a look over the last
month or so in the archives here:
http://lists.open-bio.org/pipermail/biopython-dev/

Thanks,

Peter

From p.j.a.cock at googlemail.com Mon May 4 08:15:04 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 May 2009 13:15:04 +0100
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <20090501122806.GE50777@sobchak.mgh.harvard.edu>
References: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com> <20090501122806.GE50777@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>

On Fri, May 1, 2009 at 1:28 PM, Brad Chapman wrote:
> Eric;
> Thanks for summarizing the issues. I know Peter is taking a few well
> deserved days off but I suspect he will have some thoughts when he
> returns. We'd love to hear the experience of others who have used
> different python XML parsers.

I would be interested to hear Michiel's views on this, as he knows
more about the specifics of the existing XML parsers in Biopython
(e.g. Bio.Entrez).
> My lean is towards ElementTree for reasons of code clarity. SAX
> parsers require a lot of boilerplate style code. They also can be
> tricky with nested elements; I always find myself using a lot of "if
> in_tag; else if in_tag" style code. ElementTree eliminates a lot of
> these issues which should result in easier to maintain code.

We have been trying to avoid external library dependencies where
possible (moving away from Martel for parsing has really helped here).
Given ElementTree and cElementTree are included with Python 2.5+,
this is only an issue for Biopython running on Python 2.4.

Both ElementTree and cElementTree are available as separate
downloads (with Windows installers). I think under their licence we
could even bundle it with Biopython if need be.

So, while it is a shame ElementTree isn't part of Python 2.4, if it is
the best technical solution, that shouldn't stop us from using it. Note
we should ONLY use those core features which are included with
Python 2.5+ itself.

Peter

P.S. I wonder if our BLAST XML parser would get a big speed
boost if we switched it to ElementTree instead of xml.sax?
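For reference, the "try each import in order" idea from Eric's original post could be written along these lines. This is only a sketch: lxml is left out for brevity, and the small XML string with its `phylogeny`, `clade` and `name` elements is made up for the demonstration rather than taken from a real PhyloXML file.

```python
try:
    # C-accelerated module bundled with Python 2.5 and later. It was
    # removed again in Python 3.9, where the plain module is already
    # accelerated, so this import fails there and we fall through.
    from xml.etree import cElementTree as ElementTree
except ImportError:
    try:
        # Pure-Python module bundled with Python 2.5 and later.
        from xml.etree import ElementTree
    except ImportError:
        try:
            # Standalone C package, for Python 2.4.
            import cElementTree as ElementTree
        except ImportError:
            # Standalone pure-Python package, for Python 2.4.
            from elementtree import ElementTree

# Whichever import succeeded, only the shared core API is used below.
root = ElementTree.fromstring(
    "<phylogeny><clade><name>A</name></clade></phylogeny>"
)
names = [node.text for node in root.findall("clade/name")]
```

Sticking to calls such as `fromstring` and `findall`, which every one of these flavours provides, is what keeps the rest of a module independent of which import was picked.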
From bugzilla-daemon at portal.open-bio.org Mon May 4 09:47:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 09:47:25 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041347.n44DlPQD018238@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1289 is|0 |1 obsolete| | ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 09:47 EST ------- Created an attachment (id=1290) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1290&action=view) Patch to add keyword arguments, properties and __repr__ to command line wrappers Extended to include __repr__ support (using the new keyword arguments support). Note that the Muscle wrapper will need an alternative python valid identifier for the -in argument, e.g. "input", because we can't use just "in" as a property or keyword argument. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
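The combination of keyword arguments, properties, `__repr__` and a Python-safe alias for `-in` might look roughly like this. It is a toy sketch under stated assumptions and not the patch attached to the bug: `MuscleSketch`, `_make_property` and the short option table are hypothetical names invented here, and the real Bio.Application code is organized differently.

```python
def _make_property(name):
    """Build a property that stores its value in self._values[name]."""

    def getter(self):
        return self._values.get(name)

    def setter(self, value):
        self._values[name] = value

    def deleter(self):
        # "Deleting" the property clears the parameter.
        self._values.pop(name, None)

    return property(getter, setter, deleter)


class MuscleSketch(object):
    """Toy command line wrapper (hypothetical, for illustration only)."""

    # (Python name, command line flag). "input" is used for "-in"
    # because "in" is a reserved word in Python.
    _flags = [("input", "-in"), ("out", "-out"), ("maxiters", "-maxiters")]

    def __init__(self, cmd="muscle", **kwargs):
        self.cmd = cmd
        self._values = {}
        known = dict(self._flags)
        for name, value in kwargs.items():
            if name not in known:
                raise ValueError("Unknown option: %s" % name)
            setattr(self, name, value)

    def __repr__(self):
        parts = ["cmd=%r" % self.cmd]
        for name, _flag in self._flags:
            if name in self._values:
                parts.append("%s=%r" % (name, self._values[name]))
        return "%s(%s)" % (self.__class__.__name__, ", ".join(parts))

    def __str__(self):
        parts = [self.cmd]
        for name, flag in self._flags:
            if name in self._values:
                parts.extend([flag, str(self._values[name])])
        return " ".join(parts)


# Attach one property per option name to the class.
for _name, _flag in MuscleSketch._flags:
    setattr(MuscleSketch, _name, _make_property(_name))

cline = MuscleSketch(input="demo.fasta", out="demo.aln")
cline.maxiters = 2
del cline.out
```

After these three steps `str(cline)` gives the command line string with `-in` and `-maxiters` only, while `repr(cline)` shows the same settings under their Python names.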
From bugzilla-daemon at portal.open-bio.org Mon May 4 10:07:57 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 10:07:57 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041407.n44E7vI9020041@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1290 is|0 |1 obsolete| | ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 10:07 EST ------- Created an attachment (id=1291) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1291&action=view) Patch to add keyword arguments, properties and __repr__ to command line wrappers As in previous patch but with support for clearing parameters by "deleting" the property, and some basic doctests in Bio.Application. Still need to co-ordinate with Cymon to give the Muscle wrapper a valid python identifier as an alias for the -in argument, e.g. "input", because we can't use just "in" as a property or keyword argument. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon May 4 10:48:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 4 May 2009 15:48:53 +0100 Subject: [Biopython-dev] Properties in Bio.Application interface? 
In-Reply-To: <20090430120532.GA50777@sobchak.mgh.harvard.edu> References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com> <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com> <320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com> <20090430120532.GA50777@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00905040748w7a0b940aub82220b9c78e7dc3@mail.gmail.com> On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman wrote: > I love what you are doing here. The keywords and properties make > it much more Pythonic; the old way reeks of Java-style get/sets. My > vote is to put them both in. Cool - I was hoping people would agree it is more pythonic. I have some follow up thoughts, or points for discussion ... Peter From biopython at maubp.freeserve.co.uk Mon May 4 10:53:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 4 May 2009 15:53:37 +0100 Subject: [Biopython-dev] Properties names in command line wrappers Message-ID: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> On Mon, May 4, 2009 at 3:48 PM, Peter wrote: > On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman wrote: >> I love what you are doing here. The keywords and properties make >> it much more Pythonic; the old way reeks of Java-style get/sets. My >> vote is to put them both in. > > Cool - I was hoping people would agree it is more pythonic. > > I have some follow up thoughts, or points for discussion ... > I updated the patch on Bug 2822 to cover all the Bio.Application command line wrapper subclasses, and included __repr__ support. However, that has raised a real example of a parameter where the current "human readable" name is not a valid python identifier ("in", for "-in" in Muscle). I think the pragmatic solution is to add a sensible alternative which we can use for the property and keyword argument name (e.g. "input" in this case) while in general keeping these names as close as possible to the actual parameter name as used at the command line. 
On the other hand, some might argue for giving all the options meaningful names. The (hardly used) existing blastall wrapper in Bio/Blast/Applications.py gives the "-a" argument a human readable name of "nprocessors", and "-A" gets "window_size". With the old set_parameter call either alias could be used. However, with a python property we need to pick one as a preferred name - and I'm not 100% sure being helpful and using "nprocessors" (e.g. cline.nprocessors=4) is actually better than using the actual argument name (e.g. cline.a = 4). My instinct is that these are low level wrappers, which don't try to second guess the user. To take full advantage of any command line tool you will need to read the tool's documentation to know what the arguments are - and having Biopython making up its own aliases just makes things more complicated. Therefore I think the property names in the command line wrapper objects should be as close as possible to the actual command line arguments. In this case, for blastall use "a" for number of processors and "A" for window size. However, I see the existing "helper functions" in Bio/Blast/NCBIStandalone.py as a higher level wrapper, which tries to insulate the user from the precise details of the command line string, and here using an argument name "nprocessors" makes more sense (although again, it differs from the actual command line making cross referencing to the NCBI documentation more difficult). What are your thoughts Brad? Peter From biopython at maubp.freeserve.co.uk Mon May 4 11:03:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 4 May 2009 16:03:17 +0100 Subject: [Biopython-dev] Switches in the Bio.Application interface Message-ID: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com> On Mon, May 4, 2009 at 3:48 PM, Peter wrote: > On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman wrote: >> I love what you are doing here. 
The keywords and properties make
>> it much more Pythonic; the old way reeks of Java-style get/sets. My
>> vote is to put them both in.
>
> Cool - I was hoping people would agree it is more pythonic.
>
> I have some follow up thoughts, or points for discussion ...
>
> Peter
>

It seems sensible to me to allow "deleting" a property to clear it.
There is an example in the proposed Bio/Application/__init__.py
docstring of how this would work:

>>> from Bio.Emboss.Applications import WaterCommandline
>>> cline = WaterCommandline(gapopen=10, gapextend=0.5)
>>> cline
WaterCommandline(cmd='water', gapopen=10, gapextend=0.5)

You can also manipulate the parameters via their properties, e.g.

>>> cline.gapopen
10
>>> cline.gapopen = 20
>>> cline
WaterCommandline(cmd='water', gapopen=20, gapextend=0.5)

You can clear a parameter you have already added by 'deleting' the
corresponding property:

>>> del cline.gapopen
>>> cline.gapopen
>>> cline
WaterCommandline(cmd='water', gapextend=0.5)

That does seem to work and covers most situations, however there is a
special case of command line "switches" (arguments which don't take a
value, like -kimura in ClustalW, or -l in ls). There are a lot of
these cases in Cymon's new alignment wrappers. These worked OK when
used with set_parameter("kimura"), the value is omitted and defaults
to None. Using the current patch, to set this via the keyword
argument or property, it must explicitly be set to None, which is
ugly:

>>> from Bio.Align.Applications import ClustalwCommandline
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=None, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura

For these "switch" arguments, perhaps the value should be interpreted
as a boolean (should the switch be added or not?).
This would be a
change to the current API, but I don't think any of the existing
wrappers actually have this kind of parameter, so there shouldn't be a
backwards compatibility issue here. Instead I want to do this:

>>> from Bio.Align.Applications import ClustalwCommandline
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=True, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=False, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5

An example use case is to allow parameter searches e.g.

from Bio.Align.Applications import ClustalwCommandline
for gap_open in [0, 1, 2, 10] :
    for gap_extend in [0, 0.25, 0.5] :
        for use_kimura in [True, False] :
            #Won't work yet!:
            cline = ClustalwCommandline(gapopen=gap_open,
                                        gapext=gap_extend,
                                        kimura=use_kimura,
                                        infile="demo.fasta")
            print cline

Or, modifying and reusing a single command line wrapper object:

from Bio.Align.Applications import ClustalwCommandline
#Set standard options:
cline = ClustalwCommandline(infile="demo.fasta")
#Do parameter sweep:
for gap_open in [0, 1, 2, 10] :
    cline.gapopen = gap_open
    for gap_extend in [0, 0.25, 0.5] :
        cline.gapext = gap_extend
        for use_kimura in [True, False] :
            cline.kimura = use_kimura #Won't work yet!
            print cline

Peter

From bugzilla-daemon at portal.open-bio.org Mon May 4 11:29:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 11:29:33 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs
In-Reply-To:
Message-ID: <200905041529.n44FTXr9025530@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822

------- Comment #7 from cymon.cox at gmail.com 2009-05-04 11:29 EST -------
(In reply to comment #6)
> Created an attachment (id=1291)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1291&action=view) [details]
> Patch to add keyword arguments, properties and __repr__ to command line
> wrappers
>
> As in previous patch but with support for clearing parameters by "deleting" the
> property, and some basic doctests in Bio.Application.
>
> Still need to co-ordinate with Cymon to give the Muscle wrapper a valid python
> identifier as an alias for the -in argument, e.g. "input", because we can't use
> just "in" as a property or keyword argument.

"input" for -in and maybe also "input1" "input2" as alternatives for -in1 -in2,
might be the way to go, and document it.

C.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From mjldehoon at yahoo.com Mon May 4 11:25:17 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 4 May 2009 08:25:17 -0700 (PDT)
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>
Message-ID: <3493.66471.qm@web62406.mail.re1.yahoo.com>

--- On Mon, 5/4/09, Peter Cock wrote:
> > My lean is towards ElementTree for reasons of code clarity. SAX
> > parsers require a lot of boilerplate style code.
They > also can be > > tricky with nested elements; I always find myself > using a lot of "if > > in_tag; else if in_tag" style code. ElementTree > eliminates a lot of > > these issues which should result in easier to maintain > code. This is partially true. SAX parsers can be complicated, but with some dedication reasonably clear code is also possible. The SAX parser in Bio.Entrez is not all that bad, and it can handle all kinds of different XML pages as long as a DTD is available. The prime motivation for ElementTree is that it's mutable; I don't know if that is really needed in this case. Another thing to consider is what to do with the result returned by ElementTree. Whereas it will contain all the information in the XML file, it may not represent it in a user-friendly way. You may want to take the output from ElementTree and store it in a more biopython-like object. Also keep in mind memory usage: ElementTree will keep the complete XML file in memory, whereas the SAX parser gives you more flexibility here (see below). That said, I don't have any fundamental objections against using ElementTree. > > We have been trying to avoid external library dependencies > where > possible (moving away from Martel for parsing has really > helped here). > Given ElementTree and cElementTree are included with Python > 2.5+, > this is only an issue for Biopython running on Python 2.4. I think it's OK to require Python 2.5 or later for Biopython. > P.S. I wonder if our BLAST XML parser would get a big speed > boost if we switched it to ElementTree instead of xml.sax? I doubt it, since the SAX parser is pretty straightforward -- the hard part is to go through the DTD and find out how to interpret each element in the XML (this is not time-consuming though). The key point though is memory usage. With the SAX parser, you can parse the XML file in chunks, and use an iterator to return individual Blast records -- you don't need to keep the full XML file in memory. 
The Blast parser NCBIXML.parse does exactly that. With ElementTree, as far as I understand you read in the full XML file and keep it in memory. --Michiel. From cy at cymon.org Mon May 4 11:34:52 2009 From: cy at cymon.org (Cymon Cox) Date: Mon, 4 May 2009 16:34:52 +0100 Subject: [Biopython-dev] Switches in the Bio.Application interface In-Reply-To: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com> References: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com> Message-ID: <7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com> 2009/5/4 Peter > On Mon, May 4, 2009 at 3:48 PM, Peter > wrote: > > That does seem to work and covers most situation, however there is a > special case of command line "switches" (arguments which don't take an > argument, like -kimura in ClustalW, or -l in ls). There are a lot of > these cases in Cymon's new alignment wrappers. These worked OK when > used with set_parameter("kimura"), the value is omitted and defaults > to None. Using the current patch, to set this via the keyword > argument or property, it must explicitly be set to None, which is > ugly: > > >>> from Bio.Align.Applications import ClustalwCommandline > >>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta") > clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 > >>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=None, > infile="demo.fasta") > clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura Ugly, and very confusing. > For these "switch" arguments, perhaps the value should be interpreted > as a boolean (should the switch be added or not?). This is what i did in my Muscle helper functions - so makes sense to me... C. -- ____________________________________________________________________ Cymon J. 
Cox Centro de Ciencias do Mar Faculdade de Ciencias do Mar e Ambiente (FCMA) Universidade do Algarve Campus de Gambelas 8005-139 Faro Portugal Phone: +0351 289800909 ext 7909 Fax: +0351 289800051 Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com HomePage : http://biology.duke.edu/bryology/cymon.html -8.63/-6.77 From p.j.a.cock at googlemail.com Mon May 4 11:45:12 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 May 2009 16:45:12 +0100 Subject: [Biopython-dev] XML parsing library for new modules In-Reply-To: <3493.66471.qm@web62406.mail.re1.yahoo.com> References: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com> <3493.66471.qm@web62406.mail.re1.yahoo.com> Message-ID: <320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com> Brad wrote: >>> My lean[ing] is towards ElementTree for reasons of code >>> clarity. SAX parsers require a lot of boilerplate style code. >>> They also can be tricky with nested elements; I always >>> find myself using a lot of "if in_tag; else if in_tag" style >>> code. ElementTree eliminates a lot of these issues >>> which should result in easier to maintain code. Michiel wrote: > This is partially true. SAX parsers can be complicated, but > with some dedication reasonably clear code is also possible. > The SAX parser in Bio.Entrez is not all that bad, and it can > handle all kinds of different XML pages as long as a DTD > is available. The prime motivation for ElementTree is that > it's mutable; I don't know if that is really needed in this case. Eric will have to answer that regarding PhyloXML, but if the aim is to turn it into one of our existing tree objects, then having the XML structure mutable is irrelevant. > Another thing to consider is what to do with the result > returned by ElementTree. Whereas it will contain all the > information in the XML file, it may not represent it in a > user-friendly way. 
> You may want to take the output from
> ElementTree and store it in a more biopython-like object.
> Also keep in mind memory usage: ElementTree will keep
> the complete XML file in memory, whereas the SAX
> parser gives you more flexibility here (see below).

Something for Eric to consider.

Michiel wrote:
> That said, I don't have any fundamental objections
> against using ElementTree.

Peter wrote:
>> We have been trying to avoid external library dependencies
>> where possible (moving away from Martel for parsing has
>> really helped here). Given ElementTree and cElementTree
>> are included with Python 2.5+, this is only an issue for
>> Biopython running on Python 2.4.
>
> I think it's OK to require Python 2.5 or later for Biopython.

At this stage I disagree; Python 2.4 would still be widely used on
production servers running stable distributions. Also we'd have to
give a couple of releases' notice about dropping Python 2.4 support.
In any case, if we want to use ElementTree with Python 2.4 this is
possible.

Peter wrote:
>> P.S. I wonder if our BLAST XML parser would get a big speed
>> boost if we switched it to ElementTree instead of xml.sax?
>
> I doubt it, since the SAX parser is pretty straightforward --
> the hard part is to go through the DTD and find out how to
> interpret each element in the XML (this is not
> time-consuming though). The key point though is memory
> usage. With the SAX parser, you can parse the XML file in
> chunks, and use an iterator to return individual Blast records
> -- you don't need to keep the full XML file in memory. The
> Blast parser NCBIXML.parse does exactly that. With
> ElementTree, as far as I understand you read in the full
> XML file and keep it in memory.

Keeping a full BLAST XML file in memory would be a bad idea, and
would spoil the memory savings of the iterator approach to parsing it.
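For illustration, the chunked SAX approach described above can be sketched roughly as follows. This is a minimal sketch using only the standard library (xml.sax), written for current Python rather than the Python 2 of this thread. It is not the NCBIXML.parse code; the names iter_records and _RecordCollector are hypothetical, and it assumes flat records whose child elements hold plain text.

```python
import xml.sax


class _RecordCollector(xml.sax.ContentHandler):
    """Collect each completed record element as a dict of child tag -> text."""

    def __init__(self, record_tag):
        super().__init__()
        self.record_tag = record_tag
        self.completed = []   # finished records not yet handed to the caller
        self._current = None  # record being built, or None when outside one
        self._text = []       # character data seen since the last tag

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self._current = {}
        self._text = []

    def characters(self, content):
        # Character data may arrive in several pieces, so accumulate it.
        self._text.append(content)

    def endElement(self, name):
        if name == self.record_tag:
            self.completed.append(self._current)
            self._current = None
        elif self._current is not None:
            self._current[name] = "".join(self._text).strip()
        self._text = []


def iter_records(handle, record_tag, chunk_size=1024):
    """Yield one dict per record, reading the file in small chunks."""
    parser = xml.sax.make_parser()
    collector = _RecordCollector(record_tag)
    parser.setContentHandler(collector)
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand over whatever records this chunk completed, then forget them.
        while collector.completed:
            yield collector.completed.pop(0)
    parser.close()
    while collector.completed:
        yield collector.completed.pop(0)
```

Only the current chunk and the records completed by it are held at any time, which is the memory saving being discussed; the whole document is never built as a tree.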
So ElementTree isn't suitable for everything ;) Peter From biopython at maubp.freeserve.co.uk Mon May 4 11:47:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 4 May 2009 16:47:58 +0100 Subject: [Biopython-dev] Switches in the Bio.Application interface In-Reply-To: <7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com> References: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com> <7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com> Message-ID: <320fb6e00905040847s32bc9e4fr3f7fb045b2d3429b@mail.gmail.com> On Mon, May 4, 2009 at 4:34 PM, Cymon Cox wrote: > >> For these "switch" arguments, perhaps the value should be interpreted >> as a boolean (should the switch be added or not?). > > This is what i did in my Muscle helper functions - so makes sense to me... > Good :) Peter From bugzilla-daemon at portal.open-bio.org Mon May 4 12:29:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 12:29:10 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041629.n44GTAeq030521@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1291 is|0 |1 obsolete| | ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 12:29 EST ------- (From update of attachment 1291) Checked into CVS: Checking in Tests/test_Prank_tool.py; /home/repository/biopython/biopython/Tests/test_Prank_tool.py,v <-- test_Prank_tool.py new revision: 1.5; previous revision: 1.4 done Checking in Tests/test_Muscle_tool.py; /home/repository/biopython/biopython/Tests/test_Muscle_tool.py,v <-- test_Muscle_tool.py new revision: 1.7; previous revision: 1.6 done Checking in Tests/test_Emboss.py; 
/home/repository/biopython/biopython/Tests/test_Emboss.py,v <-- test_Emboss.py new revision: 1.20; previous revision: 1.19 done Checking in Tests/test_Clustalw_tool.py; /home/repository/biopython/biopython/Tests/test_Clustalw_tool.py,v <-- test_Clustalw_tool.py new revision: 1.13; previous revision: 1.12 done Checking in Bio/Application/__init__.py; /home/repository/biopython/biopython/Bio/Application/__init__.py,v <-- __init__.py new revision: 1.15; previous revision: 1.14 done Checking in Bio/Emboss/Applications.py; /home/repository/biopython/biopython/Bio/Emboss/Applications.py,v <-- Applications.py new revision: 1.23; previous revision: 1.22 done Checking in Bio/AlignAce/Applications.py; /home/repository/biopython/biopython/Bio/AlignAce/Applications.py,v <-- Applications.py new revision: 1.5; previous revision: 1.4 done Checking in Bio/Motif/Applications/_AlignAce.py; /home/repository/biopython/biopython/Bio/Motif/Applications/_AlignAce.py,v <-- _AlignAce.py new revision: 1.3; previous revision: 1.2 done Checking in Bio/Align/Applications/_Clustalw.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Clustalw.py,v <-- _Clustalw.py new revision: 1.5; previous revision: 1.4 done Checking in Bio/Align/Applications/_Mafft.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Mafft.py,v <-- _Mafft.py new revision: 1.4; previous revision: 1.3 done Checking in Bio/Align/Applications/_Muscle.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Muscle.py,v <-- _Muscle.py new revision: 1.6; previous revision: 1.5 done Checking in Bio/Align/Applications/_Prank.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Prank.py,v <-- _Prank.py new revision: 1.4; previous revision: 1.3 done (In reply to comment #7) > (In reply to comment #6) > > Still need to co-ordinate with Cymon to give the Muscle wrapper a valid > > python identifier as an alias for the -in argument, e.g. 
"input", because > > we can't use just "in" as a property or keyword argument. > > "input" for -in and maybe also "input1" "input2" as alternatives for -in1 > -in2, might the the way to go, and document it. I've used "input" as the preferred alias for "-in". Leaving this bug open to cover dealing with "switch" arguments like -kimura in clustalw, where it makes sense to treat the value as a boolean (see dev mailing list). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon May 4 13:48:28 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 13:48:28 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905041748.n44HmSaN003712@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #23 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 13:48 EST ------- In Prank, should realbranches take no arguments? i.e. use the new _Switch class? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon May 4 13:49:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 13:49:20 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041749.n44HnK8j003766@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 ------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 13:49 EST ------- (In reply to comment #8) > Leaving this bug open to cover dealing with "switch" arguments like -kimura in > clustalw, where it makes sense to treat the value as a boolean (see dev mailing > list). Done in CVS, I think. Next, more test and documentation... Checking in Bio/Application/__init__.py; /home/repository/biopython/biopython/Bio/Application/__init__.py,v <-- __init__.py new revision: 1.16; previous revision: 1.15 done Checking in Bio/Align/Applications/_Clustalw.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Clustalw.py,v <-- _Clustalw.py new revision: 1.6; previous revision: 1.5 done Checking in Bio/Align/Applications/_Mafft.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Mafft.py,v <-- _Mafft.py new revision: 1.5; previous revision: 1.4 done Checking in Bio/Align/Applications/_Muscle.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Muscle.py,v <-- _Muscle.py new revision: 1.7; previous revision: 1.6 done Checking in Bio/Align/Applications/_Prank.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Prank.py,v <-- _Prank.py new revision: 1.5; previous revision: 1.4 done Checking in Tests/test_Clustalw_tool.py; /home/repository/biopython/biopython/Tests/test_Clustalw_tool.py,v <-- test_Clustalw_tool.py new revision: 1.14; previous revision: 1.13 done Checking in Tests/test_Muscle_tool.py; /home/repository/biopython/biopython/Tests/test_Muscle_tool.py,v <-- test_Muscle_tool.py new revision: 1.8; 
previous revision: 1.7
done
Checking in Tests/test_Prank_tool.py;
/home/repository/biopython/biopython/Tests/test_Prank_tool.py,v <-- test_Prank_tool.py
new revision: 1.6; previous revision: 1.5
done

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Tue May 5 08:04:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 5 May 2009 08:04:09 -0400
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To:
Message-ID: <200905051204.n45C4987022142@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294

------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-05 08:04 EST -------
Created an attachment (id=1292)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1292&action=view)
Patch to Bio/SeqIO/InsdcIO.py to write GenBank features

This patch adds basic support for writing features in GenBank files.
There is still plenty to do:

* Full testing, both manual and with extended unit test coverage
* Wrapping long feature locations
* Writing references
* Extending to cover writing EMBL files

Note that this requires the latest Bio.GenBank code from CVS, as during
this work I found and fixed two small issues with the location parsing.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chapmanb at 50mail.com Tue May 5 08:36:57 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 5 May 2009 08:36:57 -0400 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> Message-ID: <20090505123656.GB15113@sobchak.mgh.harvard.edu> Hi Peter; Nice to have you back. Hope you had a relaxing few days away. > I updated the patch on Bug 2822 to cover all the Bio.Application > command line wrapper subclasses, and included __repr__ support. > However, that has raised a real example of a parameter where the > current "human readable" name is not a valid python identifier ("in", > for "-in" in Muscle). I think the pragmatic solution is to add a > sensible alternative which we can use for the property and keyword > argument name (e.g. "input" in this case) while in general keeping > these names as close as possible to the actual parameter name as used > at the command line. Agreed. This is the best solution for these few conflicting cases. > On the other hand, some might argue for giving all the options > meaningful names. The (hardly used) existing blastall wrapper in > Bio/Blast/Applications.py gives the "-a" argument a human readable > name of "nprocessors", and "-A" gets "window_size". With the old > set_parameter call either alias could be used. However, with a python > property we need to pick one as a preferred name - and I'm not 100% > sure being helpful and using "nprocessors" (e.g. cline.nprocessors=4) > is actually better than using the actual argument name (e.g. cline.a = > 4). Could we support both the original argument and optional human readable arguments? I know the code in Application is a bit hard coded for the first argument as the real name and the last argument as the readable name; the cleanest solution would be to generalize this to have multiple names where it makes sense. 
More practically, it always makes sense to have the low level
standard arguments from the program itself. Even if they are
non-intuitive, like BLAST's switches, people who already understand
the program can just use their existing knowledge without needing
any Biopython-specific knowledge.

Where someone wants to support more useful names, they can
add those in.

You have been digging around in this so probably have a good idea
how hard this is to implement practically. If it's a pain, I'd argue
to just have the original arguments now, and the useful names can go
on a todo list.

Brad

From chapmanb at 50mail.com Tue May 5 08:50:59 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 5 May 2009 08:50:59 -0400
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com>
References: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>
	<3493.66471.qm@web62406.mail.re1.yahoo.com>
	<320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com>
Message-ID: <20090505125058.GC15113@sobchak.mgh.harvard.edu>

Peter, Michiel and Eric;

> > Another thing to consider is what to do with the result
> > returned by ElementTree. Whereas it will contain all the
> > information in the XML file, it may not represent it in a
> > user-friendly way. You may want to take the output from
> > ElementTree and store it in a more biopython-like object.

Agreed. Most of the fun creative parts of the project, as opposed to
the parsing nuts and bolts, will be in developing the object
representations.

> > Also keep in mind memory usage: ElementTree will keep
> > the complete XML file in memory, whereas the SAX
> > parser gives you more flexibility here (see below).
ElementTree can do incremental parsing, so you can also deal with large files using it: http://effbot.org/zone/element-iterparse.htm Brad From biopython at maubp.freeserve.co.uk Tue May 5 09:58:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 May 2009 14:58:04 +0100 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <20090505123656.GB15113@sobchak.mgh.harvard.edu> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> <20090505123656.GB15113@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00905050658h2cabf55dhfbb467042135843a@mail.gmail.com> On Tue, May 5, 2009 at 1:36 PM, Brad Chapman wrote: > > Could we support both the original argument and optional human > readable arguments? I know the code in Application is a bit > hard coded for the first argument as the real name and the last > argument as the readable name; the cleanest solution would be to > generalize this to have multiple names where it makes sense. You mean for these BLAST examples, create two properties "a" and "nprocessors", both controlling the "-a" parameter, and also two properties "A" and "window_size" both controlling "-A"? From a code point of view, this would be moderately straight forward - but I'm not convinced about this. > More practically, it always makes sense to have the low level > standard arguments from the program itself. Even if it is > non-intuitive like BLASTs switches, people who already understand > the program can just use their existing knowledge without any > specific knowledge of how Biopython. Yes :) Personally I initially found it very frustrating when using the Bio.Blast.NCBIStandalone.blastall wrapper because the NCBI switches had all been given friendly names, and it wasn't clear without looking at the source code what mapped to what. As a minor change, I think the Bio.Blast.NCBIStandalone.blastall docstring should actually include the real NCBI switch used by each Biopython keyword. 
> Where someone wants to support more useful names, they can
> add those in.

So that we cater to those familiar with the NCBI command line
arguments, but also give a more human alternative? On the downside,
it means there are two ways to set these parameters. Also, if we go
down this route for consistency for all command line wrappers we may
want to invent more human readable aliases (if the tool arguments are
too cryptic). We are also opening up a potential problem if the tool
later adds a new argument whose name clashes with one of our
inventions. Also would we care about the lack of consistency between
tools (e.g. infile versus input?), and should we try and be
consistent in our new names?

I favour using only a single property for each parameter, with the
name as similar as possible to the actual command line switch (i.e.
property name "a" for "-a", not "nprocessors"). Note each property
would have a docstring which will say what it is for ("Number of
processors to use.").

In the case of the existing blastall wrapper in
Bio.Blast.Applications, I would change names=["-a", "nprocessors"] to
["-a", "nprocessors", "a"], meaning "a" (last entry) would be the
property name used, "-a" (first entry) would be used for the actual
command line string. I would keep the "nprocessors" alias for
backwards compatibility only - all three aliases would be available
to the (legacy) method set_parameter.

> You have been digging around in this so probably have a good idea
> how hard this is to implement practically. If it's a pain, I'd argue
> to just have the original arguments now, and the useful names can go
> on a todo list.

It is certainly possible, although probably a bit tedious due to
changing the "boilerplate" code.
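As a rough illustration of the naming scheme proposed above (first entry is the literal switch, last entry is the property name, every entry is accepted by set_parameter), here is a minimal sketch in current Python. It is not the Bio.Application code: the classes Option and CommandLine and the "-a 4" output format are assumptions made for the example.

```python
class Option:
    """One command line parameter known under several names.

    names[0] is the literal switch, names[-1] is the Python attribute name,
    and any name in the list is accepted by set_parameter.
    """

    def __init__(self, names, description):
        self.names = names
        self.description = description
        self.value = None


class CommandLine:
    """Hypothetical wrapper exposing each option as a single attribute."""

    def __init__(self, program, options):
        # Bypass our own __setattr__ while storing internal state.
        object.__setattr__(self, "program", program)
        object.__setattr__(self, "_options", list(options))

    def set_parameter(self, name, value):
        """Legacy style setter: any alias of the option is accepted."""
        for option in self._options:
            if name in option.names:
                option.value = value
                return
        raise ValueError("Option name %r was not found." % name)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        for option in self.__dict__.get("_options", []):
            if option.names[-1] == name:
                return option.value
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Only the preferred (last) name works as an attribute.
        for option in self._options:
            if option.names[-1] == name:
                option.value = value
                return
        raise AttributeError("No parameter named %r" % name)

    def __str__(self):
        parts = [self.program]
        for option in self._options:
            if option.value is not None:
                parts.append("%s %s" % (option.names[0], option.value))
        return " ".join(parts)


cline = CommandLine(
    "blastall",
    [Option(["-a", "nprocessors", "a"], "Number of processors to use.")],
)
cline.a = 4                              # preferred property name
cline.set_parameter("nprocessors", 2)    # legacy alias still accepted
print(cline)
```

With this layout there is exactly one attribute per parameter, while older code calling set_parameter with the friendly alias keeps working.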
Peter

From bugzilla-daemon at portal.open-bio.org Tue May 5 10:37:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 5 May 2009 10:37:56 -0400
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To:
Message-ID: <200905051437.n45EbuNA006427@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1292 is|0                           |1
           obsolete|                            |

------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-05 10:37 EST -------
(From update of attachment 1292)
Checked into CVS now.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From p.j.a.cock at googlemail.com Tue May 5 11:26:20 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 May 2009 16:26:20 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)
In-Reply-To: <320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
	<320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>
Message-ID: <320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com>

On Tue, Apr 21, 2009 at 2:55 PM, Peter Cock wrote:
>> I have also been thinking about how I would (re)design the SeqFeature
>> and FeatureLocation objects. In particular I would want to put the
>> strand as part of the same object as the location, and also any
>> join-locations. I would still want to cope with fuzzy locations, but
>> make the non-fuzzy approximations more prominent in comparison.
>> Also,
>> I really don't like the way joins are currently stored as more
>> SeqFeatures in the sub_features list (plus this kind of blocks
>> alternative usage for child/parent nesting that might be nice for GFF
>> files).
>>
>> The prime use case to keep in mind is taking a feature location (even
>> a join), and using this to extract that region of nucleotides from the
>> parent sequence (i.e. a Seq object or a SeqRecord object, as now both
>> can be sliced).

I've written code to do this in test_SeqIO_features.py, which cross
checks the nucleotides pulled out from a GenBank file based on the
SeqFeature, against what the NCBI provide in FASTA format. This seems
to work OK, but has not been tested extensively (e.g. running it on
drosophila or arabidopsis would be good).

It could make sense to expose this functionality directly in
Biopython, maybe as a method of the SeqRecord taking a SeqFeature (or
the index of a feature in that record), returning a Seq object (or
perhaps a SeqRecord using the feature's annotation). e.g.

>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"),"genbank")
>>> record.extract_feature_seq(6)
Seq('GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAG...TAA', IUPACAmbiguousDNA())
>>> feature = record.features[6]
>>> record.extract_feature_seq(feature)
Seq('GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAG...TAA', IUPACAmbiguousDNA())

Alternatively, rather than introducing a new method (e.g.
"extract_feature_seq" as in the above example) we could overload the
__getitem__ method of the SeqRecord, i.e. overloading the slice
mechanism so a SeqFeature can alternatively be given, e.g.
record[feature]. Note that passing the index of a feature wouldn't
work as record[6] currently means the seventh letter, rather than the
seventh feature.

Note that just passing a SeqFeature's FeatureLocation is not enough,
as this lacks the strand information, and also any sub-features and
associated location operator (i.e. join).
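To make the point about strand and join information concrete, here is a minimal sketch of extracting a (possibly joined, possibly mixed strand) region from a parent sequence. It works on plain strings and is not the test_SeqIO_features.py code; the function names and the representation of a location as a list of (start, end, strand) tuples are hypothetical. It also assumes that the parts are concatenated in the order given and that each part is reverse complemented on its own, which may not be the right convention for every join.

```python
# Translation table for complementing unambiguous DNA letters.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def extract(parent, parts):
    """Extract and concatenate the given parts of the parent sequence.

    parts is a list of (start, end, strand) tuples using Python style
    zero-based, end-exclusive coordinates, with strand +1 or -1.
    """
    pieces = []
    for start, end, strand in parts:
        piece = parent[start:end]
        if strand == -1:
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


parent = "ATGCCGTA"
# Forward strand part followed by a reverse strand part.
print(extract(parent, [(0, 3, +1), (4, 8, -1)]))
```

A GenBank location such as 5..8 would become (4, 8, +1) here, since GenBank counts from one and includes the end position. A single start and end pair cannot express either the strand or the list of parts, which is why the location alone is not enough.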
> I forgot to mention the second major use case I'm concerned about,
> which is recovering the GenBank/EMBL style location string. I have
> looked at this in the past, by adding methods to the FeatureLocation
> and all the Position objects, but it is complicated by the fact the
> Position objects don't know if they are at the start or end (and for
> the start locations we need to add one to convert from Python
> counting). This is the main block on having Bio.SeqIO support writing
> GenBank (or EMBL) files with their features included.

See Bug 2294 for writing GenBank files:
http://bugzilla.open-bio.org/show_bug.cgi?id=2294

I've just checked in some code to record the features when writing
GenBank files with Bio.SeqIO. I solved the feature location issue by
introducing a private function which knows about all the currently
used AbstractPosition objects - the code is actually pretty short.

Peter

From p.j.a.cock at googlemail.com Tue May 5 12:41:31 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 May 2009 17:41:31 +0100
Subject: [Biopython-dev] Dropping Python 2.3 support in Biopython
Message-ID: <320fb6e00905050941m37725eb5ibba02ca99236212e@mail.gmail.com>

Hello all,

This is a final warning that the next release of Biopython will not
support Python 2.3.

As far as we are aware, no-one has come forward with a need for
continued support for Python 2.3, so we will soon begin removing the
special case code needed to keep Biopython working on Python 2.3.
This will give us a simpler code base, fewer platforms to test on,
and we can also take advantage of various language features only
available in Python 2.4+ (e.g. generator expressions and decorators).

Any last minute requests to postpone this should be made to the main
Biopython mailing list by Friday 8 May.
Thank you, Peter From sbassi at clubdelarazon.org Tue May 5 18:49:11 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Tue, 5 May 2009 19:49:11 -0300 Subject: [Biopython-dev] Missing directories with easy_install? Message-ID: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com> When I install Biopython 1.5 (and previous versions too) using easy_install, it seems that docs, test and scripts directories are not installed (see here for a screenshot, panel at left is easy_install product while right panel is when I manually uncompress biopython tarball: http://www.genesdigitales.com/bioinfo/biopy.jpg). Is this expected or an oversight? From biopython at maubp.freeserve.co.uk Tue May 5 18:56:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 May 2009 23:56:00 +0100 Subject: [Biopython-dev] Missing directories with easy_install? In-Reply-To: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com> References: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com> Message-ID: <320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com> On Tue, May 5, 2009 at 11:49 PM, Sebastian Bassi wrote: > When I install Biopython 1.5 (and previous versions too) using > easy_install, it seems that docs, test and scripts directories are not > installed (see here for a screenshot, panel at left is easy_install > product while right panel is when I manually uncompress biopython > tarball: http://www.genesdigitales.com/bioinfo/biopy.jpg). > Is this expected or an oversight? You'd have to ask Brad for an expert opinion, but I think this is probably to be expected. If you install from source, the only folders copied to site-packages are Bio, BioSQL, and Martel. See also this thread: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005924.html Peter P.S. 
I assume you meant Biopython 1.50 and not 1.5 ;)

From sbassi at clubdelarazon.org Tue May 5 19:05:46 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 5 May 2009 20:05:46 -0300
Subject: [Biopython-dev] Missing directories with easy_install?
In-Reply-To: <320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com>
References: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com>
	<320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com>
Message-ID: <9e2f512b0905051605k663035d7td84372847675c7d4@mail.gmail.com>

On Tue, May 5, 2009 at 7:56 PM, Peter wrote:
> You'd have to ask Brad for an expert opinion, but I think this is
> probably to be expected. If you install from source, the only folders
> copied to site-packages are Bio, BioSQL, and Martel.
> See also this thread:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005924.html

OK, so that is expected.

> P.S. I assume you meant Biopython 1.50 and not 1.5 ;)

Yes!

From biopython at maubp.freeserve.co.uk Tue May 5 19:33:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 May 2009 00:33:16 +0100
Subject: [Biopython-dev] SeqRecord per-letter-annotation : avoid lists?
Message-ID: <320fb6e00905051633i70604746i332b3bfaf3476876@mail.gmail.com>

Hi all,

I was thinking about the SeqRecord object's letter_annotations, and
that perhaps we should only allow strings and tuples (which are
immutable), but not lists. Because lists are mutable, the user can
(accidentally) alter the list such that its length doesn't match that
of the associated sequence (which would be bad).

Currently we do use lists in the SeqRecord's letter_annotations, e.g.
for qualities. I don't recall having any particular reason for using
a list rather than a tuple.

Any thoughts on this?
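The concern above can be shown with a small stand-in class. This is a minimal sketch, not the SeqRecord implementation: the class Record and its two methods are hypothetical, and it simply checks the length once and then stores an immutable copy.

```python
class Record:
    """Toy stand-in for a sequence record with per-letter annotation."""

    def __init__(self, seq):
        self.seq = seq
        self._letter_annotations = {}

    def set_letter_annotation(self, key, values):
        """Store per-letter values after checking they match the sequence."""
        if len(values) != len(self.seq):
            raise ValueError(
                "Expected %i values, got %i" % (len(self.seq), len(values))
            )
        # Strings are already immutable; anything else is copied to a tuple
        # so later changes to the caller's list cannot alter the length.
        if isinstance(values, str):
            self._letter_annotations[key] = values
        else:
            self._letter_annotations[key] = tuple(values)

    def get_letter_annotation(self, key):
        return self._letter_annotations[key]


record = Record("ACGT")
qualities = [30, 31, 32, 33]
record.set_letter_annotation("quality", qualities)
qualities.append(34)  # the caller's list changes, the stored tuple does not
print(record.get_letter_annotation("quality"))
```

If the list itself were stored, the append on the last lines would silently leave five quality values attached to a four letter sequence.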
Peter

From p.j.a.cock at googlemail.com Wed May 6 06:32:01 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 6 May 2009 11:32:01 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)
In-Reply-To: <320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
	<320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>
	<320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com>
Message-ID: <320fb6e00905060332t2b9d9595pca68b83db8cef28f@mail.gmail.com>

On Tue, May 5, 2009 at 4:26 PM, Peter Cock wrote:
> On Tue, Apr 21, 2009 at 2:55 PM, Peter Cock wrote:
>>> The prime use case to keep in mind is taking a feature location (even
>>> a join), and using this to extract that region of nucleotides from the
>>> parent sequence (i.e. a Seq object or a SeqRecord object, as now both
>>> can be sliced).
>
> I've written code to do this in test_SeqIO_features.py, which cross
> checks the nucleotides pulled out from a GenBank file based on the
> SeqFeature, against what the NCBI provide in FASTA format. This seems
> to work OK, but has not been tested extensively (e.g. running it on
> drosophila or arabidopsis would be good).

Yep - found a corner case my code can't yet cope with, from the
Arabidopsis thaliana chloroplasts (NC_000932). This has some
pathological mixed strand locations, like
join(complement(69611..69724),139856..140650) which is for a
trans-spliced ribosomal protein.

> It could make sense to expose this functionality directly in
> Biopython, ...

Given this code is non-trivial to implement, this seems worth doing.
Peter From bugzilla-daemon at portal.open-bio.org Wed May 6 18:50:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 6 May 2009 18:50:08 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200905062250.n46Mo8EM023616@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #9 from eric.talevich at gmail.com 2009-05-06 18:50 EST ------- Created an attachment (id=1293) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1293&action=view) Additional warnings test for Py2.6+ This is the file that test_PDB_unit.py can import to plug in an additional test for specific warnings. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 6 18:54:06 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 6 May 2009 18:54:06 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200905062254.n46Ms6YP023831@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #10 from eric.talevich at gmail.com 2009-05-06 18:54 EST ------- Created an attachment (id=1294) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1294&action=view) test_PDB_unit.py, with conditional import This is a modified test_PDB_unit.py that checks whether the necessary context manager is available (it will be for Py2.6+), and if so, imports the additional unit test from _PDB_extra.py into the current class. (Sorry it's a whole file, I was having trouble diffing between git branches.) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 7 04:51:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 04:51:35 -0400 Subject: [Biopython-dev] [Bug 2824] New: Bio.Entrez.epost is using an HTTP GET not an HTTP POST Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2824 Summary: Bio.Entrez.epost is using an HTTP GET not an HTTP POST Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Following from a query on our mailing list suggesting Bio.Entrez.epost is failing with long ID lists, I looked a little more closely at the code and it is actually using an HTTP GET instead of an HTTP POST (which would avoid the long URL problem). See: http://lists.open-bio.org/pipermail/biopython/2009-May/005149.html We can still use urllib to do this with its data argument... http://docs.python.org/library/urllib.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
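As a rough sketch of the idea (written for Python 3's urllib.request rather than the Python 2 urllib of the time, and not the actual patch): passing a data argument makes the request an HTTP POST, so the ID list travels in the request body and the URL stays short. The helper name build_epost_request is made up for illustration, and the URL constant is my assumption of the EPost address.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed EPost address; adjust if NCBI documents a different one.
EPOST_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"


def build_epost_request(db, ids, url=EPOST_URL):
    """Build (but do not send) a POST request carrying the ID list."""
    params = urlencode({"db": db, "id": ",".join(str(i) for i in ids)})
    # Supplying data (as bytes) is what switches urllib from GET to POST.
    return Request(url, data=params.encode("ascii"))


request = build_epost_request("pubmed", range(1, 10000))
print(request.get_method())   # POST
print(len(request.full_url))  # URL length does not grow with the ID list
```

Sending it would then be a matter of urllib.request.urlopen(request), which is left out here so the sketch runs without network access.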
From bugzilla-daemon at portal.open-bio.org Thu May 7 05:18:58 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 05:18:58 -0400 Subject: [Biopython-dev] [Bug 2824] Bio.Entrez.epost is using an HTTP GET not an HTTP POST In-Reply-To: Message-ID: <200905070918.n479IwHQ031195@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2824 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-07 05:18 EST ------- Created an attachment (id=1295) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1295&action=view) Patch for Bio/Entrez/__init__.py This patch does two things, (1) Makes Bio.Entrez.epost do an HTTP POST (2) Catches the too long URL error 414 messages and raises an IOError Without the patch: >>> print Entrez.epost("pubmed", id=",".join(str(i) for i in range(1,10000))).read() 414 Request-URI Too Large

Request-URI Too Large

The requested URL's length exceeds the capacity limit for this server.

>>> print Entrez.efetch("pubmed", id=",".join(str(i) for i in range(1,10000)), retmode="text", rettype="uilist").read() 414 Request-URI Too Large

Request-URI Too Large

The requested URL's length exceeds the capacity limit for this server.

Note both the above trigger the Error 414 message, but it does not get caught. With the patch: >>> print Entrez.epost("pubmed", id=",".join(str(i) for i in range(1,10000))).read() 1 NCID_01_264798363_130.14.18.47_9001_1241687667 >>> print Entrez.efetch("pubmed", id=",".join(str(i) for i in range(1,10000)), retmode="text", rettype="uilist").read() Traceback (most recent call last): File "", line 1, in File "Bio/Entrez/__init__.py", line 126, in efetch return _open(cgi, variables) File "Bio/Entrez/__init__.py", line 370, in _open raise IOError("Requested URL too long (try using EPost?)") IOError: Requested URL too long (try using EPost?) Now epost works with long arguments, and using the other tools with too long a URL will trigger an IOError. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu May 7 06:20:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 06:20:10 -0400 Subject: [Biopython-dev] [Bug 2824] Bio.Entrez.epost is using an HTTP GET not an HTTP POST In-Reply-To: Message-ID: <200905071020.n47AKAGD002826@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2824 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-07 06:20 EST ------- Patch checked in (OK'd with Michiel), marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
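For readers without the attachment, one way such a check could work is sketched below. This is an assumption about the approach, not a copy of the patch: it simply looks for the 414 banner quoted above near the start of the returned page and raises the same IOError message. The function name check_for_long_url and the 500 character window are made up for this example.

```python
def check_for_long_url(page):
    """Raise IOError if page looks like the server's 414 error page.

    page is the text returned by the server; only the start is
    inspected so a real record mentioning "414" later on is not
    mistaken for an error.
    """
    if "414 Request-URI Too Large" in page[:500]:
        raise IOError("Requested URL too long (try using EPost?)")
    return page


error_page = ("414 Request-URI Too Large\n\nRequest-URI Too Large\n\n"
              "The requested URL's length exceeds the capacity limit "
              "for this server.\n")
try:
    check_for_long_url(error_page)
except IOError as err:
    print(err)
```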
From bugzilla-daemon at portal.open-bio.org Thu May 7 09:56:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 09:56:09 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905071356.n47Du9iQ018532@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #24 from cymon.cox at gmail.com 2009-05-07 09:56 EST ------- (In reply to comment #23) > In Prank, should realbranches take no arguments? i.e. use the new _Switch > class? Yes, verified and done; pushed to applic-int branch. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu May 7 10:07:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 10:07:23 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905071407.n47E7Nn7019531@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #25 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-07 10:07 EST ------- (In reply to comment #24) > (In reply to comment #23) > > In Prank, should realbranches take no arguments? i.e. use the new _Switch > > class? > > Yes, verified and done; pushed to applic-int branch. > C. Thanks for checking - that's done in CVS now. I think the final bit of new code is _Dialign.py which still needs to be updated for the new style __init__ method. Then there are your unit tests... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 7 10:39:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 10:39:40 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905071439.n47Edeaj022126@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #26 from cymon.cox at gmail.com 2009-05-07 10:39 EST ------- (In reply to comment #25) > (In reply to comment #24) > > (In reply to comment #23) > > > In Prank, should realbranches take no arguments? i.e. use the new _Switch > > > class? > > > > Yes, verified and done; pushed to applic-int branch. > > C. > > Thanks for checking - that's done in CVS now. > > I think the final bit of new code is _Dialign.py which still needs to be > updated for the new style __init__ method. Done - pushed to applic-int (Note Windows path stuff absent from _Dialign) > Then there are your unit tests... As they are at present, unit tests for Muscle, Mafft, Dialign and Prank all pass. They could of course be made arbitrarily more complex... they should probably have at least one test that uses the properties style parameter setting rather than just set_parameter() C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu May 7 11:22:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 11:22:35 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905071522.n47FMZ16025500@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #27 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-07 11:22 EST ------- (In reply to comment #26) > > I think the final bit of new code is _Dialign.py which still needs to be > > updated for the new style __init__ method. > > Done - pushed to applic-int (Note windows path stuff absent from _Dialign) > OK, that is in CVS now. > > Then there are your unit tests... > > As they are at present, unittests for Muscle, Mafft, Dialign and Prank all > pass. They could of course be made arbitrarily more complex... they should > probably have at least one test that uses the properties style parameter > setting rather than just set_paramter() > C. I've added test_Dialign_tool.py to CVS, and then switched a few to using keyword arguments and properties. As far as I can see from here, the tool isn't expected to work on Windows (although it might still be possible with cygwin): http://bibiserv.techfak.uni-bielefeld.de/download/tools/DIALIGN_221.html Is that everything? You'd mentioned a more general test which just builds the strings, but doesn't actually need to run any of the tools themselves. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
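A test of that "just build the string" kind might look roughly like the sketch below. To keep it self-contained it uses a toy wrapper class, ToyCommandline, which is entirely hypothetical and is not the Bio.Application interface; the tool name "toyalign" and the "-name=value" output style are also invented. The point is only that keyword arguments and set_parameter() calls can be compared via str() without the external tool being installed.

```python
import unittest


class ToyCommandline:
    """Hypothetical stand-in for a command line wrapper (not Bio.Application)."""

    def __init__(self, cmd, **kwargs):
        self.cmd = cmd
        self.parameters = {}
        for name, value in kwargs.items():
            self.set_parameter(name, value)

    def set_parameter(self, name, value):
        self.parameters[name] = value

    def __str__(self):
        parts = [self.cmd]
        for name in sorted(self.parameters):
            value = self.parameters[name]
            if value is True:
                # A switch: present on the command line, takes no argument.
                parts.append("-%s" % name)
            elif value is not False and value is not None:
                parts.append("-%s=%s" % (name, value))
        return " ".join(parts)


class CommandStringTests(unittest.TestCase):
    """Check the generated strings only; nothing is executed."""

    def test_keywords_match_set_parameter(self):
        by_keyword = ToyCommandline("toyalign", input="seqs.fasta",
                                    realbranches=True)
        by_method = ToyCommandline("toyalign")
        by_method.set_parameter("input", "seqs.fasta")
        by_method.set_parameter("realbranches", True)
        self.assertEqual(str(by_keyword), str(by_method))
        self.assertEqual(str(by_keyword),
                         "toyalign -input=seqs.fasta -realbranches")
```

Saved as a test module, this can be run with "python -m unittest" on any machine, including ones without the alignment tools.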
From bugzilla-daemon at portal.open-bio.org Fri May 8 08:07:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 08:07:03 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905081207.n48C73cT012732@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #28 from cymon.cox at gmail.com 2009-05-08 08:07 EST ------- (In reply to comment #27) > (In reply to comment #26) > > > I think the final bit of new code is _Dialign.py which still needs to be > > > updated for the new style __init__ method. > > > > Done - pushed to applic-int (Note windows path stuff absent from _Dialign) > > > > OK, that is in CVS now. > > > > Then there are your unit tests... > > > > As they are at present, unittests for Muscle, Mafft, Dialign and Prank all > > pass. They could of course be made arbitrarily more complex... they should > > probably have at least one test that uses the properties style parameter > > setting rather than just set_paramter() > > C. > > I've added test_Dialign_tool.py to CVS, and then switched a few to using > keyword arguments and properties. As far as I can see from here, the tool > isn't expected to work on Windows (although it might still be possible with > cygwin): > http://bibiserv.techfak.uni-bielefeld.de/download/tools/DIALIGN_221.html > > Is that everything? That's everything currently written. I still want to add interfaces to ProbCons and T-Coffee. You'd mentioned a more general test which just builds the > strings, but doesn't actually need to run any of the tools themselves. Yes, I'll do that. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri May 8 08:23:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 08:23:03 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905081223.n48CN3nV013977@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #29 from chapmanb at 50mail.com 2009-05-08 08:23 EST ------- Created an attachment (id=1296) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1296&action=view) Start of TCoffee command line Cymon; Here is the start of a TCoffee command line object. It's not up to date with the latest changes y'all have been making and doesn't have all the options, but should save some typing. Brad -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 8 15:14:27 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 15:14:27 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200905081914.n48JERYx012798@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1293 is|0 |1 obsolete| | Attachment #1294 is|0 |1 obsolete| | ------- Comment #11 from eric.talevich at gmail.com 2009-05-08 15:14 EST ------- Created an attachment (id=1297) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1297&action=view) Py2.6-only unit test of PDB warnings I pushed a branch called bug2820 to github containing just this commit, if that's easier: http://github.com/etal/biopython/tree/bug2820 Any suggestions for naming the new file? 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 8 17:45:53 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 17:45:53 -0400 Subject: [Biopython-dev] [Bug 2817] Meta-bug for cleanup once we drop Python 2.3 support In-Reply-To: Message-ID: <200905082145.n48Ljr4L023802@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2817 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-08 17:45 EST ------- I've started removing support for Python 2.3 in CVS, including removing all the sets and subprocess special case code. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 8 18:14:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 18:14:36 -0400 Subject: [Biopython-dev] [Bug 2825] New: SeqIO does not successfully parse Genbank records related to whole genome sequencing deposits, as Did not recognise the LOCUS line layout Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2825 Summary: SeqIO does not successfully parse Genbank records related to whole genome sequencing deposits, as Did not recognise the LOCUS line layout Product: Biopython Version: 1.49 Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk Hi I'm using the BioPython distribution 1.49 obtained as a Package using the Ubuntu 9 synaptic package manager. 
The below describes the problem: NCBI has a record type which describes the contents of whole-genome sequencing projects. The record doesn't itself contain sequence, in contrast to most GenBank records. This URL gives an example: http://www.ncbi.nlm.nih.gov/nuccore/162285818 Should the SeqIO parser be able to read this? It cannot. Here is an example: # import modules from Bio import Entrez from Bio import SeqIO # read the record from NCBI, print out the contents. handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818") masterrecord=handle.readlines() for line in masterrecord: print line handle.close() # let's read it again, and try to parse it with SeqIO. handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818") # this line causes the crash seq_record = SeqIO.read(handle, "genbank") handle.close() # fails. the traceback reads """ Traceback (most recent call last): File "bugreport.py", line 25, in seq_record = SeqIO.read(handle, "genbank") File "/var/lib/python-support/python2.6/Bio/SeqIO/__init__.py", line 435, in read first = iterator.next() File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 410, in parse_records record = self.parse(handle) File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 393, in parse if self.feed(handle, consumer) : File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 360, in feed self._feed_first_line(consumer, self.line) File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 907, in _feed_first_line raise ValueError('Did not recognise the LOCUS line layout:\n' + line) ValueError: Did not recognise the LOCUS line layout: LOCUS ABIN01000000 353 rc DNA linear BCT 10-DEC-2007 """ # by contrast, reading one of the constituent genbank records, like this one # http://www.ncbi.nlm.nih.gov/nuccore/162285817 # works correctly; handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285817") seq_record = SeqIO.read(handle, "genbank") handle.close()
print "Successfully loaded record GI=162285817" print seq_record.description -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 8 18:37:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 18:37:47 -0400 Subject: [Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS) Genbank records In-Reply-To: Message-ID: <200905082237.n48MbleU027475@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2825 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|normal |enhancement Summary|SeqIO does not successfully |Parsing whole genome |parse Genbank records |sequencing (WGS) Genbank |related to whole genome |records |sequencing deposits, as Did | |not recognise the LOCUS line| |layout | ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-08 18:37 EST ------- Hi David, This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment. For the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for nucleotides. Here you have "353 rc" (rc for record count), which as our error message says, is unexpected. At the end of the record, there are also WGS and/or WGS_SCAFLD lines to worry about: http://www.ncbi.nlm.nih.gov/Genbank/wgs.html http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html Given these WGS files have no sequence, and no real sequence associated features either, it strikes me that supporting this in Bio.SeqIO is a stretch (these records are not really sequences, nor are they about a sequence). However, Bio.GenBank should perhaps be updated to cope... so I'll leave this bug open for that as a possible enhancement.
Note I have changed the bug title from "SeqIO does not successfully parse Genbank records related to whole genome sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing whole genome sequencing (WGS) Genbank records", and changed the bug priority to an enhancement. What information do you want from this file? In the meantime, I suggest you fetch the record as XML, which you can parse using Bio.Entrez.read() or your XML parser of choice. Peter P.S. This is a shorter way to dump the file to screen in python: >>> from Bio import Entrez >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818") >>> print handle.read() LOCUS ABIN01000000 353 rc DNA linear BCT 10-DEC-2007 DEFINITION Mycobacterium intracellulare ATCC 13950, whole genome shotgun sequencing project. ACCESSION ABIN00000000 VERSION ABIN00000000.1 GI:162285818 DBLINK Project:27955 KEYWORDS WGS. SOURCE Mycobacterium intracellulare ATCC 13950 ORGANISM Mycobacterium intracellulare ATCC 13950 Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales; Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium avium complex (MAC). REFERENCE 1 (bases 1 to 353) AUTHORS Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J., Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M. TITLE Mycobacterium intracellulare Genome Project JOURNAL Unpublished REFERENCE 2 (bases 1 to 353) AUTHORS Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J., Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M. TITLE Direct Submission JOURNAL Submitted (30-NOV-2007) McGill University and Genome Quebec Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec H3A 1A4, Canada COMMENT The Mycobacterium intracellulare ATCC 13950 whole genome shotgun (WGS) project has the project accession ABIN00000000. This version of the project (01) has the accession number ABIN01000000, and consists of sequences ABIN01000001-ABIN01000353. 
The whole genome shotgun sequence was generated by the McGill University and Genome Quebec Innovation Centre using the GS De Novo Assembler from GS-FLX reads. This strain is available from the American Type Culture Collection (www.atcc.org). FEATURES Location/Qualifiers source 1..353 /organism="Mycobacterium intracellulare ATCC 13950" /mol_type="genomic DNA" /strain="ATCC 13950" /serovar="16" /isolation_source="human lymph node" /db_xref="taxon:487521" /note="type strain of Mycobacterium intracellulare ATCC 13950 associated with disease" WGS ABIN01000001-ABIN01000353 // -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 8 19:12:43 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 19:12:43 -0400 Subject: [Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS) Genbank records In-Reply-To: Message-ID: <200905082312.n48NChKL030485@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2825 ------- Comment #2 from david.wyllie at ndm.ox.ac.uk 2009-05-08 19:12 EST ------- Thank you for your help. I just wanted to extract the WGS line, which I'm able to do. (In reply to comment #1) > Hi David, > > This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment. For > the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for > nucleotides. Here you have "353 rc" (rc for record count), which as our error > message says, is unexpected. 
At the end of the record, there are also WGS > and/or WGS_SCAFLD lines to worry about: > > http://www.ncbi.nlm.nih.gov/Genbank/wgs.html > http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html > > Given these WGS files have no sequence, and no real sequence associated > features either, it stikes me that supporting this in Bio.SeqIO is a stretch > (these records are not really sequences, nor are they about a sequence). > > However, Bio.GenBank should perhaps be updated to cope... so I'll leave this > bug open for that as a possible enhancement. Note I have changed the bug title > from "SeqIO does not successfully parse Genbank records related to whole genome > sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing > whole genome sequencing (WGS) Genbank records", and changed the bug priority to > an enhancement. > > What information do you want from this file? In the meantime, I suggest you > fetch the record as XML, which you can parse using Bio.Entrez.read() or your > XML parser of choice. > > Peter > > P.S. This is a shorter way to dump the file to screen in python: > > >>> from Bio import Entrez > >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818") > >>> print handle.read() > LOCUS ABIN01000000 353 rc DNA linear BCT 10-DEC-2007 > DEFINITION Mycobacterium intracellulare ATCC 13950, whole genome shotgun > sequencing project. > ACCESSION ABIN00000000 > VERSION ABIN00000000.1 GI:162285818 > DBLINK Project:27955 > KEYWORDS WGS. > SOURCE Mycobacterium intracellulare ATCC 13950 > ORGANISM Mycobacterium intracellulare ATCC 13950 > Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales; > Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium > avium complex (MAC). > REFERENCE 1 (bases 1 to 353) > AUTHORS Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J., > Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M. 
> TITLE Mycobacterium intracellulare Genome Project > JOURNAL Unpublished > REFERENCE 2 (bases 1 to 353) > AUTHORS Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J., > Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M. > TITLE Direct Submission > JOURNAL Submitted (30-NOV-2007) McGill University and Genome Quebec > Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec > H3A 1A4, Canada > COMMENT The Mycobacterium intracellulare ATCC 13950 whole genome shotgun > (WGS) project has the project accession ABIN00000000. This version > of the project (01) has the accession number ABIN01000000, and > consists of sequences ABIN01000001-ABIN01000353. > The whole genome shotgun sequence was generated by the McGill > University and Genome Quebec Innovation Centre using the GS De Novo > Assembler from GS-FLX reads. This strain is available from the > American Type Culture Collection (www.atcc.org). > FEATURES Location/Qualifiers > source 1..353 > /organism="Mycobacterium intracellulare ATCC 13950" > /mol_type="genomic DNA" > /strain="ATCC 13950" > /serovar="16" > /isolation_source="human lymph node" > /db_xref="taxon:487521" > /note="type strain of Mycobacterium intracellulare ATCC > 13950 > associated with disease" > WGS ABIN01000001-ABIN01000353 > // > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
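For anyone else who only needs the WGS line from such a master record, a plain text scan is enough while the GenBank parser does not handle the "rc" LOCUS line. The sketch below is one possible way to do it and is not part of Biopython; the function name wgs_ranges is made up. It assumes the record text has already been fetched (for example with handle.read() as in comment #1) and that WGS and WGS_SCAFLD appear as the first word on their lines.

```python
def wgs_ranges(genbank_text):
    """Collect the values of WGS and WGS_SCAFLD lines from record text.

    Returns a dict mapping the line key to a list of accession ranges.
    """
    ranges = {}
    for line in genbank_text.splitlines():
        fields = line.split()
        # Only lines whose first word is the key count; "KEYWORDS WGS."
        # and COMMENT text mentioning WGS are ignored.
        if fields and fields[0] in ("WGS", "WGS_SCAFLD"):
            ranges.setdefault(fields[0], []).append(" ".join(fields[1:]))
    return ranges


example = """LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
KEYWORDS    WGS.
WGS         ABIN01000001-ABIN01000353
//
"""
print(wgs_ranges(example))
```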
From bugzilla-daemon at portal.open-bio.org Sat May 9 07:59:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 9 May 2009 07:59:32 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905091159.n49BxWpM015484@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #30 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-09 07:59 EST ------- I've got test_Mafft_tool.py working on one Linux machine using MAFFT v6.626b (2009/03/16) installed from source. However, test_Mafft_tool.py fails on another Linux machine using MAFFT v6.240 (2007/04/04) installed using the distribution's package, in this case Ubuntu Jaunty: http://packages.ubuntu.com/jaunty/mafft Note that the next version of Ubuntu currently also uses the same old package: http://packages.ubuntu.com/karmic/mafft As does Debian unstable: http://packages.debian.org/unstable/science/mafft From trying mafft v6.240 by hand at the command line, it never seems to actually print anything to the console. Either the MAFFT API changed (which doesn't seem to be the case), or the version Ubuntu installed on this machine is broken. This could be due to something else like the version of awk or gcc (guesses based on the MAFFT change log): http://align.bmr.kyushu-u.ac.jp/mafft/software/ Note that the latest version is now MAFFT 6.704, so we should try that too. If I am right about the current Ubuntu/Debian package being broken, we should get in touch with them about updating it... otherwise we can look forward to bug reports about our wrapper and/or test_Mafft_tool.py failing. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sat May 9 08:31:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 9 May 2009 08:31:55 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200905091231.n49CVtUj017919@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-09 08:31 EST ------- (In reply to comment #8) > I have something that works on both Py2.5 and Py2.6 now: > http://github.com/etal/biopython/tree/pdbtidy Would it be easy for you to test your code on Python 2.4? I can probably do that but not right now... I would prefer to avoid the extra file by writing this test as part of test_PDB_unit.py - but the "with" statement isn't valid syntax on Python 2.4, although it can be used on Python 2.5 via: from __future__ import with_statement Could you re-write this to avoid the with statement? > Also, apparently tests are run in alphabetical order, ... Yes, that is expected. > ... and Exposure was jumping ahead of PDBExceptionTest. I renamed > PDBExceptionTest to ExceptionTest to restore the natural order of > things and stop setting off the warnings prematurely. Maybe test > suites with multiple TestCase classes should be arranged alphabetically > in the code to avoid confusion in the future. Ideally the unit tests should work in any order - and this is generally a reasonable assumption, as they should be independent. Having some carefully named unit tests will only hide the ordering problem (which is due to the global state information in the warnings module). At the very least, we should probably have comments in the code about this (to avoid issues in the future) and maybe use an eye-catching name like AAAAA which should always come first. 
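One possible shape for a test helper that avoids the with statement is sketched below. It is only an illustration, not the code on either branch: the helper name capture_warnings is made up, it is written and checked against Python 3, and it works by temporarily swapping warnings.showwarning and the filter list inside try/finally so the global state is put back afterwards.

```python
import warnings


def capture_warnings(func, *args, **kwargs):
    """Call func and return (result, captured warnings) without using with.

    captured is a list of (category, message text) tuples.
    """
    captured = []

    def record(message, category, filename, lineno, file=None, line=None):
        captured.append((category, str(message)))

    original_showwarning = warnings.showwarning
    original_filters = warnings.filters[:]
    warnings.showwarning = record
    # Show every warning, even ones that were already issued once.
    warnings.simplefilter("always")
    try:
        result = func(*args, **kwargs)
    finally:
        # Restore the module level state whatever happened in func.
        warnings.showwarning = original_showwarning
        warnings.filters[:] = original_filters
    return result, captured


def noisy():
    warnings.warn("example warning", UserWarning)
    return 42


print(capture_warnings(noisy))
```

Because the helper resets the filters each time it is called, tests built on it should be less sensitive to the order in which the test classes happen to run.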
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sat May 9 09:06:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 9 May 2009 14:06:15 +0100 Subject: [Biopython-dev] PhyloXML read/parse functions and handles Message-ID: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com> Hi Eric, Are you happy to have feedback on your PhyloXML code in public? In this case I wanted to make a fairly general observation about parsing files using handles, so I have cc'd the dev list. I just had a look at the stub in Bio/PhyloXML/__init__.py and Bio/PhyloXML/Parser.py on your github branch, http://github.com/etal/biopython/tree/phyloxml The convention we are following in Biopython for parsing functions is as follows: read(handle, ...) - returns a single object (e.g. a tree in your case) parse(handle, ...) - returns an iterator (e.g. returning multiple trees) [This naming convention is arbitrary, but we should try to stick to it in all new parsers for consistency.] In Bio/PhyloXML/Parser.py you have a parse() sub function which according to the comment appears to return a single tree. If so, this should be a read() function instead of a parse() function. You seem to have a read() stub function in Bio/PhyloXML/__init__.py which returns a single tree (good), but takes a (zip) filename (not a handle - bad). Taking just a filename prevents using a whole range of handle objects as input - e.g. StringIO handles, URL handles, piped output from a command line tool etc. This flexibility is why we focus on dealing with handles for parsers. On a related point, you should leave unzipping the file to the user - this is not specific to dealing with XML tree files. Plus, in addition to zip files (i.e. pkzip/winzip format), there are other compressed fileformats to consider, such as tarballs. 
They too can be opened and decompressed on the fly as a handle (e.g. see the gzip Python library). By taking a handle as the input your parser can then be used with any of these input sources. Peter P.S. Finally, a more general note about a possible "Bio.TreeIO" module. For simple Newick trees, a single file can contain one or more trees (e.g. from bootstrapping). A tree can be split over multiple lines (but may be one long line), but multiple trees can be split up because they should all have a semicolon terminator. For Nexus files, I'm not sure off hand if there can be more than one tree. If you are going to use the Tree objects from Bio.Nexus, then we could provide a "Bio.TreeIO" module with read/parse/write methods coping with "newick", "nexus", "phyloxml" formats, all using the same tree objects. From bugzilla-daemon at portal.open-bio.org Sat May 9 12:40:27 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 9 May 2009 12:40:27 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905091640.n49GeRvY002521@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #31 from cymon.cox at gmail.com 2009-05-09 12:40 EST ------- (In reply to comment #30) > I've got test_Mafft_tool.py working on one Linux machine using MAFFT v6.626b > (2009/03/16) installed from source. That was my reference installation when writing the command line tool (on Jaunty/RHE 5.3).
> However, test_Mafft_tool.py fails on another Linux machine using MAFFT v6.240 > (2007/04/04) installed using the distribution's package, in this case Ubuntu > Jaunty: > http://packages.ubuntu.com/jaunty/mafft > > Note that the next version of Ubuntu currently also uses the same old package: > http://packages.ubuntu.com/karmic/mafft > > As does Debian unstable: > http://packages.debian.org/unstable/science/mafft > > From trying mafft v6.240 by hand at the command line, it never seems to > actually print anything to the console. Either the MAFFT API changed (which > doesn't seem to be the case), or the version Ubuntu installed on this machine > is broken. This could be due to something else like the version of awk or gcc > (guesses based on the MAFFT change log): > http://align.bmr.kyushu-u.ac.jp/mafft/software/ Hadn't tried the Ubuntu package... On the upside, the Muscle3.7 package installed from Ubuntu passes our tests, whereas the source compiles but core-dumps. Similarly, ProbCons1.2 won't compile but the Ubuntu package looks good (havent written the tests yet). > Note that the latest version is now MAFFT 6.704, so we should try that too. If > I am right about the current Ubuntu/Debian package being broken, we should get > in touch with them about updating it... otherwise we can look forward to bug > reports about our wrapper and/or test_Mafft_tool.py failing. Built from source on Jaunty; it passes our tests. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From eric.talevich at gmail.com Sun May 10 01:22:46 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 9 May 2009 22:22:46 -0700 Subject: [Biopython-dev] PhyloXML read/parse functions and handles In-Reply-To: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com> References: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com> Message-ID: <3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com> On Sat, May 9, 2009 at 6:06 AM, Peter wrote: > Hi Eric, > > Are you happy to have feedback on your PhyloXML code in public? Sure am! I was just getting around to drafting up some questions for biopython-dev, but I'm glad to receive some preemptive advice. I just had a look at the stub in Bio/PhyloXML/__init__.py and > Bio/PhyloXML/Parser.py on your github branch, > http://github.com/etal/biopython/tree/phyloxml > > The convention we are following in Biopython for parsing functions is > as follows: > read(handle, ...) - returns a single object (e.g. a tree in your case) > parse(handle, ...) - returns an iterator (e.g. returning multiple trees) > > I noticed that; I'll change the Bio.PhyloXML.Parser.parse() stub to read() and have it behave as expected. The function currently allows either filenames or file handles as the source because ElementTree.iterparse() also accepts either object as a source. The read() function could "assert not isinstance(infile, str)", I guess... The existing Java implementation in Forester/ATV has even more magic, automatically performing Zip extraction if the given filename ends with '.zip'. Since this looks like it will be a pretty common use case, at least for big files, I thought it would be nice to also offer a wrapper function that takes a filename and does the Right Thing -- that's what __init__.read() does currently. Is there a precedent for this in Biopython? 
The name should probably be something different; in the pdbtidy branch I used load(), to match the Pickle module, since the wrapper function does more than just parse or read a file. So how about: from Bio import PhyloXML handle = open('somefile', 'r') # file-like object from any source tree = PhyloXML.read(handle) Equivalent to: from Bio import PhyloXML tree = PhyloXML.load('somefile') # DTRT for xml, zip, gz, ...? Or, to be explicit, offer a read_zip or load_zip function. I'd leave well enough alone, but the incantation to extract a character stream from a single zipped file is kind of unintuitive, and one of the three example files on phyloxml.org is already zipped. (I should really ask Christian Zmasek about this to see if that's a real convention or not.) P.S. Finally, a more general note about a possible "Bio.TreeIO" > module. For simple Newick trees, a single file can contain one or more > trees (e.g. from bootstrapping). A tree can be split over multiple > lines (but may be one long line), but multiple trees can be split up > because they should all have a semicolon terminator. For Nexus files, > I'm not sure off hand if there can be more than one tree. If you are > going to use the Tree objects from Bio.Nexus, then we could provide a > "Bio.TreeIO" module with read/parse/write methods coping with > "newick", "nexus", "phyloxml" formats, all using the same tree > objects. > OK, I'll give it a try. Brad recommended that I just get a simple PhyloXML parser working first before attempting integration, but if some of Bio.Nexus can be reused in that process, great. I'm about to go dark from the end of this week until 3/31 (getting married, yaknow), but I'll fix all this code when I get back and have access to git again. 
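To illustrate the handle-based convention under discussion, here is a small sketch using only the standard library. The read() function below is a hypothetical toy (it just returns the root element of one XML document, not a real PhyloXML tree object), and the XML snippet is made up and has no namespace; the point is only that one function taking a handle covers plain, gzipped and zipped input.

```python
import gzip
import io
import zipfile
import xml.etree.ElementTree as ET

# Made-up, namespace-free stand-in for a PhyloXML document.
XML = b"<phyloxml><phylogeny><name>demo</name></phylogeny></phyloxml>"


def read(handle):
    """Hypothetical reader: parse a single XML document from an open handle."""
    return ET.parse(handle).getroot()


# 1. An in-memory handle (the same would work for an open file or URL handle).
root_plain = read(io.BytesIO(XML))

# 2. A gzip handle, decompressed on the fly.
gz_buffer = io.BytesIO()
with gzip.GzipFile(fileobj=gz_buffer, mode="wb") as gz_out:
    gz_out.write(XML)
gz_buffer.seek(0)
root_gz = read(gzip.GzipFile(fileobj=gz_buffer, mode="rb"))

# 3. A member of a zip archive; the caller decides which member to use.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("tree.xml", XML)
with zipfile.ZipFile(zip_buffer) as archive:
    first_member = archive.namelist()[0]  # assumption: the first file is the tree
    with archive.open(first_member) as member:
        root_zip = read(member)
```

The zip case needs an explicit choice of member, which is the kind of extra assumption a filename-only load() wrapper would have to make on the user's behalf.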
Thanks for your help, Eric From biopython at maubp.freeserve.co.uk Sun May 10 05:22:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 10 May 2009 10:22:21 +0100 Subject: [Biopython-dev] PhyloXML read/parse functions and handles In-Reply-To: <3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com> References: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com> <3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com> Message-ID: <320fb6e00905100222n22b7670dre26f9368726fce68@mail.gmail.com> On Sun, May 10, 2009 at 6:22 AM, Eric Talevich wrote: > > The function currently allows either filenames or file handles as the source > because ElementTree.iterparse() also accepts either object as a source. The > read() function could "assert not isinstance(infile, str)", I guess... Interesting - ReportLab also allows filenames or handles. If this truly is a widespread or growing trend in Python libraries, maybe we should do this as well. > The existing Java implementation in Forester/ATV has even more magic, > automatically performing Zip extraction if the given filename ends with > '.zip'. Since this looks like it will be a pretty common use case, at least > for big files, I thought it would be nice to also offer a wrapper function > that takes a filename and does the Right Thing -- that's what > __init__.read() does currently. Is there a precedent for this in Biopython? Note that Bio.Nexus does this already, making it a bit inconsistent with the rest of Biopython. I guess no one noticed or commented back when it was added. > The name should probably be something different; in the pdbtidy branch I > used load(), to match the Pickle module, since the wrapper function does > more than just parse or read a file.
> > So how about: > > from Bio import PhyloXML > handle = open('somefile', 'r') # file-like object from any source > tree = PhyloXML.read(handle) > > Equivalent to: > > from Bio import PhyloXML > tree = PhyloXML.load('somefile') # DTRT for xml, zip, gz, ...? > > Or, to be explicit, offer a read_zip or load_zip function. I prefer the more explicit read_zip idea, you would also have an optional argument for the filename within the zip file. However, I'm not yet convinced we need this function. > I'd leave well enough alone, but the incantation to extract a character > stream from a single zipped file is kind of unintuitive, and one of the > three example files on phyloxml.org is already zipped. (I should really > ask Christian Zmasek about this to see if that's a real convention or > not.) Do you want to find out if this really is a phyloxml.org convention first? If this is their convention, it surprises me they didn't go for .gz files, which in my experience are more widely used in Bioinformatics (e.g. at the NCBI and PDB). These are supported cross platform and hold one single file (often a tarred file containing multiple files). A zip file can hold multiple files, which means you have to make extra assumptions (e.g. you are using the first file in your code). >> P.S. Finally, a more general note about a possible "Bio.TreeIO" >> module. For simple Newick trees, a single file can contain one or more >> trees (e.g. from bootstrapping). A tree can be split over multiple >> lines (but may be one long line), but multiple trees can be split up >> because they should all have a semicolon terminator. For Nexus files, >> I'm not sure off hand if there can be more than one tree. If you are >> going to use the Tree objects from Bio.Nexus, then we could provide a >> "Bio.TreeIO" module with read/parse/write methods coping with >> "newick", "nexus", "phyloxml" formats, all using the same tree >> objects. >> > > OK, I'll give it a try.
Brad recommended that I just get a simple PhyloXML > parser working first before attempting integration, but if some of Bio.Nexus > can be reused in that process, great. Brad is right - getting a simple PhyloXML parser working is the first step. It would be sensible to look at the Bio.Nexus tree structure though. > I'm about to go dark from the end of this week until 3/31 (getting > married, yaknow), but I'll fix all this code when I get back and have > access to git again. Congratulations - it looks like you've got a proper break scheduled as well :) Peter From bugzilla-daemon at portal.open-bio.org Sun May 10 09:50:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 10 May 2009 09:50:50 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200905101350.n4ADoo7x001186@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #13 from eric.talevich at gmail.com 2009-05-10 09:50 EST ------- (In reply to comment #12) > Would it be easy for you to test your code on Python 2.4? I can probably do > that but not right now... Yes, I can do that, but only on Linux. I don't think there's anything platform-specific here, though. > I would prefer to avoid the extra file by writing this test as part of > test_PDB_unit.py - but the "with" statement isn't valid syntax on Python 2.4, > although it can be used on Python 2.5 via: > from __future__ import with_statement > > Could you re-write this to avoid the with statement? I think the with statement is isomorphic to a try-except-finally arrangement, calling the context manager's __enter__ method in the try block and __exit__ in the finally block. I'll look at the source code of the warnings module and maybe just copy a substantial chunk of it into this unit test (assuming it's pure Python). That might make it possible to support Py2.4, too.
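As a rough illustration of that rewrite, the sketch below drives a context manager by hand instead of using the with statement. The helper name collect_warnings is hypothetical, and it relies on warnings.catch_warnings, which a Python 2.4 version would have to replace with copied code. Note that __enter__ is called just before the try block rather than inside it.

```python
import warnings


def collect_warnings(func):
    """Call func() and return the messages of any warnings it raised.

    Hypothetical helper: a hand-written equivalent of
        with warnings.catch_warnings(record=True) as caught: ...
    """
    manager = warnings.catch_warnings(record=True)
    caught = manager.__enter__()  # saves the current filter state
    try:
        warnings.simplefilter("always")
        func()
    finally:
        # catch_warnings ignores the exception details passed to __exit__,
        # so passing None three times is enough for this simplified form.
        manager.__exit__(None, None, None)
    return [str(w.message) for w in caught]
```

A full expansion of the with statement also passes the active exception to __exit__ and lets it suppress the error; that is left out here because this particular context manager does not use it.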
> Ideally the unit tests should work in any order - and this is generally a > reasonable assumption, as they should be independent. Having some carefully > named unit tests will only hide the ordering problem (which is due to the > global state information in the warnings module). At the very least, we should > probably have comments in the code about this (to avoid issues in the future) > and maybe use an eye-catching name like AAAAA which should always come first. > Agreed. I'll tinker with it some more to see what can be improved here. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon May 11 08:40:49 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 11 May 2009 08:40:49 -0400 Subject: [Biopython-dev] [Bug 2783] Using alternative start codons in Bio.Seq translate method/function In-Reply-To: Message-ID: <200905111240.n4BCenqD006754@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2783 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-11 08:40 EST ------- Created an attachment (id=1298) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1298&action=view) Patch for Bio/Seq.py to support complete CDS translation with non-standard start codons I've recently been doing CDS translations for viral/bacterial genes with alternative start codons - and would like to fix this limitation in Biopython, rather than having to hack around it. On Bug 2381, comment #14, I wrote: > For comparison, the following is copied from the BioPerl documentation about > their sequence object's translate method. It would be nice to follow some of > the same naming conventions for any optional arguments. 
> > http://www.bioperl.org/Core/Latest/bptutorial.html#iii_3_1_manipulating_sequence_data_with_seq_methods > > If we want to translate full coding regions (CDS) the way major nucleotide > databanks EMBL, GenBank and DDBJ do it, the translate() method has to perform > more checks. Specifically, translate() needs to confirm that the sequence has > appropriate start and terminator codons at the very beginning and the very end > of the sequence and that there are no terminator codons present within the > sequence in frame 0. In addition, if the genetic code being used has an > atypical (non-ATG) start codon, the translate() method needs to convert the > initial amino acid to methionine. These checks and conversions are triggered > by setting ``complete'' to 1: > > $prot_obj = $my_seq_object->translate(-complete => 1); > On Bug 2381, comment #51, Leighton wrote: > In terms of nomenclature: > > The default behaviour of translate() as Peter proposed: read through in-frame > and translate with the appropriate codon table - is fine in nearly all > circumstances. Most other circumstances are covered by stopping at the first > in-frame stop codon, which Peter has implemented, and is an option we all seem > to agree on. > > Biologically-speaking, this behaviour is not always correct for CDS in > prokaryotes, where alternative start codons may occur a significant minority > of the time. These will be mistranslated if no provision is made for them. I > think a useful biological sequence object should at least try to mimic actual > biology, so we should provide an option to handle this. > > We should not assume that a sequence is a CDS unless it is specified by the > user. It seems reasonable to me that the term 'cds' should occur in any such > argument from the user. 
> > We have at least two options for how to proceed with a CDS: i) we can provide > a strict CDS-type translation, which requires confirmation that the sequence > is, in fact, a CDS; ii) we can provide a weak CDS-type translation, which only > modifies the way the start codon is translated. In both cases, behaviour is > specific to CDS, and so having 'cds' in the argument name *somewhere* seems > obvious, and entirely reasonable. Leighton's option (ii) is start codon only modification. This is what I implemented in the patch on comment 1 (attachment 1259). We haven't agreed on a good name for this - which is partly why I went back to revisit the alternative: Leighton's option (i) is strict CDS-type translation. As Leighton suggests, having "cds" in the argument name here makes sense. Regarding the BioPerl argument name for this functionality, "complete", on Bug 2381 comment 19, Martin wrote: > The "complete" is a cryptic naming, I wouldn't be fond of it. > I think you are both right about the naming. Would complete_cds=True be clear? In fact, I quite like the idea of using cds=True which is short and also fairly clear. This patch adds a complete_cds=Boolean argument to the Bio.Seq translate methods and function, which should act like the BioPerl equivalent. It includes doctests showing the new functionality. I would like to use either of these approaches in Biopython - but not both ;) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
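The checks described for a strict CDS-type translation can be sketched in plain Python as below. This is not the attached patch: the function names are hypothetical, the codon table is the standard one, the start codon set (ATG, GTG, TTG) is an assumed subset of the bacterial table, and dropping the terminal stop symbol from the result is also an assumption.

```python
BASES = "TCAG"
# Standard genetic code in NCBI order (first base slowest, third fastest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
# Assumption: a subset of the alternative start codons seen in prokaryotes.
START_CODONS = ("ATG", "GTG", "TTG")


def translate(seq):
    """Plain read-through translation in frame 0, stops shown as '*'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def translate_cds(seq, start_codons=START_CODONS):
    """Strict CDS translation: validate the sequence, then force an initial M."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("Sequence length is not a multiple of three")
    if seq[:3] not in start_codons:
        raise ValueError("First codon %r is not a start codon" % seq[:3])
    protein = translate(seq)
    if not protein.endswith("*"):
        raise ValueError("Final codon is not a stop codon")
    if "*" in protein[:-1]:
        raise ValueError("Extra in-frame stop codon found")
    # An alternative start codon is still translated as methionine.
    return "M" + protein[1:-1]
```

For example, translate("GTGAAATAA") reads the first codon as valine, while translate_cds("GTGAAATAA") converts it to methionine and removes the stop.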
From bugzilla-daemon at portal.open-bio.org Mon May 11 16:00:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 11 May 2009 16:00:29 -0400 Subject: [Biopython-dev] [Bug 2826] New: when creating a de-novo SeqRecord, the dbxrefs are not written by SeqIO.write Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2826 Summary: when creating a de-novo SeqRecord, the dbxrefs are not written by SeqIO.write Product: Biopython Version: 1.49 Platform: All OS/Version: Linux Status: NEW Severity: minor Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk

Hi

when creating a SeqRecord de novo, the dbxrefs are not written by SeqIO.write. Is this the intended behaviour?

here is an example:

# example script
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# list to hold output records
outlist=[]

# ofh is the output file handle
ofh = open("/home/dwyllie/temporary.gbk","w")

# example of de novo creation of SeqRecord object from url:
# http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html
rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT", generic_protein), \
                id="NP_418483.1", name="b4059", description="ssDNA-binding protein", \
                dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"])
print rec
outlist.append(rec)
count = SeqIO.write(outlist, ofh, "genbank")
ofh.close()
# end of script

OUTPUT:

ID: NP_418483.1
Name: b4059
Description: ssDNA-binding protein
Database cross-references: ASAP:13298, GI:16131885, GeneID:948570
Number of features: 0
Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT', ProteinAlphabet())

Contents of temporary.gbk:

LOCUS       b4059                     46 bp                   UNK 01-JAN-1980
DEFINITION  ssDNA-binding protein
ACCESSION   NP_418483
VERSION     NP_418483.1
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 MASRGVNKVI LVGNLGQDPE VRYMPNGGAV ANITLATSES WRDKAT
//

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon May 11 16:29:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 11 May 2009 16:29:02 -0400 Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank by SeqIO In-Reply-To: Message-ID: <200905112029.n4BKT2x0024871@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2826 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|when creating a de-novo     |SeqRecord dbxrefs not
                   |SeqRecord, the dbxrefs are  |written to GenBank by SeqIO
                   |not written by SeqIO.write  |

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-11 16:29 EST -------

Hi David,

Thank you for another interesting bug report. See here for what the NCBI uses in a GenPept file for this example protein, NP_418483.1 http://www.ncbi.nlm.nih.gov/protein/16131885

The ASAP and GeneID numbers are not recorded at the sequence level - there is nowhere in the GenBank file format to put them. They are however recorded within a CDS feature on the link above. So, if you want these recorded, you'd have to create a SeqFeature with the information (you can't use the SeqRecord's dbxrefs list).

The GI number would get written, but due to an anomaly in the GenBank parser this is currently stored in the annotations dictionary under the key "gi", so this is where the GenBank writer looks for this. We should probably switch to recording this in the dbxrefs as "gi:12345" as well/instead, and look for this GI number there instead/as well.
Currently when parsing GenBank files, the only thing stored in the SeqRecord's dbxref list is a PROJECT line cross reference (see Bug 2225). Looking at the code, we don't currently record that - we should. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon May 11 18:55:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 11 May 2009 18:55:21 -0400 Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank by SeqIO In-Reply-To: Message-ID: <200905112255.n4BMtLFc004295@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2826 ------- Comment #2 from david.wyllie at ndm.ox.ac.uk 2009-05-11 18:55 EST ------- Thank you. I'm new to BioPython. The goal was to take some whole-genome sequence (which isn't in Genbank) and attach a taxon to it, in order that it be written to a BioSQL database. Other records in the BioSQL database derive from NCBI and so have taxon_ids, so the additional WGS being in a similar format would make things simpler. Thank you very much for all your assistance. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From cy at cymon.org Tue May 12 07:07:59 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 12 May 2009 12:07:59 +0100 Subject: [Biopython-dev] Clustal alignment format header line Message-ID: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com> Both Muscle (-clw) and Probcons (-clustalw) output a programme specific header line for the clustal format alignment: "MUSCLE (3.7) multiple sequence alignment AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMAGVLEAR etc" "PROBCONS version 1.12 multiple sequence alignment AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMA " Bio.AlignIO will not read these alignments Bio/AlignIO/ClustalIO.py:94 if line[:7] != 'CLUSTAL': raise ValueError("Did not find CLUSTAL header") Muscle does have a -clwstrict flag but ProbCons doesn't. Would it be a good idea to relax the header parsing? C. -- From biopython at maubp.freeserve.co.uk Tue May 12 11:28:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 May 2009 16:28:35 +0100 Subject: [Biopython-dev] Clustal alignment format header line In-Reply-To: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com> References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com> Message-ID: <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com> On Tue, May 12, 2009 at 12:07 PM, Cymon Cox wrote: > Both Muscle (-clw) and Probcons (-clustalw) output a programme specific > header line for the clustal format alignment: > > "MUSCLE (3.7) multiple sequence alignment > > > AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMAGVLEAR etc" > > "PROBCONS version 1.12 multiple sequence alignment > > AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMA > > " > > Bio.AlignIO will not read these alignments > Bio/AlignIO/ClustalIO.py:94 > if line[:7] != 'CLUSTAL': > raise ValueError("Did not find CLUSTAL header") > > Muscle does have a -clwstrict flag but ProbCons doesn't. > > Would it be a good idea to relax the header parsing? > > C. Maybe.
Up until now the only example of this I had personally come across was MUSCLE, but they helpfully provide the -clwstrict argument so the issue wasn't important. There are also of course the official variants like: CLUSTAL W (1.81) multiple sequence alignment CLUSTAL 2.0.9 multiple sequence alignment How would you code this? A flexible option would be to take anything where the first line ends with "multiple sequence alignment", but this risks letting a lot of non-clustal files through which will then (hopefully) fail, but probably with a much more cryptic error message. A white list of safe variants like "MUSCLE" and "PROBCONS" would be safest. Also I have a vague memory of some tool using something like "CLUSTAL ... from ToolX" but I don't recall the details. Peter From cy at cymon.org Tue May 12 11:43:47 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 12 May 2009 16:43:47 +0100 Subject: [Biopython-dev] Clustal alignment format header line In-Reply-To: <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com> References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com> <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com> Message-ID: <7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com> 2009/5/12 Peter > On Tue, May 12, 2009 at 12:07 PM, Cymon Cox wrote: > > Both Muscle (-clw) and Probcons (-clustalw) output a programme specific > > header line for the clustal format alignment: > > > > "MUSCLE (3.7) multiple sequence alignment > > > > > > AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMAGVLEAR etc" > > > > "PROBCONS version 1.12 multiple sequence alignment > > > > AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMA > > > > " > > > > Bio.AlignIO will not read these alignments > > Bio/AlignIO/ClustalIO.py:94 > > if line[:7] != 'CLUSTAL': > > raise ValueError("Did not find CLUSTAL header") > > > > Muscle does have a -clwstrict flag but ProbCons doesn't. > > > > Would it be a good idea to relax the header parsing? > > > > C. > > Maybe.
Up until now the only example of this I had personally come > across was MUSCLE, but they helpfully provide the -clwstrict argument > so the issue wasn't important. > > There are also of course the official variants like: > > CLUSTAL W (1.81) multiple sequence alignment > CLUSTAL 2.0.9 multiple sequence alignment > > How would you code this? A flexible option would be to take anything > where the first line ends with "multiple sequence alignment", but this > risks letting a lot of non-clustal files through which will then > (hopefully) fail, but probably with a much more cryptic error message. > A white list of safe variants like "MUSCLE" and "PROBCONS" would be > safest. > > Also I have a vague memory of some tool using something like "CLUSTAL > ... from ToolX" but I don't recall the details. T-COFFEE for one: "CLUSTAL FORMAT for T-COFFEE Version_6.92 [http://www.tcoffee.org] [MODE: ], CPU=0.00 sec, SCORE=100, Nseq=2, Len=601" Is it so bad to let it fail on the structure of the data - effectively ignore the header? Maybe have a general "this doesn't look like clustal formatted data" error based on the data structure... C. -- From biopython at maubp.freeserve.co.uk Tue May 12 12:05:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 May 2009 17:05:15 +0100 Subject: [Biopython-dev] Loading SeqRecords into BioSQL with NCBI taxon ID Message-ID: <320fb6e00905120905l3d0e31b2p2a3d92f61096cbd5@mail.gmail.com> Over on Bug 2826, David wrote: http://bugzilla.open-bio.org/show_bug.cgi?id=2826#c2 > Thank you. I'm new to BioPython. > > The goal was to take some whole-genome sequence (which isn't in Genbank) and > attach a taxon to it, in order that it be written to a BioSQL database. You've talked about trying to parse WGS GenBank files on Bug 2825 but presumably if this new data isn't in GenBank, it is in another format. What format is your whole-genome sequence? FASTA or something simple?
> Other records in the BioSQL database derive from NCBI and so have taxon_ids,
> so the additional WGS being in a similar format would make things simpler.

I see. Basically you need to import a SeqRecord into BioSQL with an NCBI taxon ID. You don't need to write out a GenBank file to do this. First create the SeqRecord, e.g.

    from Bio import SeqIO
    record = SeqIO.read(handle, format, alphabet)

There are now two options - because the BioSQL loader will look for the NCBI taxon ID in two places:

(Option 1) Record the NCBI taxon ID in the SeqRecord's annotation dictionary under the "ncbi_taxid" key. This should work (untested):

    record.annotations["ncbi_taxid"] = 12345
    #or single element list, [12345]

(Option 2) Mimic a SeqRecord from parsing a GenBank file with a source feature containing the taxon ID. This should work (untested):

    #Create the SeqRecord:
    record = SeqIO.read(handle, format, alphabet)
    #Create the source features:
    from Bio.SeqFeature import SeqFeature, FeatureLocation
    f = SeqFeature(FeatureLocation(0, len(record)), strand=+1, type="source")
    f.qualifiers["db_xref"] = ["taxon:12345"]
    record.features = [f] #or insert at start

If you don't really have a sequence, this second approach doesn't make so much sense.

[Arguably there could be a third option via the dbxref's list]

Then in either case, load the modified SeqRecord into the database. You may want to pre-load the NCBI taxonomy, see http://www.biopython.org/wiki/BioSQL

Alternatively, using Biopython 1.49+ you can have this fetched from Entrez on demand with the fetch_NCBI_taxonomy=True option. The BioSQL wiki page needs updating on this topic.
Peter From bugzilla-daemon at portal.open-bio.org Tue May 12 12:11:43 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 12 May 2009 12:11:43 -0400 Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank by SeqIO In-Reply-To: Message-ID: <200905121611.n4CGBhrY001864@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2826 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-12 12:11 EST ------- (In reply to comment #2) > Thank you. I'm new to BioPython. > > The goal was to take some whole-genome sequence (which isn't in Genbank) and > attach a taxon to it, in order that it be written to a BioSQL database. For this example you don't need to write out a GenBank file at all (which is what this bug was about). See my email on the mailing list for details: http://lists.open-bio.org/pipermail/biopython/2009-May/005154.html and sent in error to the dev list: http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006028.html I am leaving this bug open for relevant dbxrefs entries not currently recorded when writing GenBank files with Bio.SeqIO (GI number which goes on the VERSION line, and genome projects on the PROJECT / DBLINK line). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Tue May 12 12:16:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 May 2009 17:16:35 +0100 Subject: [Biopython-dev] Clustal alignment format header line In-Reply-To: <7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com> References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com> <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com> <7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com> Message-ID: <320fb6e00905120916p3db7c003kf6eef581cbb4c93b@mail.gmail.com> On Tue, May 12, 2009 at 4:43 PM, Cymon Cox wrote: >Peter wrote: >> Also I have a vague memory of some tool using something like "CLUSTAL >> ... from ToolX" but I don't recall the details. > > T-COFFEE for one: > "CLUSTAL FORMAT for T-COFFEE Version_6.92 [http://www.tcoffee.org] [MODE: > ], CPU=0.00 sec, SCORE=100, Nseq=2, Len=601" Yes - that is almost certainly the example I was thinking of. > Is it so bad to let it fail on the structure of the data - effectively > ignore the header? Maybe have a general "this doesnt look like clustal > formatted data" error based on the data structure... Some of the current error messages are a little cryptic to an end user, I guess they could have "Are you sure this is a Clustal format file?" appended to them. I'd be happy with a whitelist of variant headers, i.e. must start with "CLUSTAL", "MUSCLE" or "PROBCONS" (assuming these tools don't write their own file formats which also start that way!). If people find new cases and report them, it also gives us notice about another tool we may want to include in our command line wrappers, and/or obtain sample output files for the unit tests. Peter From biopython at maubp.freeserve.co.uk Tue May 12 13:14:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 May 2009 18:14:27 +0100 Subject: [Biopython-dev] history on github - where are the tags? 
In-Reply-To: <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
 <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
 <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
 <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
 <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
 <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
 <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
 <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
 <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
 <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
Message-ID: <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>

On Tue, Apr 28, 2009 at 6:50 PM, Bartek Wilczynski wrote:
> On Tue, Apr 28, 2009 at 7:45 PM, Peter wrote:
>> I take that back - I added an email address of just "peterc" to my
>> github account (it seems they don't do any validation, perhaps for
>> this very reason?). This had no immediate effect, but one day later
>> and all my CVS commits are now shown with my photo in github. Neat -
>
> great

That seems to have stopped working now - no idea why, "peterc" is
still listed as one of my email addresses on my github account, but my
github account is no longer linked to commits in Biopython. Odd.

Do you think it would be straightforward for your CVS to git
conversion to map the CVS usernames to github usernames for future
commits (so as not to alter the currently published history)?
Peter

From bugzilla-daemon at portal.open-bio.org Tue May 12 13:33:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 12 May 2009 13:33:03 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: 
Message-ID: <200905121733.n4CHX3jK009739@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815

------- Comment #32 from cymon.cox at gmail.com 2009-05-12 13:33 EST -------
Added PROBCONS and TCOFFEE command line interfaces and unittests.

The TCOFFEE command line implements a very restricted set of options
(just those Brad attached).

Also added a white list of known headers to AlignIO/ClustalwIO.py:97 - the
PROBCONS unittest will fail without this alteration.

On http://github.com/cymon/biopython-github-master/tree/applic-int

C.

From bartek at rezolwenta.eu.org Tue May 12 14:23:18 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 12 May 2009 20:23:18 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
 <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
 <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
 <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
 <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
 <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
 <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
 <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
 <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
 <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
Message-ID: <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>

On Tue, May 12, 2009 at 7:14 PM, Peter wrote:
> That seems to have stopped working now - no idea why, "peterc" is
> still listed as one of my email addresses on my github account, but my
> github account is no longer linked to commits in Biopython. Odd.

It seems to be OK again. Maybe it was temporary?

>
> Do you think it would be straightforward for your CVS to git
> conversion to map the CVS usernames to github usernames for future
> commits (so as not to alter the currently published history)?
>

It would be straightforward to add a mapping to the conversion, but I
think it would affect the whole history...

I was thinking that the mapping was going to change when we finally
switch to git. Then it would be a natural course of events...
Otherwise, we would have another step in our transition. Whether it's
worth doing depends on how long we expect to be in the transition
between CVS and git.
cheers Bartek From bugzilla-daemon at portal.open-bio.org Tue May 12 14:44:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 12 May 2009 14:44:09 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905121844.n4CIi9sb017010@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1296 is|0 |1 obsolete| | ------- Comment #33 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-12 14:44 EST ------- (From update of attachment 1296) (In reply to comment #32) > Added PROBCONS and TCOFFEE command line interfaces and unittests. > > The TCOFFEE commadline implements a very restricted set of options > (just those Brad attached). > > Also added white list of known headers to AlignIO/ClustalwIO.py:97 - the > PROBSCONS unittest will fail without this alteration. > > On http://github.com/cymon/biopython-github-master/tree/applic-int Thank you Cymon and Brad - those are now checked in, more or less as is. I did tweak Bio/AlignIO/ClustalwIO.py a little bit. Also, TCoffee says it can be installed on Windows using Cygwin - we should try that at some point ;) Note for the TCoffee suite we could also consider adding xpresso, 3dcoffee, mcoffee and rcoffee as well - hopefully they have similar interfaces so with some subclassing we won't have to duplicate a lot of the code. One other thought - do you think the EMBOSS water and needle wrappers (and any other alignment tools in EMBOSS) be made available under Bio.Align.Applications (via an import in Bio/Align/Applications/__init__.py so no code duplication)? 
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue May 12 14:57:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 May 2009 19:57:24 +0100 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com> <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com> <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com> Message-ID: <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com> On Tue, May 12, 2009 at 7:23 PM, Bartek Wilczynski wrote: > > I was thinking that the mapping was going to change when we finally > switch to git. Then it would be a natural cause of events... > Otherwise, we would have another step in our transition. Whether it's > worth doing it, depends on how long we expect to be in the transition > between CVS and git. I'm happy that git will work, and that I personally know enough about the basics to manage. I'm not happy with the current github repository due to the history tag issue - but we know we can fix that now. Are you going to try removing the old tags and re-doing them on github? Does anyone know how the git provided "ViewCVS" equivalent shows tags in a file's history? 
I think we should now have a chat with the OBF (off list) about how we might go about installing git on their server. Commits can then be pushed out to github automatically (or pulled from github if we go the other way round). This would make several things easier: (1) Seamless continuation of existing user accounts (2) Keeping the snapshot code up to date: http://biopython.org/SRC/biopython/ (3) Having our own commit RSS feeds (not essential as this could be done on github) (4) Having automatic builds of the documentation (previously discussed as nice to have) Plus of course giving redundancy with the code mirrored on both OBF servers and GitHub :) Peter From bugzilla-daemon at portal.open-bio.org Tue May 12 15:45:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 12 May 2009 15:45:12 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905121945.n4CJjCFj023070@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #34 from cymon.cox at gmail.com 2009-05-12 15:45 EST ------- (In reply to comment #33) > Note for the TCoffee suite we could also consider adding xpresso, 3dcoffee, > mcoffee and rcoffee as well - hopefully they have similar interfaces so with > some subclassing we won't have to duplicate a lot of the code. With the latest version of t_coffee (and not the currently available Jaunty package!), these (ie the meta calls like mcoffee etc) are all covered by the "-mode" option. I just installed t_coffee from source and this appears to be the case. There are so many options and interdependencies in TCOFFEE, and its command line is clearly a moving target, that the interface may require more work before being released. 
> One other thought - do you think the EMBOSS water and needle wrappers (and any > other alignment tools in EMBOSS) be made available under Bio.Align.Applications > (via an import in Bio/Align/Applications/__init__.py so no code duplication)? Sounds good to me. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue May 12 19:04:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 May 2009 00:04:53 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> <20090413123219.GB5429@sobchak.mgh.harvard.edu> <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> <20090413134429.GE5429@sobchak.mgh.harvard.edu> <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com> Message-ID: <320fb6e00905121604q4c70d69ck35fb16210fb0efe2@mail.gmail.com> On Mon, Apr 13, 2009 at 2:49 PM, Peter wrote: > On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman wrote: >>> > ... Feel free to add away. >>> >>> I need to work on my delegation skills - that seems to have back fired ;) >> >> Oops. I honestly read that as "do I have your permission?" I can of >> course tackle this, but am a bit underwater now. > > Looking back, I was a bit ambiguous. I don't mind who does it - let's > see who has time free first. That's done in CVS now - plus a few other things like -die and -stdout. I've also done -outfile via the new base Emboss wrapper, as all the tools (so far at least) include this option. >>> Regarding adding -auto support, I have a question about the needle >>> wrapper and the gap parameters. Using the needle tool at the command >>> line will prompt for the gap parameters UNLESS the -auto argument has >>> been used. i.e. 
Without -auto, it makes sense to insist on the gap >>> parameters being included, which is what the current wrapper does. >>> However, if we add support for -auto, then these parameters can be >>> optional. We could handle this in the wrapper, but it would be messy >>> (and there may be similar questions with other EMBOSS tools). What do >>> you think - stick with the simple option of insisting the Biopython >>> user set the gap parameters, even if they are using -auto? >> >> I think we should stick with the simple option. These were meant to >> be pretty dumb specifiers that help users write more modular code than >> simply pasting in a raw string for the command line. Trying to get >> too fancy is probably overkill. > > Agreed. By putting the outfile argument on the base EMBOSS wrapper class, together with the related -filter and -stdout options, I was able to enforce a simple check that at least one of these is used, applicable to all the wrappers. This preserves the old safety check that the output file is required (unless using standard out via -filter and/or -stdout instead). Something similar could be done so that using -auto overrides the any "required" flags we have set (e.g. for gapopen in water), but this seems unnecessary to me (as discussed above). Peter From biopython at maubp.freeserve.co.uk Wed May 13 05:55:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 May 2009 10:55:06 +0100 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <20090505123656.GB15113@sobchak.mgh.harvard.edu> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> <20090505123656.GB15113@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com> On Mon, May 4, 2009, Peter wrote: >>> ... The (hardly used) existing blastall wrapper in >>> Bio/Blast/Applications.py gives the "-a" argument a human >>> readable name of "nprocessors", and "-A" gets "window_size". 
>>> With the old set_parameter call either alias could be used. >>> However, with a python property we need to pick one as a >>> preferred name - and I'm not 100% sure being helpful and >>> using "nprocessors" (e.g. cline.nprocessors=4) is actually >>> better than using the actual argument name (e.g. cline.a = 4). On Tue, May 5, 2009, Brad wrote: >> Could we support both the original argument and optional human >> readable arguments? I know the code in Application is a bit >> hard coded for the first argument as the real name and the last >> argument as the readable name; the cleanest solution would be to >> generalize this to have multiple names where it makes sense. >> ... On Tue, May 5, 2009, Peter wrote: > ... > I favour using only a single property for each parameter, with the > name as similar as possible to the actual command line switch (i.e. > property name "a" for "-a", not "nprocessors"). Note each property > would have a docstring which will say what is it for ("Number of > processors to use."). I still favour only using a single python property for each parameter, but after some work on the blastall wrapper last night, I am beginning to come round to your point of view. If a command line tool provides a long parameter name (some tools provide both short and long names for important parameters) we should use that rather than inventing our own [so no change here]. However, for tools like BLAST which *only* have cryptic single letter command line options (case sensitive), maybe we should be using a sensible human readable name for the associated property in the Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size" for "-A"). Having actually now tried using properties "a" and "A", the resulting python code is very cryptic - and only makes sense if you are familiar with the blastall arguments (and given there are so many of them, this is difficult!). 
It should be trivial to extend the documentation strings automatically
to include something like "This maps onto the XXX command line
argument" so that the mapping is clear to the user without having to
look at our source code.

Hopefully this gets the balance right between giving nice python code,
and staying faithful to the actual command line tool API.

Peter

From biopython at maubp.freeserve.co.uk Wed May 13 07:15:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 May 2009 12:15:35 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
 <20090505123656.GB15113@sobchak.mgh.harvard.edu>
 <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>
 <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com>
Message-ID: <320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com>

On Wed, May 13, 2009 at 11:50 AM, Cymon Cox wrote:
>> On Tue, May 5, 2009, Peter wrote:
>> > ...
>> > I favour using only a single property for each parameter, with the
>> > name as similar as possible to the actual command line switch (i.e.
>> > property name "a" for "-a", not "nprocessors"). Note each property
>> > would have a docstring which will say what is it for ("Number of
>> > processors to use.").
>>
>> I still favour only using a single python property for each parameter,
>
> A confusing issue arises where we have alternative names for options.
> That the following example from _Probcons.py:
>
>     _Option(["-c", "c", "--consistency", "consistency" ], ["input"],
>             lambda x: x in range(0,6),
>             0,
>             "Use 0 <= REPS <= 5 (default: 2) passes of consistency transformation",
>             0),
>
> >>> cmd = cmdline = ProbconsCommandline("probcons", input="blah")
> >>> cmd.c = 1
> >>> str(cmd)
> 'probcons blah '
> >>> cmd.set_parameter("c", 1)
> >>> str(cmd)
> 'probcons -c 1 blah '
> >>> cmd.consistency = 2
> >>> str(cmd)
> 'probcons -c 2 blah '
> >>> cmd.c = 5
> >>> str(cmd)
> 'probcons -c 2 blah '
>
> That is, the user needs to look at the code to figure out what the correct
> name is to use when assigning to the property. Is it possible to restrict
> the binding of attributes to the cmdline to only valid property names? An
> alternative would be to restrict all parameters to only one name and
> document the alternatives it covers (dont like this idea - see below).

Yes, you can use any of the defined aliases with set_parameter, and
they are all equally valid, and all do exactly the same thing. e.g.

    cmd = ProbconsCommandline("probcons", input="blah")
    cmd.set_parameter("c", 1)
    cmd.set_parameter("-c", 1)
    cmd.set_parameter("--consistency", 1)
    cmd.set_parameter("consistency", 1)

I would however regard set_parameter as a legacy method and push the
(single) keyword argument or property alternative, for which there is
only one name (here "consistency"):

    cmd = ProbconsCommandline("probcons", input="blah")
    cmd.consistency = 1

or,

    cmd = ProbconsCommandline("probcons", input="blah", consistency=1)

[And yes, we should have some error checking code in the base class
__init__ method to make sure the string used is a valid python
identifier.]

The user does NOT have to look at the source code to find this out -
just the docstrings or properties - try help(cmd) or dir(cmd) in python.

>> but after some work on the blastall wrapper last night, I am
>> beginning to come round to your point of view.
>>
>> If a command line tool provides a long parameter name (some tools
>> provide both short and long names for important parameters) we
>> should use that rather than inventing our own [so no change here].
>>
>> However, for tools like BLAST which *only* have cryptic single letter
>> command line options (case sensitive), maybe we should be using
>> a sensible human readable name for the associated property in the
>> Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size"
>> for "-A"). Having actually now tried using properties "a" and "A",
>> the resulting python code is very cryptic - and only makes sense
>> if you are familiar with the blastall arguments (and given there are
>> so many of them, this is difficult!).
>
> I dont agree. If you want to make your python code legible to people
> who are not familar with the command line options, you can just
> comment it. I think the interfaces should stick as close as possible
> to the application documentation. I see these interfaces being used
> mostly by people who are familar with the applications, in which case
> the command line construction should be fairly intuitive.

Well, I am on the fence here. The trouble is that sometimes (e.g. BLAST)
the command line parameters themselves are just so cryptic. Yes, we
could just use "a" and "A", and leave it up to the user to document
their code. If we use "nprocessors" and "window_size" the code
becomes self-documenting (although you have to know Biopython's mapping).

Brad's suggestion to support both in the property and keyword arguments
brings us back to having multiple choices on how to set a parameter
(as in set_parameter with its aliases), which is confusing and unpythonic.
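As a rough sketch of the "one property name, many legacy aliases" idea discussed above, and of the request to reject assignment to names that are not valid properties: the classes below are hypothetical (Option and SketchCommandline are not the Bio.Application classes), and they assume the convention mentioned earlier in the thread that the first alias is the real command line switch and the last alias is the readable name.

```python
class Option:
    """One command line option with its aliases (hypothetical stand-in)."""

    def __init__(self, names, description):
        self.names = names              # e.g. ["-c", "c", "--consistency", "consistency"]
        self.description = description
        self.value = None

    @property
    def flag(self):
        # Assumed convention: the first alias is what goes on the command line.
        return self.names[0]

    @property
    def property_name(self):
        # Assumed convention: the last alias is the single Python-facing name.
        return self.names[-1]


class SketchCommandline:
    """Builds a command line string; only one property name per option."""

    def __init__(self, program, options, **kwargs):
        # Bypass our own __setattr__ while storing internal state.
        object.__setattr__(self, "program", program)
        object.__setattr__(self, "_options", options)
        for name, value in kwargs.items():
            setattr(self, name, value)

    def set_parameter(self, name, value):
        # Legacy style: any alias is accepted.
        for option in self._options:
            if name in option.names:
                option.value = value
                return
        raise ValueError("Option name %r was not found." % name)

    def __setattr__(self, name, value):
        # Property style: only the single preferred name is accepted, so a
        # typo or an alias fails loudly instead of being silently ignored.
        for option in self._options:
            if name == option.property_name:
                option.value = value
                return
        raise AttributeError("%r is not a valid property name." % name)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        for option in self.__dict__.get("_options", []):
            if name == option.property_name:
                return option.value
        raise AttributeError(name)

    def __str__(self):
        parts = [self.program]
        for option in self._options:
            if option.value is not None:
                parts.extend([option.flag, str(option.value)])
        return " ".join(parts)


cmd = SketchCommandline(
    "probcons",
    [Option(["-c", "c", "--consistency", "consistency"],
            "Passes of consistency transformation")],
)
cmd.consistency = 2           # the one property name
first = str(cmd)              # "probcons -c 2"
cmd.set_parameter("c", 5)     # legacy alias still works
second = str(cmd)             # "probcons -c 5"
```

With this arrangement, cmd.c = 1 raises AttributeError rather than quietly creating an unrelated attribute, which is the behaviour asked about in the quoted PROBCONS session.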
Peter From cy at cymon.org Wed May 13 06:50:54 2009 From: cy at cymon.org (Cymon Cox) Date: Wed, 13 May 2009 11:50:54 +0100 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> <20090505123656.GB15113@sobchak.mgh.harvard.edu> <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com> Message-ID: <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com> 2009/5/13 Peter > On Mon, May 4, 2009, Peter wrote: > >>> ... The (hardly used) existing blastall wrapper in > >>> Bio/Blast/Applications.py gives the "-a" argument a human > >>> readable name of "nprocessors", and "-A" gets "window_size". > >>> With the old set_parameter call either alias could be used. > >>> However, with a python property we need to pick one as a > >>> preferred name - and I'm not 100% sure being helpful and > >>> using "nprocessors" (e.g. cline.nprocessors=4) is actually > >>> better than using the actual argument name (e.g. cline.a = 4). > > On Tue, May 5, 2009, Brad wrote: > >> Could we support both the original argument and optional human > >> readable arguments? I know the code in Application is a bit > >> hard coded for the first argument as the real name and the last > >> argument as the readable name; the cleanest solution would be to > >> generalize this to have multiple names where it makes sense. > >> ... > > On Tue, May 5, 2009, Peter wrote: > > ... > > I favour using only a single property for each parameter, with the > > name as similar as possible to the actual command line switch (i.e. > > property name "a" for "-a", not "nprocessors"). Note each property > > would have a docstring which will say what is it for ("Number of > > processors to use."). > > I still favour only using a single python property for each parameter, A confusing issue arises where we have alternative names for options. 
That the following example from _Probcons.py: _Option(["-c", "c", "--consistency", "consistency" ], ["input"], lambda x: x in range(0,6), 0, "Use 0 <= REPS <= 5 (default: 2) passes of consistency transformation", 0), >>> cmd = cmdline = ProbconsCommandline("probcons", input="blah") >>> cmd.c = 1 >>> str(cmd) 'probcons blah ' >>> cmd.set_parameter("c", 1) >>> str(cmd) 'probcons -c 1 blah ' >>> cmd.consistency = 2 >>> str(cmd) 'probcons -c 2 blah ' >>> cmd.c = 5 >>> str(cmd) 'probcons -c 2 blah ' That is, the user needs to look at the code to figure out what the correct name is to use when assigning to the property. Is it possible to restrict the binding of attributes to the cmdline to only valid property names? An alternative would be to restrict all parameters to only one name and document the alternatives it covers (dont like this idea - see below). but after some work on the blastall wrapper last night, I am > beginning to come round to your point of view. > > If a command line tool provides a long parameter name (some tools > provide both short and long names for important parameters) we > should use that rather than inventing our own [so no change here]. > > However, for tools like BLAST which *only* have cryptic single letter > command line options (case sensitive), maybe we should be using > a sensible human readable name for the associated property in the > Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size" > for "-A"). Having actually now tried using properties "a" and "A", > the resulting python code is very cryptic - and only makes sense > if you are familiar with the blastall arguments (and given there are > so many of them, this is difficult!). I dont agree. If you want to make your python code legible to people who are not familar with the command line options, you can just comment it. I think the interfaces should stick as close as possible to the application documentation. 
I see these interfaces being used mostly by people who are familar with the applications, in which case the command line construction should be fairly intuitive. Cheers, C. -- From biopython at maubp.freeserve.co.uk Wed May 13 09:10:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 May 2009 14:10:59 +0100 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> <20090505123656.GB15113@sobchak.mgh.harvard.edu> <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com> <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com> <320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com> Message-ID: <320fb6e00905130610g3eb8edb4q99913b8b0ae14bf9@mail.gmail.com> On Wed, May 13, 2009 at 12:15 PM, Peter wrote: > > The user does NOT have to look at the source code to find this out - > just the docstrings or properties - try help(cmd) or dir(cmd) in python. > I've just updated the automatically generated docstrings for each property so that it includes the actual parameter name which will be used to build the string. Peter From bugzilla-daemon at portal.open-bio.org Wed May 13 11:01:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 13 May 2009 11:01:33 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905131501.n4DF1XYv019413@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #35 from cymon.cox at gmail.com 2009-05-13 11:01 EST ------- Ive added some very basic unittests for the command line interfaces, which dont require the applications to be installed. test_Application_Commandlines.py - currently in only includes Bio/Align/Applications but Bio/Emboss tests could be added. 
Note that the _Mafft.py command line interface is currently broken due to the
restriction of only having a single instance of a parameter on the command line.
Mafft uses the following option:

--seed alignment1 [--seed alignment2 --seed alignment3 ...]

We could remove support for this option in Mafft.

C.

From bugzilla-daemon at portal.open-bio.org Wed May 13 11:23:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 13 May 2009 11:23:34 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: 
Message-ID: <200905131523.n4DFNYX7021233@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815

------- Comment #36 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-13 11:23 EST -------
(In reply to comment #35)
>
> Note that the _Mafft.py command line interface is currently broken due to the
> restriction of only having a single instance of a parameter on the command line.
> Mafft uses the following option:
>
> --seed alignment1 [--seed alignment2 --seed alignment3 ...]
>
> We could remove support for this option in Mafft.

Removing the --seed argument might be a pragmatic short term solution.

I'd considered this type of thing as a possible corner case - but hadn't
mentioned it as I didn't have a concrete example. I would suggest setting the
parameter value to a list could work: i.e. Support any of:

    cline = MafftCommandline(seed=["alignment1", "alignment2", "alignment3"])
    cline.set_parameter("seed", ["alignment1", "alignment2", "alignment3"])
    cline.seed = ["alignment1", "alignment2", "alignment3"]

giving:

    mafft --seed alignment1 --seed alignment2 --seed alignment3

We'd need to introduce a new _Option subclass for this.
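The list-valued idea above could be sketched roughly as follows. The helpers render_option and build_command are hypothetical and stand in for whatever a new _Option subclass would do; the sketch only shows how a list value expands into one switch per element, and it ignores shell quoting of file names.

```python
def render_option(flag, value):
    """Return the command line tokens for one option.

    A list or tuple repeats the flag once per element; None means the
    option was not set. (Hypothetical helper, not Biopython API.)
    """
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        tokens = []
        for item in value:
            tokens.extend([flag, str(item)])
        return tokens
    return [flag, str(value)]


def build_command(program, options):
    """Join the program name and (flag, value) pairs into one string."""
    tokens = [program]
    for flag, value in options:
        tokens.extend(render_option(flag, value))
    return " ".join(tokens)


seeded = build_command(
    "mafft", [("--seed", ["alignment1", "alignment2", "alignment3"])]
)
# seeded == "mafft --seed alignment1 --seed alignment2 --seed alignment3"
```

A single string value still produces a single switch, so existing single-valued options would be unaffected by this kind of change.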
A similar situation applies to optional argument lists, like the Unix zip command: zip zipfile file1 file2 file3 ... where there is a single output filename (here zipfile), and then one or more input files or filespecifiers (here three entries). Peter From winda002 at student.otago.ac.nz Thu May 14 00:53:42 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 14 May 2009 16:53:42 +1200 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files Message-ID: <4A0BA3D6.5070207@student.otago.ac.nz> I have been slowly adding some of the scripts I use most commonly to the cookbook section of the wiki (http://biopython.org/wiki/Category:Cookbook). Since I'm very much a dilettante at this programming business and the cookbook is meant as supplementary documentation for Biopython it's probably a good idea for someone that knows what they are doing to look at these things (Peter has been really helpful with this thus far, but it seems unfair to saddle one man with so much bad programming :) I've just added a recipe that uses the nexus class to concatenate multiple nexus files and provide some feedback if the taxa are not the same in each one: http://biopython.org/wiki/Concatenate_nexus Any thoughts? If you think you can make it clearer/quicker/better then you can edit it on the wiki or provide comments here or there.
Cheers, David From biopython at maubp.freeserve.co.uk Thu May 14 05:27:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 10:27:12 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <4A0BA3D6.5070207@student.otago.ac.nz> References: <4A0BA3D6.5070207@student.otago.ac.nz> Message-ID: <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> On Thu, May 14, 2009 at 5:53 AM, David Winter wrote: > > I have been slowly adding some of the scripts I use most commonly to the > cookbook section of the wiki (http://biopython.org/wiki/Category:Cookbook). > Since I'm very much a dilettante at this programming business and the > cookbook is meant as supplementary documentation for Biopython it's probably > a good idea for someone that knows what they are doing to look at these > things (Peter has been really helpful with this thus far, but it seems > unfair to saddle one man with so much bad programming :) > > I've just added a recipe that uses the nexus class to concatenate multiple > nexus files and provide some feedback if the taxa are not the same in each > one: http://biopython.org/wiki/Concatenate_nexus > > Any thoughts? If you think you can make it clearer/quicker/better then you > can edit it on the wiki or provide comments here or there. What exactly are you trying to achieve? A big Nexus file with lots of alignments (and trees) in it? When I talked to Frank about Nexus files, he said they should only ever hold one alignment matrix, hence Bio.AlignIO does not allow writing multiple alignments to a single Nexus file. If you have some real world examples of Nexus files holding more than one alignment matrix, please share them - then we can try and get Bio.AlignIO (and if need be Bio.Nexus) to cope with them directly!
Peter From cy at cymon.org Thu May 14 05:59:51 2009 From: cy at cymon.org (Cymon Cox) Date: Thu, 14 May 2009 10:59:51 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> Message-ID: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> 2009/5/14 Peter > On Thu, May 14, 2009 at 5:53 AM, David Winter > wrote: > > > > I have been slowly adding some of the scripts I use most commonly to the > > cookbook section of the wiki ( > http://biopython.org/wiki/Category:Cookbook). > > Since I'm very much a dilettante at this programming business and the > > cookbook is meant as supplementary documentation for Biopython it's > probably > > a good idea for someone that knows what they are doing to look at these > > things (Peter has been really helpful with this thus far, but it seems > > unfair to saddle one man with so much bad programming :) > > > > I've just added a recipe that uses the nexus class to concatenate > multiple > > nexus files and provide some feedback if the taxa are not the same in > each > > one: http://biopython.org/wiki/Concatenate_nexus > > > > Any thoughts? If you think you can make it clearer/quicker/better then > you > > can edit it on the wiki or provide comments here or there. > > What exactly are you trying to achieve? A big Nexus file with lots > of alignments (and trees) in it? The example David has given is very useful and a common procedure for phylogeneticists. Single gene/proteins tend to be aligned in separate alignment files and then concatenated into a so-called 'supermatrix'.
One thing I would question is the first line: "It's a good idea, if possible, to make species-level phylogenetic inferences bases on multiple genes because a) demographic processes can lead gene-trees to diverge from species trees and b) journal editors now this." Yes, it is a good idea to make inferences based upon the largest amount of data, but if demographic processes have led to some gene(s) that have diverged from the species tree, then this is a reason not to combine them. Phylogenetic inference assumes all data evolved on the same tree - typically one would analyse gene partitions individually to look for incongruence among partitions before combining the data. > When I talked to Frank about Nexus files, he said they should only > ever hold one alignment matrix, Well, that was my understanding as well. But, it may be wrong. I just tried it - p4 will read both matrices no problem, PAUP* (the de facto standard here) will execute both matrices ok presumably leaving just the last as the data in memory. I'll look into this further... Cheers C. -- ____________________________________________________________________ Cymon J.
Cox Centro de Ciencias do Mar Faculdade de Ciencias do Mar e Ambiente (FCMA) Universidade do Algarve Campus de Gambelas 8005-139 Faro Portugal Phone: +0351 289800909 ext 7909 Fax: +0351 289800051 Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com HomePage : http://biology.duke.edu/bryology/cymon.html -8.63/-6.77 From biopython at maubp.freeserve.co.uk Thu May 14 07:02:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 12:02:03 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> Message-ID: <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com> On Thu, May 14, 2009 at 10:59 AM, Cymon Cox wrote: >> What exactly are you trying to achieve? A big Nexus file with lots >> of alignments (and trees) in it? > > The example David has given is very useful and a common procedure for > phylogeneticists. Single gene/proteins tend to be aligned in separate > alignment files and then concatenated into a so-called 'supermatrix'. Oh right - I hadn't looked at David's example carefully enough earlier to work out which concatenation he was doing (by row or by column). It does make sense on re-reading. Concatenation to give a single supermatrix (same number of taxa, longer sequences) would be most elegantly done by sorting the three alignments (so the taxa are in the same order) and then concatenating them (by column). See Bug 2552, http://bugzilla.open-bio.org/show_bug.cgi?id=2552 Note that this procedure isn't specific to NEXUS files - you could do this with any alignment format. It is just fairly straightforward with the Bio.Nexus module at the moment (at least, until we fix Bug 2552).
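To make the "by column" concatenation concrete, here is a small format-independent sketch. It deliberately avoids Bio.Nexus and Bio.AlignIO and works on plain dictionaries mapping taxon name to aligned sequence; the function name concatenate_by_column and the choice of "?" for taxa missing from a gene are assumptions made for this illustration, not the behaviour of any Biopython function.

```python
def concatenate_by_column(alignments, missing="?"):
    """Build a supermatrix from a list of {taxon: aligned_sequence} dicts.

    Returns (supermatrix, partitions), where partitions holds the 1-based
    inclusive column range that each input alignment occupies.
    """
    # Sorting the taxa gives every alignment the same row order.
    taxa = sorted(set().union(*alignments))
    supermatrix = {taxon: "" for taxon in taxa}
    partitions = []
    start = 1
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must share a length")
        width = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this gene are padded with the missing symbol.
            supermatrix[taxon] += alignment.get(taxon, missing * width)
        partitions.append((start, start + width - 1))
        start += width
    return supermatrix, partitions


gene1 = {"t1": "ACGT", "t2": "ACGA"}
gene2 = {"t1": "GG-C", "t2": "GGTC", "t3": "GGTT"}
matrix, parts = concatenate_by_column([gene1, gene2])
for taxon in sorted(matrix):
    print(taxon, matrix[taxon])
print(parts)
```

The column ranges returned alongside the matrix are the sort of information a NEXUS character partition would record.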
Peter From biopython at maubp.freeserve.co.uk Thu May 14 07:11:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 12:11:30 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com> Message-ID: <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com> On Thu, May 14, 2009 at 12:02 PM, Peter wrote: > Oh right - I hadn't looked at David's example carefully enough earlier > to work out which concatenation he was doing (by row or by column). > It does make sense on re-reading. I'd rephrase this bit of the intro: It's a good idea, if possible, to make species-level phylogenetic inferences bases on multiple genes because a) demographic processes can lead gene-trees to diverge from species trees and b) journal editors now this. Most of the alignment files supported by Biopython allow you to write multiple alignments to the same file which makes this easy. However, the nexus file format (used by PAUP* and Mr Bayes) does not. In nexus files multiple alignments need to be represented as different 'character partitions' within a data matrix that contains one long sequence for each taxon. Bio.AlignIO will in general write out one or more alignments to a file. It does NOT do any concatenation by column, required to give the "supermatrix" which you want (which is why I got confused on the first reading). How about: It's a good idea, if possible, to make species-level phylogenetic inferences based on multiple genes because (a) demographic processes can lead gene-trees to diverge from species trees and (b) journal editors know this. [add stuff from Cymon's comment here?]
This is usually handled by creating a single "supermatrix" from separate alignments for each gene. i.e. You need a single alignment containing one row for each taxon where the rows are the concatenated pre-aligned sequences. In NEXUS files (used by PAUP* and Mr Bayes) multiple alignments can be explicitly represented as different 'character partitions' within a data matrix that contains one long sequence for each taxon. The Bio.Nexus module makes this relatively straightforward. Peter From cy at cymon.org Thu May 14 07:30:20 2009 From: cy at cymon.org (Cymon Cox) Date: Thu, 14 May 2009 12:30:20 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> Message-ID: <7265d4f0905140430j47b0a661jd58dbe5749e4a1f7@mail.gmail.com> 2009/5/14 Cymon Cox > 2009/5/14 Peter > >> When I talked to Frank about Nexus files, he said they should only >> ever hold one alignment matrix, > > > Well, that was my understanding as well. But, it may be wrong. I just tried > it - p4 will read both matrices no problem, PAUP* (the de facto standard > here) will execute both matrices ok presumably leaving just the last as the > data in memory. > > I'll look into this further... > After a quick scan of the spec, there appears to be only one oblique reference to this issue: "Although the NEXUS standard does not impose constraints on the number of blocks, particular programs will. For example, MacClade 3.07 does not allow more than one TAXA block in a file." So I read that to mean, you can have any number of similarly named blocks in a NEXUS file, ie multiple DATA, TAXA, CHARACTERS, TREES etc, and it's up to an individual application to decide how to deal with them.
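For anyone wanting to check their own files for repeated blocks, a rough sketch follows. It is not a NEXUS parser and does not use Bio.Nexus; the function name count_nexus_blocks is made up here, and it assumes each block opens with a "BEGIN name;" statement at the start of a line and that no such text is hidden inside [bracketed comments].

```python
import re
from collections import Counter


def count_nexus_blocks(text):
    """Count block names found in 'BEGIN name;' statements.

    Matching is case-insensitive and names are reported in upper case.
    Comments in square brackets are NOT handled by this sketch.
    """
    names = re.findall(r"(?im)^\s*begin\s+(\w+)\s*;", text)
    return Counter(name.upper() for name in names)


example = """#NEXUS
BEGIN TAXA;
END;
begin data;
end;
Begin Data;
End;
"""
counts = count_nexus_blocks(example)
for name in sorted(counts):
    print(name, counts[name])
```

A count above one for DATA, TAXA or TREES flags a file where different applications may well disagree about which block they end up using.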
This seems to be in practice what happens: PAUP* will read multiple blocks of the same name but only the last block of a particular name will remain in memory after the file has been parsed. On the other hand, P4 will read multiple DATA blocks and store the different alignments as separate objects, and read multiple TREES blocks and store all the trees. C. From biopython at maubp.freeserve.co.uk Thu May 14 14:20:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 19:20:47 +0100 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL Message-ID: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> Hi, This is cross-posted between biopython-dev and biosql-l as it regards parsing the description (DE) lines in SwissProt files and how they are stored in BioSQL. This follows from an earlier discussion on biopython-dev. Older SwissProt files just had one or two DE lines, and it made sense to treat this as a simple string mapped onto the description field in the bioentry table in BioSQL. This appears to be what happens with BioPerl 1.5.x and in Biopython (although the details regarding white space differ). However, newer SwissProt files have many DE lines with additional structure.
The example Michiel gave earlier on the biopython-dev list was: http://www.uniprot.org/uniprot/Q9XHP0.txt This has the following DE lines:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

I had to fight with perl to get my old copy of BioPerl working again (some weak reference thing), but I managed, and then loaded this file into my test BioSQL database with:

$ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass XXX --namespace biosql_test --format swiss Q9XHP0.txt

Then I looked at the resulting description in the main bioentry table:

$ mysql --user=root -p biosql_test -e 'SELECT description FROM bioentry WHERE accession="Q9XHP0";'

This is stored as one huge long string (without the newlines, I'm not sure if BioPerl strips those in parsing the file, or when loading it into the database):

RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S globulin seed storage protein II; AltName: Full=Alpha-globulin; Contains: RecName: Full=11S globulin seed storage protein 2 acidic chain; AltName: Full=11S globulin seed storage protein II acidic chain; Contains: RecName: Full=11S globulin seed storage protein 2 basic chain; AltName: Full=11S globulin seed storage protein II basic chain; Flags: Precursor;

For Biopython, I emptied the database then did:

>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd = "XXX", host = "localhost", db="biosql_test")
>>> db = server["biosql-test"] #namespace
>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
1
>>> server.commit()

As before, I looked in the table with mysql. Again - this stores the full description from the DE line, although with the newlines embedded. So, Biopython is consistent with my old copy of BioPerl (1.5.x) if we ignore the white space. However, how does this look in BioPerl 1.6? If this is the same, are there any plans to change this? For Biopython we have discussed recording most of the DE information under the annotations instead (keyed off RecName, AltName, Contains, Flags), but I would like to be consistent with BioPerl+BioSQL. Thanks Peter From winda002 at student.otago.ac.nz Thu May 14 18:39:34 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 15 May 2009 10:39:34 +1200 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com> <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com> Message-ID: <4A0C9DA6.9060403@student.otago.ac.nz> Peter wrote: > On Thu, May 14, 2009 at 12:02 PM, Peter wrote: > >> Oh right - I hadn't looked at David's example carefully enough earlier >> to work out which concatenation he was doing (by row or by column). >> It does make sense on re-reading. >> Well, just about ;) > > I'd rephrase this bit of the intro: > Yep, that's much better. Thanks Peter and Cymon for your feedback on this, I've updated the intro to include it and a couple of specific examples of how you'd use the character partitions. (Have you guys seen this: doi.wiley.com/10.1111/j.1755-0998.2008.02164.x , you could write a paper from one function in your nexus module!)
cheers, david From biopython at maubp.freeserve.co.uk Fri May 15 05:05:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 15 May 2009 10:05:59 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <4A0C9DA6.9060403@student.otago.ac.nz> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com> <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com> <4A0C9DA6.9060403@student.otago.ac.nz> Message-ID: <320fb6e00905150205k31d95c84naac1fa7873461263@mail.gmail.com> On Thu, May 14, 2009 at 11:39 PM, David Winter wrote: >> >> I'd rephrase this bit of the intro: >> > > Yep, that's much better. Thanks Peter and Cymon for your feedback on this, > I've updated the intro to include it and a couple of specific examples of > how you'd use the character partitions. That does look much clearer now :) Could you include the three original alignments in the text? It would help to let the reader see what is going on (and could be used to reproduce the example). > (Have you guys seen this: doi.wiley.com/10.1111/j.1755-0998.2008.02164.x , > you could write a paper from one function in your nexus module!) From the abstract that does sound pretty trivial, but I guess that tool would be useful for non-programmers - even if you could probably rewrite it as one short python script using Biopython (or indeed a Perl script using BioPerl etc).
Peter From bugzilla-daemon at portal.open-bio.org Fri May 15 20:24:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 15 May 2009 20:24:29 -0400 Subject: [Biopython-dev] [Bug 2829] New: Biosequence.alphabet can be set to unknown after loading a nucleotide SeqRecord Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2829

Summary: Biosequence.alphabet can be set to unknown after loading a nucleotide SeqRecord
Product: Biopython
Version: 1.49
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: BioSQL
AssignedTo: biopython-dev at biopython.org
ReportedBy: david.wyllie at ndm.ox.ac.uk

Hi I have done the following
1 loaded a small nucleotide fasta file with SeqIO, setting the alphabet successfully
2 written it to a test database with BioSQL
3 reloaded it, at which point the reloaded object has a "SingleLetterAlphabet" alphabet and biosequence.alphabet is set to unknown.

Is this expected? The overall object was to add some SeqFeatures to the loaded SeqRecord, but it doesn't seem to store correctly even without any manipulations. Below demonstrates the problem. The system is Ubuntu 9 x64/ Python 2.6/ Biopython 1.49.
#!/usr/bin/env python
from BioSQL import BioSeqDatabase
from Bio.Alphabet import generic_nucleotide
from Bio import SeqIO
from Bio import Seq

# define variables needed for testing
username="myusername"
password="mypassword"
hostname="localhost"

# we are going to try to load a nucleotide fasta file into a BioSQL database
# need a test file, with inputfile the file name;
#>test_sequence
#ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcggtcgtctccgaactt
inputfile="/home/dwyllie/test.faa"

# we want to create a new BioSQL database, called test
dbname="test"
dbdescription="test of alphabet storage"

# we also want to remove one if it exists, for the purposes of testing
server = BioSeqDatabase.open_database(driver="MySQLdb", db="bioseqdb", user=username, passwd=password, host=hostname)

# if the database doesn't exist, we get an error, so we trap for that
try:
    server.remove_database(dbname)
    server.adaptor.commit()
except KeyError:
    print "Attempt to remove ",dbname," failed; going on to create a new one"

server = BioSeqDatabase.open_database(driver="MySQLdb", db="bioseqdb", user=username, passwd=password, host=hostname)
db = server.new_database(dbname, description=dbdescription)
server.adaptor.commit()

# set up a list to hold the mycobacterial sequences
selectedrecords = [] # Setup an empty list which we'll later write

# ifh is the input file handle;
ifh = open(inputfile, "rU")

# set a counter
recordsread=0
for record in SeqIO.parse(ifh, "fasta", generic_nucleotide):
    # increment counter
    recordsread=recordsread+1
    # just so we can reload it easily, we'll assign an id to this record
    # however, the problem does not depend on this,
    # nor on the nature of the defline, as far as I can tell
    record.id="IDENTIFIER_"+str(recordsread)
    print "** Note the sequence type of the Seq ** "
    print record
    # note that to this point it does appear to work, and the alphabet is correct.
    selectedrecords.append(record)

print inputfile, "total found ", recordsread
ifh.close()

# write it to the bioSQL database
print "Writing sequences to database"
db.load(selectedrecords)
server.adaptor.commit()

# subsequent attempts to write the re-loaded object fail because no alphabet is defined
print "However, the alphabet hasn't been stored."
loadedrecord=db.lookup(gi="IDENTIFIER_1")
print "Displaying re-loaded record"
print loadedrecord

# this can be confirmed by running
sqlcmd="""
select * from bioseqdb.biosequence, bioseqdb.bioentry, bioseqdb.biodatabase
where biodatabase.biodatabase_id= bioentry.biodatabase_id
and biosequence.bioentry_id=bioentry.bioentry_id
and biodatabase.name="test"
"""
print "This can be confirmed by examining bioseqdb.biosequence.alphabet, which is set to unknown; ", sqlcmd

From bugzilla-daemon at portal.open-bio.org Sat May 16 07:37:52 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 16 May 2009 07:37:52 -0400 Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic nucleotide alphabet In-Reply-To: Message-ID: <200905161137.n4GBbqKe018688@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2829 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
            Summary|Biosequence.alphabet can be |BioSQL does not record a
                   |set to unknown after loading|generic nucleotide alphabet
                   |a nucleotide SeqRecord      |

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-16 07:37 EST ------- Biopython has a relatively rich range of alphabets, including IUPAC ambiguous and unambiguous alphabets, plus ways to indicate gap characters and stop symbols.
The BioSQL range is much simpler, so some information is inevitably lost. In BioSQL, all we store is a simple string, "dna", "rna", "protein" or "unknown" (although BioJava used uppercase, so that is effectively allowed too). See: http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet This means if your sequence was using "IUPAC extended protein with a * stop codon", all we can record is "protein". i.e. On retrieval from a BioSQL database, the alphabet is simply a generic protein. Likewise "ambiguous IUPAC DNA with minus as the gap character" just becomes generic DNA. Note that as far as I know, currently none of the Bio* languages attempt to record "nucleotide" (i.e. "dna" or "rna"). This is something we should discuss on the BioSQL mailing list as a possible enhancement. So in answer to your question "Is this expected?", yes, a generic nucleotide alphabet isn't "dna", "rna" or "protein" so is currently recorded in the BioSQL database as "unknown". This gets turned into the SingleLetterAlphabet on retrieval. Changing title to "BioSQL does not record a generic nucleotide alphabet" and marking this as an enhancement. Peter P.S. Are you just testing here, or do you really not know if your sequence is DNA or RNA?
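The loss of information described in that comment can be mimicked with a toy round trip. This is only an illustration with plain strings: it does not use Bio.Alphabet or the BioSQL code, and the function names to_biosql_alphabet and from_biosql_alphabet, as well as the idea of describing an alphabet by a free-text label, are assumptions made for the sketch.

```python
BIOSQL_ALPHABETS = ("dna", "rna", "protein")

# What a stored value is turned back into on retrieval (descriptive labels only).
RETRIEVED = {
    "dna": "generic DNA",
    "rna": "generic RNA",
    "protein": "generic protein",
    "unknown": "single letter alphabet",
}


def to_biosql_alphabet(label):
    """Reduce a descriptive alphabet label to one of the stored strings."""
    words = label.lower().split()
    for name in BIOSQL_ALPHABETS:
        if name in words:
            return name
    # Anything else, including a generic nucleotide, has no slot of its own.
    return "unknown"


def from_biosql_alphabet(stored):
    """Map the stored string back to the generic alphabet it implies."""
    return RETRIEVED[stored.lower()]


for label in ("IUPAC extended protein with a * stop codon",
              "ambiguous IUPAC DNA with minus as the gap character",
              "generic nucleotide"):
    stored = to_biosql_alphabet(label)
    print(label, "->", stored, "->", from_biosql_alphabet(stored))
```

The last case is the one in this bug: a generic nucleotide goes in, "unknown" is stored, and a single letter alphabet comes back out.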
From bugzilla-daemon at portal.open-bio.org Sat May 16 07:54:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 16 May 2009 07:54:11 -0400 Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic nucleotide alphabet In-Reply-To: Message-ID: <200905161154.n4GBsBWZ019474@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2829 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-16 07:54 EST ------- See: http://lists.open-bio.org/pipermail/biosql-l/2009-May/001515.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bartek at rezolwenta.eu.org Sat May 16 13:39:18 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Sat, 16 May 2009 19:39:18 +0200 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com> <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com> <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com> <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com> Message-ID: <8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com> On Tue, May 12, 2009 at 8:57 PM, Peter wrote: > > I'm not happy with the current github repository due to the history > tag issue - but we know we can fix that now. 
Are you going to try > removing the old tags and re-doing them on github? I've finally found some time for it and fixed the tags in the main repository. I was able to run the update and it ran ok, I was also able to clone the repo from the official branch and see that they are OK in gitx. If anyone has problems with the tags, please let me know. > > Does anyone know how the git provided "ViewCVS" equivalent shows tags > in a file's history? If you are talking about gitweb, you can see it (for example: Makefile for linux 2.6.17) here: http://git.kernel.org/?p=linux/kernel/git/chrisw/linux-2.6.17.y.git;a=history;f=Makefile;h=79072d86297e78406791f0fc5764c35eb04fd07d;hb=78ace17e51d4968ed2355e8f708d233d1cc37f6d I've also installed gitweb on a copy of biopython repo on my server (not a permanent URL, not updated from trunk) http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=tree;hb=HEAD It shows the tags, but (as usual with git), the tags are only shown for the files which were affected by the particular commit marked with the tag. So this behavior is consistent with kernel.org and github. cheers Bartek From biopython at maubp.freeserve.co.uk Sat May 16 16:35:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 16 May 2009 21:35:36 +0100 Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com> <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com> <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com> <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com> <8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com> Message-ID: <320fb6e00905161335i28be05fay848dc18f86e728cf@mail.gmail.com> On 5/16/09, Bartek Wilczynski wrote: > On Tue, May 12, 2009 at 8:57 PM, Peter wrote: > > > > I'm not happy with the current github repository due to the history > > tag issue - but we know we can fix that now. Are you going to try > > removing the old tags and re-doing them on github? > > I've finally found some time for it and fixed the tags in the main repository. Great :) > I was able to run the update and it ran ok, I was also able to clone the repo > from the official branch and see that they are OK in gitx. If anyone > has problems with the tags, please let me know. I'll check with my Mac on Monday. > > Does anyone know how the git provided "ViewCVS" equivalent shows > > tags in a file's history? 
> > If you are talking about gitweb, you can see it (for example: Makefile > for linux 2.6.17) here: > > http://git.kernel.org/?p=linux/kernel/git/chrisw/linux-2.6.17.y.git;a=history;f=Makefile;h=79072d86297e78406791f0fc5764c35eb04fd07d;hb=78ace17e51d4968ed2355e8f708d233d1cc37f6d > > I've also installed gitweb on a copy of biopython repo on my server > (not a permanent URL, not updated from trunk) > http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=tree;hb=HEAD > > It shows the tags, but (as usual with git), the tags are only shown > for the files which were affected by the particular commit marked with > the tag. So this behavior is consistent with kernel.org and github. Thanks for those examples. I see what you mean, looking at Bio/Blast/NCBIXML.py in gitweb for example, no tags show up at all. On the other hand, for the NEWS file, some tags show up. Basically for what I want to use the tags for (identifying changes to a single file between two releases), gitweb doesn't work. Nor does github's history. This is a shame. I think the reason CVS (or SVN) seem to work better in this regard is that, like python, they care about individual files, while git works in terms of changes (which may affect multiple files). I'll see how I get on with the command line or graphical git history viewers and get back to you... Cheers, Peter From hlapp at gmx.net Sat May 16 18:34:57 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 18:34:57 -0400 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> Message-ID: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> Don't you love SwissProt (or UniProt as we must call it now I suppose). They (understandably) try to squeeze ever more annotation into the existing tags, rather than adding new tags.
So, of the following structure:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

really only the first line, with the 'RecName: Full=' removed, is the description line as we know it. The rest, I would say, is annotation, such as two alternative names, amino acid chains contained in the full record (shouldn't this be feature annotation, really? and indeed it is - why it needs to be repeated here is beyond me) and their names as well as alternative names, and the fact that the sequence is a precursor form. Leaving all this in one string has the advantage that we can round-trip it (and there is probably hardly any other way to accomplish that), but clearly in terms of semantics this isn't the sequence description as we know it anymore. Does anyone else also think that completely changing the semantics of sequence annotation fields is a bad idea? My inclination from a BioPerl perspective is to extract the part following 'RecName: Full=' as the description, and attach the rest as annotation. We could in fact use the TagTree class for this. I'm cross-posting to BioPerl too to gather what other BioPerl'ers think about this. -hilmar On May 14, 2009, at 2:20 PM, Peter wrote: > Hi, > > This is cross-posted between biopython-dev and biosql-l as it regards > parsing the description (DE) lines in SwissProt files and how they are > stored in BioSQL.
This follows from an earlier discussion on > biopython-dev > > Older SwissProt files just had one or two DE lines, and it made sense > to treat this as a simple string mapped onto the description field in > the bioentry table in BioSQL. This appears to be what happens with > BioPerl 1.5.x and in Biopython (although the details regarding white > space differ). However, newer SwissProt files have many DE lines with > additional structure. The example Michiel gave earlier on the > biopython-dev list was: > > http://www.uniprot.org/uniprot/Q9XHP0.txt > > This has the following DE lines: > > DE RecName: Full=11S globulin seed storage protein 2; > DE AltName: Full=11S globulin seed storage protein II; > DE AltName: Full=Alpha-globulin; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 acidic chain; > DE AltName: Full=11S globulin seed storage protein II acidic > chain; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 basic chain; > DE AltName: Full=11S globulin seed storage protein II basic chain; > DE Flags: Precursor; > > I had to fight with perl to get my old copy of BioPerl working again > (some weak reference thing), but I managed, and then loaded this file > into my test BioSQL database with: > > $ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass > XXX --namespace biosql_test --format swiss Q9XHP0.txt > > Then I looked at the resulting description in the main bioentry table: > > $ mysql --user=root -p biosql_test -e 'SELECT description FROM > bioentry WHERE accession="Q9XHP0";' > > This is stored as one huge long string (without the newlines, I'm not > sure if BioPerl strips those in parsing the file, or when loading it > into the database): > > RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S > globulin seed storage protein II; AltName: Full=Alpha-globulin; > Contains: RecName: Full=11S globulin seed storage protein 2 acidic > chain; AltName: Full=11S globulin seed storage protein II
acidic > chain; Contains: RecName: Full=11S globulin seed storage protein 2 > basic chain; AltName: Full=11S globulin seed storage protein II basic > chain; Flags: Precursor; > > For Biopython, I emptied the database then did: > >>>> from Bio import SeqIO >>>> from BioSQL import BioSeqDatabase >>>> server = BioSeqDatabase.open_database(driver="MySQLdb", >>>> user="root", passwd = "XXX", host = "localhost", db="biosql_test") >>>> db = server["biosql-test"] #namespace >>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss")) > 1 >>>> server.commit() > > As before, I looked in the table with mysql. Again - this stores the > full description from the DE line, although with the newlines > embedded. So, Biopython is consistent with my old copy of BioPerl > (1.5.x) if we ignore the white space. > > However, how does this look in BioPerl 1.6? If this is the same, are > there any plans to change this? For Biopython we have discussed > recording most of the DE information under the annotations instead > (keyed off RecName, AltName, Contains, Flags), but I would like to be > consistent with BioPerl+BioSQL. 
> > Thanks > > Peter > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Sat May 16 19:14:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 00:14:54 +0100 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> Message-ID: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> On 5/16/09, Hilmar Lapp wrote: > > Don't you love SwissProt (or UniProt as we must call it now I suppose). > They (understandably) try to squeeze ever more annotation into the existing > tags, rather than adding new tags. > > So, of the following structure: > > DE RecName: Full=11S globulin seed storage protein 2; > DE AltName: Full=11S globulin seed storage protein II; > DE AltName: Full=Alpha-globulin; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 acidic chain; > DE AltName: Full=11S globulin seed storage protein II acidic chain; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 basic chain; > DE AltName: Full=11S globulin seed storage protein II basic chain; > DE Flags: Precursor; > > really only the first line, with the 'RecName: Full=' removed, is the > description line as we know it. The rest, I would say, is annotation, such > as two alternative names, amino acid chains contained in the full record > (shouldn't this be feature annotation, really? 
and indeed it is - why it > needs to be repeated here is beyond me) and their names as well as > alternative names, and the fact that the sequence is a precursor form. > > Leaving all this in one string has the advantage that we can round-trip it > (and there is probably hardly any other way to accomplish that), but clearly > in terms of semantics this isn't the sequence description as we know it > anymore. > > Does anyone else think too that completely changing the semantics of > sequence annotation fields is a bad idea? +1 That's pretty much what I thought on seeing this the first time. > My inclination from a BioPerl perspective is to extract the part following > 'RecName: Full=' as the description, and attach the rest as annotation. We > could in fact use the TagTree class for this. I'm cross-posting to BioPerl > too to gather what other BioPerl'ers think about this. Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x, just treats the DE lines as one big long string? Could you translate your idea about the TagTree class into something concrete with BioSQL tables and fields for me? I'm not familiar with the TagTree (or Perl). Over on the Biopython list we'd talked about storing this annotation in a nested structure.
However, in order to use the BioSQL annotations mechanisms, I think a simple flat structure is required :( Peter From biopython at maubp.freeserve.co.uk Sat May 16 19:28:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 00:28:43 +0100 Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> Message-ID: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> On 5/17/09, Chris Fields wrote: > > On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote: > > My inclination from a BioPerl perspective is to extract the part following > > 'RecName: Full=' as the description, and attach the rest as annotation. We > > could in fact use the TagTree class for this. I'm cross-posting to BioPerl > > too to gather what other BioPerl'ers think about this. > > > > -hilmar > > > > This is much like the GN issues we've run into before, and we *could* set > this up using TagTree or similar. In the latter case of gene name the data > is stored in a text tree as follows: > > gene_names: > gene_name: > Name: GC1QBP > Synonyms: HABP1 > Synonyms: SF2P32 > Synonyms: C1QBP > > That could be changed to an XML string: > > > > > GC1QBP > HABP1 > SF2P32 > C1QBP > > > > Thinking about this we should attempt to coalesce around a standard instead > of forcing the other Bio* to a specific format. How would you record this in BioSQL? As an XML string for an annotation value? Brad has suggested JSON might be useful for this kind of thing (see also per-letter-annotation discussion). 
Peter From hlapp at gmx.net Sat May 16 19:37:14 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 19:37:14 -0400 Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> Message-ID: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> On May 16, 2009, at 7:28 PM, Peter wrote: >> That could be changed to an XML string: >> >> >> >> >> GC1QBP >> HABP1 >> SF2P32 >> C1QBP >> >> >> >> Thinking about this we should attempt to coalesce around a standard >> instead >> of forcing the other Bio* to a specific format. > > How would you record this in BioSQL? As an XML string for an > annotation value? Yes. A TagTree object can be serialized to XML, and the XML can be stored as the annotation value in BioSQL. As the XML can be read back in, it allows full round-tripping. > Brad has suggested JSON might be useful for this kind of thing (see > also per-letter-annotation discussion). JSON could be another serialization format, but XML is equally or better supported in all languages except JavaScript. Furthermore, you could just send the XML to the browser and have an XSLT (either directly, or indirectly through JavaScript doing the transformation) do the rendering. 
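The round trip described here can be illustrated in Python. This is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than BioPerl's TagTree, the element names are copied from the gene_names text tree quoted earlier in this thread, and the helper names to_xml and from_xml are made up for this illustration. The XML layout that BioPerl actually writes may well differ.

```python
import xml.etree.ElementTree as ET

# Nested annotation modelled as a list of entries, each a list of (tag, value)
# pairs. The tag names follow the gene_names text tree quoted in this thread.
gene_names = [
    [
        ("Name", "GC1QBP"),
        ("Synonyms", "HABP1"),
        ("Synonyms", "SF2P32"),
        ("Synonyms", "C1QBP"),
    ]
]


def to_xml(entries):
    """Serialize the nested entries to one XML string (hypothetical helper)."""
    root = ET.Element("gene_names")
    for entry in entries:
        gene = ET.SubElement(root, "gene_name")
        for tag, value in entry:
            ET.SubElement(gene, tag).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Rebuild the nested entries from the XML string (hypothetical helper)."""
    root = ET.fromstring(text)
    return [[(child.tag, child.text) for child in gene] for gene in root]


xml_value = to_xml(gene_names)   # one string, usable as a single annotation value
restored = from_xml(xml_value)   # the nested structure recovered from that string
```

Because the whole tree travels as one string, the database only ever sees a flat value; the nesting is rebuilt on the client side when the string is parsed again.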
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sat May 16 19:42:17 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 19:42:17 -0400 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> Message-ID: <8CD4EED1-A689-447F-8F6E-8D2204DD4E86@gmx.net> On May 16, 2009, at 7:14 PM, Peter wrote: > Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x > just > treats the DE lines as only big long string? Yes. > Could you translate your idea about the TagTree class into something > concrete with BioSQL tables and fields for me? [...] Over on the > Biopython list we'd talked about storing this annotation in a nested > structured. That's more or less what TagTree is. > However, in order to use the BioSQL annotations mechanisms, I think > a simple flat structure is required :( Not necessarily. If you have a flat serialization (such as XML) the nested structure isn't needed. Of course that's not a fully normalized relational representation, but if you had one, how often would it be used, how efficient would those queries be (SQL is poor at nested or recursive data structures), and how much pain would it be to write the object-relational mappings? 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Sun May 17 08:40:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 13:40:47 +0100 Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> Message-ID: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> On 5/17/09, Hilmar Lapp wrote: > > On May 16, 2009, at 7:28 PM, Peter wrote: > > > That could be changed to an XML string: > > > > > > > > > > > > > > > GC1QBP > > > HABP1 > > > SF2P32 > > > C1QBP > > > > > > > > > > > > Thinking about this we should attempt to coalesce around a standard > > > instead of forcing the other Bio* to a specific format. Absolutely - some common standard should be agreed. Would you envision doing this for other structured fields, inventing a new mini XML format each time? That seems open ended and likely to cause a lot of work keeping all the Bio* project synchronised. Here you have mapped RecName and AltName fields in the DE lines to Name and Synonyms (shouldn't that be Synonym singular?). I also don't get why you have used a gene_name entry inside a gene_names list. Would you hold the contains information and the flags information from the DE lines in separate XML entries? I would have gone for something much closer to the original DE line markup i.e. using the field names UniProt use, RecName and AltName, rather than mapping these to Name and Synonym. > > How would you record this in BioSQL? 
As an XML string for an annotation > > value? > > Yes. A TagTree object can be serialized to XML, and the XML can be stored > as the annotation value in BioSQL. As the XML can be read back in, it allows > full round-tripping. Assuming you stored all the DE markup, then yes, a round trip back to the SwissProt file could be possible. And, depending on the details of the XML structure used, it would be possible to represent this in a python structure too. > > Brad has suggested JSON might be useful for this kind of thing (see > > also per-letter-annotation discussion). > > JSON could be another serialization format, but XML is equally or better > supported in all languages except JavaScript. Furthermore, you could just > send the XML to the browser and have an XSLT (either directly, or indirectly > through JavaScript doing the transformation) do the rendering. I have no strong preference for either XML or JSON (but would rather avoid them if they are not really needed). For other types of annotation there may be a clearer advantage for one over the other, e.g. per letter annotation like the secondary structure of a protein sequence, or the quality scores of a nucleotide contig. On 5/17/09, Hilmar Lapp wrote: > Not necessarily. If you have a flat serialization (such as XML) the nested > structure isn't needed. Of course that's not a fully normalized relational > representation, but if you had one, how often would it be used, how > efficient would those queries be (SQL is poor at nested or recursive data > structures), and how much pain would it be to write the object-relational > mappings? In this example, searching the database using one of the SwissProt AltNames (synonyms), or filtering on the Flags sounds like a reasonable request - but this would be very difficult if the data is stored inside XML strings. Of course, because the RecName and AltName entries are top level, we could just record them as normal - simple strings in the annotations table. 
This seems much nicer. Likewise the "Flags: Precursor;" line. i.e. listing the tag/value pairs which could be used in the bioentry_qualifier_value table: AltName = "Full=11S globulin seed storage protein II" AltName = "Full=Alpha-globulin" Flags = "Precursor" (the RecName field, "Full=11S globulin seed storage protein 2", could be used for the bioentry.description instead) The above are all pretty easy. We only need to consider nesting (or something like XML or JSON) for some of the DE information, in the example discussed the Contains lines. Even this could be done by storing each contains entry as a single long string (holding both the name and synonyms) directly from the DE line itself, something like this: Contains = "RecName: Full=11S globulin seed storage protein 2 acidic chain;\nAltName: Full=11S globulin seed storage protein II acidic chain;" Contains = "RecName: Full=11S globulin seed storage protein 2 basic chain;\nAltName: Full=11S globulin seed storage protein II basic chain;" Peter From hlapp at gmx.net Sun May 17 11:21:59 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 17 May 2009 11:21:59 -0400 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> Message-ID: On May 17, 2009, at 8:40 AM, Peter wrote: > On 5/17/09, Hilmar Lapp wrote: >> >> On May 16, 2009, at 7:28 PM, Peter wrote: >>>> That could be changed to an XML string: >>>> >>>> >>>> >>>> >>>> GC1QBP >>>> HABP1 >>>> SF2P32 >>>> C1QBP >>>> >>>> >>>> >>>> Thinking about this we should attempt to coalesce around a standard >>>> instead of
forcing the other Bio* to a specific format. > > [...] Here you have mapped RecName and AltName fields in the DE > lines to > Name and Synonyms (shouldn't that be Synonym singular?). The example is for the GN lines in SwissProt, not the DE lines. > [...] > On 5/17/09, Hilmar Lapp wrote: >> Not necessarily. If you have a flat serialization (such as XML) the >> nested >> structure isn't needed. Of course that's not a fully normalized >> relational >> representation, but if you had one, how often would it be used, how >> efficient would those queries be (SQL is poor at nested or >> recursive data >> structures), and how much pain would it be to write the object- >> relational >> mappings? > > In this example, searching the database using one of the SwissProt > AltNames (synonyms), or filtering on the Flags sounds like a > reasonable request - but this would be very difficult if the data is > stored inside XML strings. Actually no. Modern full-text indexers (inside or outside the database) can index XML text columns right away and very well. In fact, for the last project that I built a full-text search for (on top of a BioSQL database) I did that by writing custom XML documents to a separate table for each record I wanted indexed. Oracle's full text indexer did the rest. I also built a separate identifier/name/ accession index that pulled all the gene names, symbols, accession numbers, identifiers etc into a single table for indexing. What I mean is, a fully normalized relational representation, especially if nested, is often not the most efficient data structure for efficient searching and filtering. 
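For comparison, the flat tag/value alternative that Peter outlines earlier in this thread can be sketched in a few lines of Python. Everything here is an assumption made for illustration: split_de is a hypothetical helper, not Biopython or BioPerl code; the annotation table is a stand-in, not the real BioSQL schema (which keys qualifier values on term ids); and the parser only knows the RecName, AltName, Contains and Flags entries seen in the Q9XHP0 example.

```python
import sqlite3

# DE line bodies from Q9XHP0, with the leading "DE" line code already removed.
DE_LINES = [
    "RecName: Full=11S globulin seed storage protein 2;",
    "AltName: Full=11S globulin seed storage protein II;",
    "AltName: Full=Alpha-globulin;",
    "Contains:",
    "RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "AltName: Full=11S globulin seed storage protein II acidic chain;",
    "Contains:",
    "RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "AltName: Full=11S globulin seed storage protein II basic chain;",
    "Flags: Precursor;",
]


def split_de(lines):
    """Return (description, [(tag, value), ...]) from DE line bodies."""
    description = None
    pairs = []
    block = None  # lines of the Contains block being collected, if any
    for line in lines:
        line = line.strip()
        tag, _, value = line.partition(":")
        value = value.strip().rstrip(";")
        if tag in ("Contains", "Flags"):
            # Either tag ends any Contains block that is still open.
            if block is not None:
                pairs.append(("Contains", "\n".join(block)))
            block = [] if tag == "Contains" else None
            if tag == "Flags":
                pairs.append(("Flags", value))
        elif block is not None:
            block.append(line)  # keep nested lines verbatim, as one string
        elif tag == "RecName" and description is None:
            # The first top-level RecName becomes the plain description.
            description = value[len("Full="):] if value.startswith("Full=") else value
        else:
            pairs.append((tag, value))
    if block is not None:
        pairs.append(("Contains", "\n".join(block)))
    return description, pairs


description, pairs = split_de(DE_LINES)

# Stand-in table (NOT the BioSQL schema): one row per tag/value pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE annotation (accession TEXT, tag TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [("Q9XHP0", tag, value) for tag, value in pairs],
)

# Searching by synonym is then a plain equality match on two columns.
hits = conn.execute(
    "SELECT accession FROM annotation WHERE tag = ? AND value = ?",
    ("AltName", "Full=Alpha-globulin"),
).fetchall()
```

The trade-off under discussion shows up directly: the top-level names and flags are searchable with ordinary SQL, while each Contains block is kept as one opaque string and would still need text matching or a serialization such as XML to query its parts.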
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bugzilla-daemon at portal.open-bio.org Sun May 17 18:53:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 17 May 2009 18:53:13 -0400 Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic nucleotide alphabet In-Reply-To: Message-ID: <200905172253.n4HMrDIX006938@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2829 ------- Comment #3 from david.wyllie at ndm.ox.ac.uk 2009-05-17 18:53 EST ------- (In reply to comment #2) > See: > http://lists.open-bio.org/pipermail/biosql-l/2009-May/001515.html > Hi thank you very much for explaining. I'm not sure this is a bug, it's a design feature due to my not understanding the implications of generic_nucleotide. I know it's DNA, and if one uses generic_dna instead in the testcase, all is well. Alphabets are explained clearly in the documentation. Thank you again. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon May 18 06:08:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 18 May 2009 06:08:45 -0400 Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic nucleotide alphabet In-Reply-To: Message-ID: <200905181008.n4IA8j0J015956@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2829 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WONTFIX ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-18 06:08 EST ------- (In reply to comment #3) > Hi > > thank you very much for explaining. > > I'm not sure this is a bug, it's a design feature due to my > not understanding the implications of generic_nucleotide. As I argued on the BioSQL mailing list, generic nucleotide sequences are a valid case not catered to at the moment. However, they are a corner case, and have no equivalent in BioPerl (which is happy to guess at DNA or RNA). Marking this bug as WON'T FIX. > I know it's DNA, and if one uses generic_dna instead in > the testcase, all is well. Good - if you know you have DNA, then specifying a DNA alphabet would be my recommended course of action. > Alphabets are explained clearly in the documentation. > Thank you again. Let us know if you find anything that needs further clarification in the documentation. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Mon May 18 09:38:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 May 2009 14:38:03 +0100 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> Message-ID: <320fb6e00905180638q29de63c4if0627eff416c4481@mail.gmail.com> On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp wrote: > > On May 17, 2009, at 8:40 AM, Peter wrote: >> >> [...] Here you have mapped RecName and AltName fields in the DE lines to >> Name and Synonyms (shouldn't that be Synonym singular?). > > The example is for the GN lines in SwissProt, not the DE lines. Ah, that probably explains some of my confusion. >> In this example, searching the database using one of the SwissProt >> AltNames (synonyms), or filtering on the Flags sounds like a >> reasonable request - but this would be very difficult if the data is >> stored inside XML strings. > > Actually no. Modern full-text indexers (inside or outside the database) can > index XML text columns right away and very well. In fact, for the last > project that I built a full-text search for (on top of a BioSQL database) I > did that by writing custom XML documents to a separate table for each > record I wanted indexed. Oracle's full text indexer did the rest. I also built a > separate identifier/name/accession index that pulled all the gene names, > symbols, accession numbers, identifiers etc into a single table for > indexing. OK, when I said searching "would be very difficult if the data is stored inside XML strings", maybe it wasn't so difficult for you - but that still sounds complicated! 
Sticking with the GN lines and the synonym, if this was stored as a simple tag/value as usual in BioSQL, I would write my SQL statement to search the annotation table where the term id was that associated with a GN synonym, and the annotation value was "HABP1". Simple. Using the XML approach, are you suggesting you could do a full text search on the annotation value field, looking for any rows where the field contains "HABP1", where the term id matches the GN lines' XML string? This sounds simplistic and probably rather slow - presumably why you resorted to the more complicated indexing scheme described above? > What I mean is, a fully normalized relational representation, especially if > nested, is often not the most efficient data structure for efficient > searching and filtering. OK. But do we really need to worry about complex nested structures for the SwissProt annotation (or in general)? Peter From biopython at maubp.freeserve.co.uk Tue May 19 10:23:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 May 2009 15:23:58 +0100 Subject: [Biopython-dev] [Biopython] Parsing large blast files In-Reply-To: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com> References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> <290052.25369.qm@web62407.mail.re1.yahoo.com> <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com> Message-ID: <320fb6e00905190723u2eca08e6o3f70bf37be79e4bf@mail.gmail.com> Last month on this thread we started talking about the BLAST command line wrappers: http://lists.open-bio.org/pipermail/biopython/2009-April/005134.html On Wed, Apr 29, 2009, Peter wrote: > On Wed, Apr 29, 2009, Michiel de Hoon wrote: >> >> How would users typically use Bio.Blast.Applications? 
>
> In the next release, I would aim to have Bio.Blast.Applications
> updated to cover blastall (fully), plus blastpgp and rpsblast
> (currently not covered) and for the three helper functions
> Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use
> Bio.Blast.Applications internally.

That should be done now in CVS - it turned out to be a lot more
tedious than I had expected, but I think we are OK.

I would be very grateful to have a couple of people test this out.
At the very least, just update your copy of Biopython and confirm
any existing scripts using the Bio.Blast.NCBIStandalone blastall,
blastpgp or rpsblast functions still work as expected.

Note we still need to agree on the preferred name for each parameter
(i.e. what do we use for the python properties) as discussed on this
thread:

http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005976.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006039.html

Peter

From biopython at maubp.freeserve.co.uk Tue May 19 13:00:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 May 2009 18:00:41 +0100
Subject: [Biopython-dev] Repeated options in command line interfaces
Message-ID: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com>

Hello all,

Yes - it's another thread about command line wrappers!

One of the Roche 454 off instrument applications is runMapping,
which in the most general situation allows you to map one or more
SFF files onto one or more FASTA files, e.g.

runMapping -o ~/test -ref example1.fasta example2.fasta -read data1.sff data2.sff

Notice that "-ref" and "-read" are not repeated, so we could treat
this via the current application wrapper system as follows:

#These modules don't exist (yet):
from Bio.Sequencing.Applications import RunMappingCommandline
cline = RunMappingCommandline()
cline.ref = "example1.fasta example2.fasta"
cline.read = "data1.sff data2.sff"

This isn't very elegant, but would work.

Over on Bug 2815, Cymon and I have briefly discussed the --seed
parameter in Mafft, which is used to specify one or more alignment
files, e.g.

mafft ... --seed alignment1 --seed alignment2 --seed alignment3 ...

Notice that "--seed" is repeated before each value. I was thinking
it would be nice to treat this as a single property (seed) which
takes a list of strings as its value:

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline()
cline.seed = ["alignment1", "alignment2", ...]

or, equivalently:

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline(seed=["alignment1", "alignment2", ...])

or, using the old set_parameter approach,

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline()
cline.set_parameter("seed", ["alignment1", "alignment2", ...])

and similarly for a Roche wrapper, e.g.

#These modules don't exist (yet):
from Bio.Sequencing.Applications import RunMappingCommandline
cline = RunMappingCommandline()
cline.ref = ["example1.fasta", "example2.fasta"]
cline.read = ["data1.sff", "data2.sff"]

Doing this nicely would require two _Option subclasses in
Bio.Application, one for repeated options like "seed" in Mafft, and
one for multiple valued options like "ref" and "read" in the Roche
tools.

Does this sound sensible? Does anyone have any more examples?
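
The difference between the two proposed kinds of option can be shown with a small standalone sketch. The class names RepeatedOption and MultiValueOption and the render method are invented for this illustration; they are not the existing _Option classes in Bio.Application, and a real implementation would also need to quote filenames that contain spaces.

```python
class RepeatedOption:
    """Option whose flag is repeated before every value, like Mafft's --seed."""

    def __init__(self, flag):
        self.flag = flag

    def render(self, values):
        # "--seed a --seed b": one flag per value.
        return " ".join("%s %s" % (self.flag, value) for value in values)


class MultiValueOption:
    """Option whose flag appears once, followed by all values, like -ref."""

    def __init__(self, flag):
        self.flag = flag

    def render(self, values):
        # "-ref a b": the flag once, then every value; nothing if no values.
        if not values:
            return ""
        return " ".join([self.flag] + list(values))


seed = RepeatedOption("--seed").render(["alignment1", "alignment2", "alignment3"])
ref = MultiValueOption("-ref").render(["example1.fasta", "example2.fasta"])
read = MultiValueOption("-read").render(["data1.sff", "data2.sff"])
```

In both cases the caller passes a plain list of strings, so the property looks the same from the user's side and only the rendering of the command line differs.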
Peter From bugzilla-daemon at portal.open-bio.org Wed May 20 12:31:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 20 May 2009 12:31:24 -0400 Subject: [Biopython-dev] [Bug 2833] New: Features insertion on previous bioentry_id Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2833 Summary: Features insertion on previous bioentry_id Product: Biopython Version: 1.50 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P1 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: andrea at biodec.com Biopython 1.50 (also 1.50b it's the same code) python2.4 or python2.5 postgresql 8.3 BioSQL Schema 1.0.1 Problem: imagine to have 3 seqrecord (s1,s2,s3), imagine that - s1 == s3 (but from different sources....) in other words s1 and s3 are not the same object - s2 != s1 and s2 != s3 imagine to load a Biosql db in this order: - db.load([s1]) - db.load([s2]) - db.load([s3]) At the end of the loading i will have only 2 bioentry ID BUT the s3.features will be inserted on s2 seqrecord. --------------------------------------------------------------------------------------- More in details (documented behaviour): print s1 ID: ENST00000334859 Name: ENST00000334859 Description: Leucine-rich repeat and calponin homology domain-containing protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8] Number of features: 24 /source= /taxonomy=[] /keywords=[''] /accessions=['ENST00000334859'] /data_file_division=UNK /date=01-JAN-1980 /organism=. . /gi=ENST00000334859 Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA', IUPACAmbiguousDNA()) print s2 ID: ENST00000391466 Name: ENST00000391466 Description: CDNA FLJ44976 fis, clone BRAWH3001833. [Source:Uniprot/SPTREMBL;Acc:Q6ZQT1] Number of features: 8 /source= /taxonomy=[] /keywords=[''] /accessions=['ENST00000391466'] /data_file_division=UNK /date=01-JAN-1980 /organism=. . 
/gi=ENST00000391466 Seq('ATGACAGTGATTCTCTTTACCCAACTCACCGCACCCATGGCAGTGATTCTCTTT...TAG', IUPACAmbiguousDNA()) print s3 ID: ENST00000334859 Name: ENST00000334859 Description: Leucine-rich repeat and calponin homology domain-containing protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8] Number of features: 24 /source= /taxonomy=[] /keywords=[''] /accessions=['ENST00000334859'] /data_file_division=UNK /date=01-JAN-1980 /organism=. . /gi=ENST00000334859 Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA', IUPACAmbiguousDNA()) As you can see: - s1 and S3 are identical and s2 differs from them. - s1 and s3 has 24 features - s2 has 8 features STEP 1 (biosql insertion of s1) - db.load([s1]) - looking into the db: select bioentry_id, name, accession, identifier from bioentry; bioentry_id | name | accession | identifier | -------------+-----------------+-----------------+-----------------+ 39 | ENST00000334859 | ENST00000334859 | ENST00000334859 | (1 row) select * from seqfeature; select * from seqfeature; seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank ---------------+-------------+--------------+----------------+--------------+------ 291 | 39 | 27 | 15 | | 1 292 | 39 | 27 | 15 | | 2 293 | 39 | 27 | 15 | | 3 294 | 39 | 27 | 15 | | 4 295 | 39 | 27 | 15 | | 5 296 | 39 | 14 | 15 | | 6 297 | 39 | 14 | 15 | | 7 298 | 39 | 30 | 15 | | 8 299 | 39 | 30 | 15 | | 9 300 | 39 | 30 | 15 | | 10 301 | 39 | 30 | 15 | | 11 302 | 39 | 30 | 15 | | 12 303 | 39 | 30 | 15 | | 13 304 | 39 | 30 | 15 | | 14 305 | 39 | 30 | 15 | | 15 306 | 39 | 30 | 15 | | 16 307 | 39 | 30 | 15 | | 17 308 | 39 | 25 | 15 | | 18 309 | 39 | 25 | 15 | | 19 310 | 39 | 25 | 15 | | 20 311 | 39 | 25 | 15 | | 21 312 | 39 | 25 | 15 | | 22 313 | 39 | 26 | 15 | | 23 314 | 39 | 26 | 15 | | 24 (24 rows) STEP 2 (biosql insertion of s2) - db.load([s2]) - looking into the db: select bioentry_id, name, accession, identifier from bioentry; bioentry_id | name | accession | 
identifier -------------+-----------------+-----------------+----------------- 39 | ENST00000334859 | ENST00000334859 | ENST00000334859 40 | ENST00000391466 | ENST00000391466 | ENST00000391466 (2 rows) select * from seqfeature; seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank ---------------+-------------+--------------+----------------+--------------+------ 291 | 39 | 27 | 15 | | 1 292 | 39 | 27 | 15 | | 2 293 | 39 | 27 | 15 | | 3 294 | 39 | 27 | 15 | | 4 295 | 39 | 27 | 15 | | 5 296 | 39 | 14 | 15 | | 6 297 | 39 | 14 | 15 | | 7 298 | 39 | 30 | 15 | | 8 299 | 39 | 30 | 15 | | 9 300 | 39 | 30 | 15 | | 10 301 | 39 | 30 | 15 | | 11 302 | 39 | 30 | 15 | | 12 303 | 39 | 30 | 15 | | 13 304 | 39 | 30 | 15 | | 14 305 | 39 | 30 | 15 | | 15 306 | 39 | 30 | 15 | | 16 307 | 39 | 30 | 15 | | 17 308 | 39 | 25 | 15 | | 18 309 | 39 | 25 | 15 | | 19 310 | 39 | 25 | 15 | | 20 311 | 39 | 25 | 15 | | 21 312 | 39 | 25 | 15 | | 22 313 | 39 | 26 | 15 | | 23 314 | 39 | 26 | 15 | | 24 315 | 40 | 28 | 15 | | 1 316 | 40 | 28 | 15 | | 2 317 | 40 | 28 | 15 | | 3 318 | 40 | 28 | 15 | | 4 319 | 40 | 28 | 15 | | 5 320 | 40 | 28 | 15 | | 6 321 | 40 | 28 | 15 | | 7 322 | 40 | 28 | 15 | | 8 (32 rows) STEP 3 (biosql insertion of s3) - db.load([s3]) - looking into the db: select bioentry_id, name, accession, identifier from bioentry; bioentry_id | name | accession | identifier -------------+-----------------+-----------------+----------------- 39 | ENST00000334859 | ENST00000334859 | ENST00000334859 40 | ENST00000391466 | ENST00000391466 | ENST00000391466 (2 rows) select * from seqfeature; seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank ---------------+-------------+--------------+----------------+--------------+------ 291 | 39 | 27 | 15 | | 1 292 | 39 | 27 | 15 | | 2 293 | 39 | 27 | 15 | | 3 294 | 39 | 27 | 15 | | 4 295 | 39 | 27 | 15 | | 5 296 | 39 | 14 | 15 | | 6 297 | 39 | 14 | 15 | | 7 298 | 39 | 30 | 15 | | 8 299 | 39 | 30 | 15 
| | 9 300 | 39 | 30 | 15 | | 10 301 | 39 | 30 | 15 | | 11 302 | 39 | 30 | 15 | | 12 303 | 39 | 30 | 15 | | 13 304 | 39 | 30 | 15 | | 14 305 | 39 | 30 | 15 | | 15 306 | 39 | 30 | 15 | | 16 307 | 39 | 30 | 15 | | 17 308 | 39 | 25 | 15 | | 18 309 | 39 | 25 | 15 | | 19 310 | 39 | 25 | 15 | | 20 311 | 39 | 25 | 15 | | 21 312 | 39 | 25 | 15 | | 22 313 | 39 | 26 | 15 | | 23 314 | 39 | 26 | 15 | | 24 315 | 40 | 28 | 15 | | 1 316 | 40 | 28 | 15 | | 2 317 | 40 | 28 | 15 | | 3 318 | 40 | 28 | 15 | | 4 319 | 40 | 28 | 15 | | 5 320 | 40 | 28 | 15 | | 6 321 | 40 | 28 | 15 | | 7 322 | 40 | 28 | 15 | | 8 323 | 40 | 27 | 15 | | 1 324 | 40 | 27 | 15 | | 2 325 | 40 | 27 | 15 | | 3 326 | 40 | 27 | 15 | | 4 327 | 40 | 27 | 15 | | 5 328 | 40 | 14 | 15 | | 6 329 | 40 | 14 | 15 | | 7 330 | 40 | 30 | 15 | | 8 331 | 40 | 30 | 15 | | 9 332 | 40 | 30 | 15 | | 10 333 | 40 | 30 | 15 | | 11 334 | 40 | 30 | 15 | | 12 335 | 40 | 30 | 15 | | 13 336 | 40 | 30 | 15 | | 14 337 | 40 | 30 | 15 | | 15 338 | 40 | 30 | 15 | | 16 339 | 40 | 30 | 15 | | 17 340 | 40 | 25 | 15 | | 18 341 | 40 | 25 | 15 | | 19 342 | 40 | 25 | 15 | | 20 343 | 40 | 25 | 15 | | 21 344 | 40 | 25 | 15 | | 22 345 | 40 | 26 | 15 | | 23 346 | 40 | 26 | 15 | | 24 (56 rows) As you can easily see the 24 feature of s3 seqrecord has been added to the bioentry_id 40 (that was s2). ------------------------------------------------------------------------------------ The problem is not so easy to understand. I tried to have a look into the code of Loader.py and i found something: the code works in this way: 1) it tries to load the seqrecord using: load_seqrecord(self, record) this method as first thing tries to load the bioentry table with the method: _load_bioentry_table(self, record) this method at last thing tries to get the bioentry_id of the "just inserted" record with the db method: self.adaptor.last_id('bioentry') 2) then with the bioentry_id recovered from the first method it tries to fill the other tables...and also the seqfeature... 
3) In BioSQL (the schema), if you try to insert a record into the bioentry
table that has the same Identifier or Accession as an existing record, it
doesn't do anything and just reports "INSERT 0 0".

4) So, if you try to insert the s3 record, which has the same Accession and
Identifier as s1, the load_seqrecord(self, record) method will use the
bioentry_id of the s2 record (that is the output of
self.adaptor.last_id('bioentry')).

Other information may be transferred to s2 as well (not only the features).
For example, "dbxrefs" could suffer from the same problem.

I think the solution depends on what we expect from the code:
- if we expect a behaviour like "don't do anything with identical
  Accession/Identifier", it is better to check last_id before and after the
  insertion and return None if it is unchanged, then treat a None bioentry_id
  as a signal to skip the other BioSQL insertions.
- if we expect a "merge" behaviour, it is better to retrieve the bioentry_id
  of the object with the same Accession/Identifier, then verify that the two
  SeqRecords have identical sequences, and then merge
  features/annotations/dbxrefs, etc.
- other behaviours... other solutions...

Andrea

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
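[Editorial sketch] The failure mode described in this report can be
reproduced in miniature. The example below is only an analogy and rests on
several assumptions: it uses Python's sqlite3 module rather than PostgreSQL,
a toy two-table schema rather than the real BioSQL schema, "INSERT OR IGNORE"
as a stand-in for the silently skipped insert ("INSERT 0 0"), and
"SELECT MAX(bioentry_id)" as a stand-in for last_id('bioentry'). The
functions naive_load and checked_load are hypothetical and are not part of
BioSQL's Loader.py.

```python
import sqlite3


def make_db():
    """Create a toy schema in memory (NOT the real BioSQL schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE bioentry (
            bioentry_id INTEGER PRIMARY KEY,
            accession   TEXT NOT NULL UNIQUE
        );
        CREATE TABLE seqfeature (
            seqfeature_id INTEGER PRIMARY KEY,
            bioentry_id   INTEGER NOT NULL,
            rank          INTEGER NOT NULL
        );
        """
    )
    return conn


def _add_features(conn, bioentry_id, n_features):
    """Attach n_features placeholder features to the given bioentry_id."""
    for rank in range(1, n_features + 1):
        conn.execute(
            "INSERT INTO seqfeature (bioentry_id, rank) VALUES (?, ?)",
            (bioentry_id, rank),
        )


def naive_load(conn, accession, n_features):
    """Insert, then trust the 'last id' without checking the insert happened."""
    # OR IGNORE silently skips a duplicate accession (assumed analogue of
    # the "INSERT 0 0" result described in the report).
    conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession) VALUES (?)", (accession,)
    )
    # Assumed stand-in for last_id('bioentry'): the newest id in the table,
    # which belongs to the PREVIOUS record when nothing was inserted.
    bioentry_id = conn.execute(
        "SELECT MAX(bioentry_id) FROM bioentry"
    ).fetchone()[0]
    _add_features(conn, bioentry_id, n_features)
    return bioentry_id


def checked_load(conn, accession, n_features):
    """Return None (and add no features) when the bioentry insert was skipped."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession) VALUES (?)", (accession,)
    )
    if cur.rowcount != 1:
        return None  # duplicate accession: nothing inserted, so stop here
    bioentry_id = conn.execute(
        "SELECT bioentry_id FROM bioentry WHERE accession = ?", (accession,)
    ).fetchone()[0]
    _add_features(conn, bioentry_id, n_features)
    return bioentry_id


def feature_counts(conn):
    """Map each bioentry_id to the number of features attached to it."""
    rows = conn.execute(
        "SELECT bioentry_id, COUNT(*) FROM seqfeature GROUP BY bioentry_id"
    )
    return dict(rows.fetchall())
```

Loading s1 (24 features), s2 (8 features) and a duplicate of s1 through
naive_load leaves 32 features on the second entry, mirroring the tables in
the report. checked_load corresponds to the first option listed above: it
refuses to continue when the insert did not create a row.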
From bugzilla-daemon at portal.open-bio.org Wed May 20 16:25:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 20 May 2009 16:25:39 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200905202025.n4KKPdYT020904@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-20 16:25 EST -------
(In reply to comment #0)
> Biopython 1.50 (also 1.50b it's the same code)
> python2.4 or python2.5
> postgresql 8.3
> BioSQL Schema 1.0.1
>
> Problem:
> imagine to have 3 seqrecord (s1,s2,s3), ... load a Biosql db in this order:
> - db.load([s1])
> - db.load([s2])
> - db.load([s3])
>
> At the end of the loading i will have only 2 bioentry ID
> BUT the s3.features will be inserted on s2 seqrecord.

BioSQL will allow you to have multiple versions of the same record but they
must have different versions (e.g. s1.id="ENST00000334859.0" and
s3.id="ENST00000334859.1" should work).

The problem with your data is s1.id == s3.id, so I would expect them to get
the same accession and version (taken as zero). Therefore s3 should *fail* to
load.

I can try and reproduce this using the information given, but it would help
if you could attach the original sequence files to this bug.

Thanks,

Peter
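[Editorial sketch] The accession/version point made in comment #1 can be
illustrated with a small helper. This is only a sketch of the idea, assuming
the version is a trailing ".<digits>" suffix and defaults to zero when
absent; split_accession_version is a hypothetical name and is not claimed to
be how BioSQL's loader actually parses record ids.

```python
def split_accession_version(record_id):
    """Split 'ACCESSION.N' into (accession, N); version defaults to 0.

    Hypothetical helper for illustration only.
    """
    accession, sep, version = record_id.rpartition(".")
    if sep and accession and version.isdigit():
        return accession, int(version)
    # No usable ".N" suffix: treat the whole id as the accession.
    return record_id, 0


# Two records clash when they yield the same (accession, version) pair.
s1_key = split_accession_version("ENST00000334859")
s3_key = split_accession_version("ENST00000334859")
s3_versioned_key = split_accession_version("ENST00000334859.1")
```

Here s1_key and s3_key are equal, which is the clash described above, while
s3_versioned_key differs because it carries version 1.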
From bugzilla-daemon at portal.open-bio.org Wed May 20 17:07:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 20 May 2009 17:07:08 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200905202107.n4KL78te024053@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-20 17:07 EST -------
(In reply to comment #0)
> Biopython 1.50 (also 1.50b it's the same code)
> python2.4 or python2.5
> postgresql 8.3
> BioSQL Schema 1.0.1

What version of psycopg are you using? i.e. The python library for talking to
PostgreSQL.

Have you tried running Biopython's BioSQL unit tests? You'll need to
configure your settings in setup_BioSQL.py first.

If that looks good could you try updating to the latest Biopython from CVS
and retesting? I've added a basic check in test_BioSQL.py for duplicated
entries (using a GenBank file) which works on my machine using MySQL.
From bugzilla-daemon at portal.open-bio.org Thu May 21 06:31:42 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:31:42 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200905211031.n4LAVgvW019852@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #3 from andrea at biodec.com 2009-05-21 06:31 EST -------
Created an attachment (id=1299)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1299&action=view)
Pickled Seqrecord s1

From bugzilla-daemon at portal.open-bio.org Thu May 21 06:32:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:32:12 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200905211032.n4LAWBXC019888@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #4 from andrea at biodec.com 2009-05-21 06:32 EST -------
Created an attachment (id=1300)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1300&action=view)
Pickled Seqrecord s2
From bugzilla-daemon at portal.open-bio.org Thu May 21 06:32:28 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 06:32:28 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905211032.n4LAWSlA019903@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #5 from andrea at biodec.com 2009-05-21 06:32 EST ------- Created an attachment (id=1301) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1301&action=view) Pickled Seqrecord s3 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu May 21 06:34:46 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 06:34:46 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905211034.n4LAYkhC020056@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #6 from andrea at biodec.com 2009-05-21 06:34 EST ------- Hi Peter, i did 4 tests: [python2.4,python2.5]*[psycopg,psycopg2] with - biopython from "this morning" cvs. - psycopg.__version__ '1.1.21' - psycopg2.__version__ '2.0.7 (dec mx dt ext pq3)' in any case i've the same results: Make sure all records are correctly loaded. ... ok Make sure can't import records twice. ... FAIL Indepth check that SeqFeatures are transmitted through the db. ... ok Load SeqRecord objects into a BioSQL database. ... ok Get a list of all items in the database. ... ok Test retrieval of items using various ids. ... ok Check can add DBSeq objects together. ... ok Check can turn a DBSeq object into a Seq or MutableSeq. ... ok Make sure Seqs from BioSQL implement the right interface. ... 
ok Check SeqFeatures of a sequence. ... ok Make sure SeqRecords from BioSQL implement the right interface. ... ok Check that slices of sequences are retrieved properly. ... ok ====================================================================== FAIL: Make sure can't import records twice. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 374, in test_reload self.assert_("duplicate" in str(err).lower()) AssertionError ---------------------------------------------------------------------- Ran 12 tests in 23.815s FAILED (failures=1) i've 1 failure in "Make sure can't import records twice. ..." it seems interesting for the problem... Then i tried with python2.4, python2.5, psycopg, psycopg2 i attached the pickles of the 3 seqrecords so you can try by yourself... ########################################################### from BioSQL import BioSeqDatabase import cPickle server = BioSeqDatabase.open_database(driver = "psycopg2", user = 'postgres', passwd = "hidden", host = "dbservertest", db = 'test_biosql' ) ## LOAD SeqRecords from pickle s1=cPickle.load(open('s1.cpk')) s2=cPickle.load(open('s2.cpk')) s3=cPickle.load(open('s3.cpk')) ## LOAD INTO DB db=server.new_database('test') server.commit() db.load([s1]) db.load([s2]) db.load([s3]) db.adaptor.commit() ########################################################### I had always the same problem. So i prepare a buildout environment with the last Biopython and with a new psycopg2 library (for psycopg i had the latest). psycopg2.__version__ '2.0.11 (dt dec ext pq3)' The result from the test was the same The result from the upload (based on pickled seqrecords) was the same Thanks Andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 21 06:39:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:39:18 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200905211039.n4LAdIit020365@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-21 06:39 EST -------
(In reply to comment #6)
> Hi Peter,
> i did 4 tests: [python2.4,python2.5]*[psycopg,psycopg2]
> with
> - biopython from "this morning" cvs.
> - psycopg.__version__ '1.1.21'
> - psycopg2.__version__ '2.0.7 (dec mx dt ext pq3)'
>
> in any case i've the same results:
>
> Make sure all records are correctly loaded. ... ok
> Make sure can't import records twice. ... FAIL
> ...
> ======================================================================
> FAIL: Make sure can't import records twice.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_BioSQL.py", line 374, in test_reload
>     self.assert_("duplicate" in str(err).lower())
> AssertionError

OK - the unit test is doing what I expected, and the duplicate insertion is
failing. It's just that the error message is different from what I expected,
which should be trivial to fix. This means inserting the same GenBank record
twice fails (which is good).

However, the unit test doesn't reproduce your original issue. Hopefully
your pickled SeqRecord objects will help there...

Thanks,

Peter
From bugzilla-daemon at portal.open-bio.org Thu May 21 07:36:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 07:36:34 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200905211136.n4LBaYO8024199@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-21 07:36 EST -------
(In reply to comment #7)
> However, the unit test doesn't reproduce your original issue. Hopefully
> your pickled SeqRecord objects will help there...

Based on your example script in comment 6 with the pickled SeqRecord objects,
but using MySQL, I get an IntegrityError as expected:

Traceback (most recent call last):
...
IntegrityError: (1062, "Duplicate entry 'ENST00000334859-2-0' for key 2")

I get the same error with simplified records lacking any annotation or
features (I just saved your three records to a FASTA file and reloaded them).
So whatever is going wrong seems to be PostgreSQL specific (or at least, does
not affect MySQL).

I have updated test_BioSQL.py in CVS to cover more variations (revision
1.33), and hopefully the error message check should work on PostgreSQL as
well. It would be very helpful if you could test that.

Part of the new tests is a slight variation on your original example. Could
you try this:

db.load([s1])
server.commit()
db.load([s2])
server.commit()
db.load([s3])
server.commit()

This might tell us if the issue is with PostgreSQL not checking the key
constraints until the commit.

Peter
From chapmanb at 50mail.com Thu May 21 08:29:27 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 21 May 2009 08:29:27 -0400
Subject: [Biopython-dev] Repeated options in command line interfaces
In-Reply-To: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com>
References: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com>
Message-ID: <20090521122927.GM84112@sobchak.mgh.harvard.edu>

Hi Peter;

> Yes - it's another thread about command line wrappers!

It seems like y'all are unearthing every single crazy command line
option choice out there. Great to have this fleshed out.

> One of the Roche 454 off instrument applications is runMapping,
> which in the most general situation allows you to map one or
> more SFF files onto one or more FASTA files, e.g.
>
> runMapping -o ~/test -ref example1.fasta example2.fasta -read
> data1.sff data2.sff
[...]
> the --seed parameter in Mafft, which is used to specify one or more
> alignment files, e.g.
>
> mafft ... --seed alignment1 --seed alignment2 --seed alignment3 ...
>
> Notice that "--seed" is repeated before each value.
>
> I was thinking it would be nice to treat this as a single
> property (seed) which takes a list of strings as its value:
>
> from Bio.Align.Applications import MafftCommandline
> cline = MafftCommandline()
> cline.seed = ["alignment1", "alignment2", ...]
[...]
> #These modules don't exist (yet):
> from Bio.Sequencing.Applications import RunMappingCommandline
> cline = RunMappingCommandline()
> cline.ref = ["example1.fasta", "example2.fasta"]
> cline.read = ["data1.sff", "data2.sff"]

This makes good sense to me. It hides the actual nastiness a bit and
makes it clear in the code what is happening -- assigning multiple
parameters to a single option. It sounds like a great way to handle
it.
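[Editorial sketch] The two conventions discussed in this thread, a flag
repeated before every value (Mafft's --seed) and a flag given once followed
by all its values (runMapping's -ref and -read), could both be produced from
a single list-valued property. The helper below is a minimal illustration
under that assumption; expand_option and its repeat_flag parameter are
hypothetical names and do not describe the actual Bio.Application
implementation.

```python
def expand_option(flag, values, repeat_flag=True):
    """Turn one list-valued option into command line arguments.

    repeat_flag=True  gives: flag v1 flag v2 ...   (Mafft --seed style)
    repeat_flag=False gives: flag v1 v2 ...        (runMapping -ref style)
    """
    if isinstance(values, str):
        # A bare string would otherwise be split into single characters.
        raise TypeError("values must be a list of strings, not a string")
    values = list(values)
    if not values:
        return []  # option not set: contribute nothing to the command line
    if repeat_flag:
        args = []
        for value in values:
            args.extend([flag, value])
        return args
    return [flag] + values


seed_args = expand_option("--seed", ["alignment1", "alignment2"])
ref_args = expand_option(
    "-ref", ["example1.fasta", "example2.fasta"], repeat_flag=False
)
```

Returning a list of arguments rather than a single string is a design choice
of this sketch: it avoids quoting problems when file names contain spaces.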
Brad From bugzilla-daemon at portal.open-bio.org Thu May 21 11:04:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 11:04:40 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905211504.n4LF4ej0015238@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #9 from andrea at biodec.com 2009-05-21 11:04 EST ------- (In reply to comment #8) > (In reply to comment #7) > > However, the unit test doesn't reproduce your original issue. Hopefully > > your pickled SeqRecord objects will help there... > > Based on your example script in comment 6 with the pickled SeqRecord objects, > but using MySQL, I get an IntegrityError as expected: > > Traceback (most recent call last): > ... > IntegrityError: (1062, "Duplicate entry 'ENST00000334859-2-0' for key 2") > > I get the same error with simplified records lacking any annotation or features > (I just saved your three records to a FASTA file and reloaded them). So what > ever is going wrong seems to be PostgreSQL specific (or at least, does not > affect MySQL). According to me it's postgres specific the fact that i don't have any error at all. If biopython expects from postgres an error in this situation there are some problem in postgres (or in mine). > > I have updated test_BioSQL.py in CVS to cover more variations (revision 1.33), > and hopefully the error message check should work on PostgreSQL as well. It > would be very helpful if you could test that. This is te results of the test: it's the same on python2.4 and python2.5: Make sure can't import records with same ID (in one go). ... FAIL Make sure can't import records with same ID (in steps). ... FAIL Make sure can't import records with same ID (in steps with commit). ... FAIL Make sure can't import a single record twice (in one go). ... FAIL Make sure can't import a single record twice (in steps). ... 
FAIL Make sure can't import a single record twice (in steps with commit). ... FAIL Make sure all records are correctly loaded. ... ok Make sure can't reimport existing records. ... FAIL Indepth check that SeqFeatures are transmitted through the db. ... ok Load SeqRecord objects into a BioSQL database. ... ok Get a list of all items in the database. ... ok Test retrieval of items using various ids. ... ok Check can add DBSeq objects together. ... ok Check can turn a DBSeq object into a Seq or MutableSeq. ... ok Make sure Seqs from BioSQL implement the right interface. ... ok Check SeqFeatures of a sequence. ... ok Make sure SeqRecords from BioSQL implement the right interface. ... ok Check that slices of sequences are retrieved properly. ... ok ====================================================================== FAIL: Make sure can't import records with same ID (in one go). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 397, in test_duplicate_id_load err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! ====================================================================== FAIL: Make sure can't import records with same ID (in steps). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 410, in test_duplicate_id_load2 err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! ====================================================================== FAIL: Make sure can't import records with same ID (in steps with commit). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 424, in test_duplicate_id_load3 err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! 
====================================================================== FAIL: Make sure can't import a single record twice (in one go). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 361, in test_duplicate_load err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! ====================================================================== FAIL: Make sure can't import a single record twice (in steps). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 373, in test_duplicate_load2 err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! ====================================================================== FAIL: Make sure can't import a single record twice (in steps with commit). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 386, in test_duplicate_load3 err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! ====================================================================== FAIL: Make sure can't reimport existing records. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 463, in test_reload err.__class__.__name__ + "\n" + str(err)) AssertionError: OperationalError currval of sequence "bioentry_pk_seq" is not yet defined in this session ---------------------------------------------------------------------- Ran 18 tests in 26.938s FAILED (failures=7) > > Part of the new tests is a slight variation on your original example. 
Could > you try this: > > db.load([s1]) > server.commit() > db.load([s2]) > server.commit() > db.load([s3]) > server.commit() > >>> ## LOAD INTO DB >>> db.load([s1]) 1 >>> server.commit() >>> db.load([s2]) 1 >>> server.commit() >>> db.load([s3]) 1 >>> server.commit() >>> i don't have any errors!!! > This might tell us if the issue is with PostgreSQL not checking the key > constraints until the commit. > it seems that. If i try to do the insertion via SQL i don't have any errors. I just have a message of the type: INSERT 0 0 due to the fact the postgres doesn't insert anything. Andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu May 21 13:05:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 13:05:12 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905211705.n4LH5Ca6028981@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-21 13:05 EST ------- Well, some progress :) (In reply to comment #9) > This is te results of the test: it's the same on python2.4 and python2.5: > Make sure can't import records with same ID (in one go). ... FAIL > Make sure can't import records with same ID (in steps). ... FAIL > Make sure can't import records with same ID (in steps with commit). ... FAIL > Make sure can't import a single record twice (in one go). ... FAIL > Make sure can't import a single record twice (in steps). ... FAIL > Make sure can't import a single record twice (in steps with commit). ... FAIL > Make sure all records are correctly loaded. ... ok > Make sure can't reimport existing records. ... 
FAIL > Indepth check that SeqFeatures are transmitted through the db. ... ok > Load SeqRecord objects into a BioSQL database. ... ok > Get a list of all items in the database. ... ok > Test retrieval of items using various ids. ... ok > Check can add DBSeq objects together. ... ok > Check can turn a DBSeq object into a Seq or MutableSeq. ... ok > Make sure Seqs from BioSQL implement the right interface. ... ok > Check SeqFeatures of a sequence. ... ok > Make sure SeqRecords from BioSQL implement the right interface. ... ok > Check that slices of sequences are retrieved properly. ... ok > > ====================================================================== > FAIL: Make sure can't import records with same ID (in one go). > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_BioSQL.py", line 397, in test_duplicate_id_load > err.__class__.__name__ + "\n" + str(err)) > AssertionError: Exception > Should have failed! > ... Also the error formatting wasn't quite what I had intended, fixed in CVS. However, most of the tests are allowing duplicates to be recorded without any error (on PostgreSQL). This is bad. > ====================================================================== > FAIL: Make sure can't reimport existing records. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_BioSQL.py", line 463, in test_reload > err.__class__.__name__ + "\n" + str(err)) > AssertionError: OperationalError > currval of sequence "bioentry_pk_seq" is not yet defined in this session Interestingly the final test gives us an OperationalError about the bioentry table's primary key (presumably from our last_id method which would call the SQL statement "select currval('bioentry_pk_seq')"). This suggests some clues about what is going wrong. 
http://www.postgresql.org/docs/8.3/static/functions-sequence.html http://www.postgresql.org/docs/8.3/static/sql-createsequence.html See also: http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-pg.sql CREATE SEQUENCE bioentry_pk_seq; CREATE TABLE bioentry ( bioentry_id INTEGER DEFAULT nextval ( 'bioentry_pk_seq' ) NOT NULL , biodatabase_id INTEGER NOT NULL , taxon_id INTEGER , name VARCHAR ( 40 ) NOT NULL , accession VARCHAR ( 128 ) NOT NULL , identifier VARCHAR ( 40 ) , division VARCHAR ( 6 ) , description TEXT , version INTEGER NOT NULL , PRIMARY KEY ( bioentry_id ) , UNIQUE ( accession , biodatabase_id , version ) , -- CONFIG: uncomment one (and only one) of the two lines below. The -- first puts a uniqueness constraint on the identifier column alone; -- the other one puts a uniqueness constraint on identifier only -- within a namespace. -- UNIQUE ( identifier ) UNIQUE ( identifier , biodatabase_id ) ) ; CREATE INDEX bioentry_name ON bioentry ( name ); CREATE INDEX bioentry_db ON bioentry ( biodatabase_id ); CREATE INDEX bioentry_tax ON bioentry ( taxon_id ); I'm a little surprised all the other duplicate record tests show different behaviour. I have updated test_BioSQL.py to perform all these new duplicate tests on a clean database - which I probably should have done in the first place (CVS revision 1.35). [All these tests are passing on MySQL. Trying the example by hand triggers an IntegrityError.] Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 21 18:22:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 18:22:18 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200905212222.n4LMMIls028194@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #11 from andrea at biodec.com 2009-05-21 18:22 EST -------
So the problem is related to the different behaviour adopted by postgres
loaded with the biosql schema, with respect to mysql.

Sorry because i thought the problem was due to BioSQL because i didn't know
which was the "expected database behaviour".

Since we expect an error during insertion of a "duplicate" or "almost
duplicate" record... we have only to focus on the postgres biosql schema,
and why/where it differs from the mysql one.

I didn't have time to have a look at the differences between the various
"duplicate record tests". I will do so.

[i've tried postgres 8.4... and it's exactly the same]

From cy at cymon.org Thu May 21 18:52:39 2009
From: cy at cymon.org (Cymon Cox)
Date: Thu, 21 May 2009 23:52:39 +0100
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To: <200905212222.n4LMMIls028194@portal.open-bio.org>
References: <200905212222.n4LMMIls028194@portal.open-bio.org>
Message-ID: <7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com>

2009/5/21

> http://bugzilla.open-bio.org/show_bug.cgi?id=2833
>
> ------- Comment #11 from andrea at biodec.com 2009-05-21 18:22 EST -------
> So the problem is related to the different behaviour adopted by postgres
> loaded with the biosql schema, with respect to mysql.
>
> Sorry, I thought the problem was due to BioSQL because I didn't know
> which was the "expected database behaviour".
>
> Since we expect an error during insertion of a "duplicate" or "quite duplicate"
> record... we have only to focus on the postgres biosql schema, and why/where it
> differs from the mysql one.
>
> I didn't have time to look at the differences between the various
> "duplicate record tests". I will do so.
>
> [I've tried postgres 8.4... and it's exactly the same]

Hi Andrea,

The problem appears to be related to the BioSQL schema/PostGreSQL.

As you indicated, adding a duplicate entry to bioentry returns an "INSERT 0
0" and doesn't throw an IntegrityError, which is what the code is looking for
and presumably what MySQL throws.

The reason it doesn't throw an error is because of one (or both) of the RULES
in the schema:
rule_bioentry_i1
and/or
rule_bioentry_i2

If you delete these two rules, load the schema and try to do a duplicate
entry:

mytest=# insert into bioentry(bioentry_id, biodatabase_id, name, accession, version) values (2, 1, 'blah1', 'test4', 1);
INSERT 0 1
mytest=# select * from bioentry;
 bioentry_id | biodatabase_id | taxon_id | name  | accession | identifier | division | description | version
-------------+----------------+----------+-------+-----------+------------+----------+-------------+---------
           2 |              1 |          | blah1 | test4     |            |          |             |       1
(1 row)

mytest=# insert into bioentry(bioentry_id, biodatabase_id, name, accession, version) values (2, 1, 'blah1', 'test4', 1);
ERROR: duplicate key value violates unique constraint "bioentry_pkey"

we have an error rather than an "INSERT 0 0".

I'm going to assume that psycopg2 would pick up this error and throw an
IntegrityError, but I haven't taken it any further to check.

Cheers, C.
From hlapp at gmx.net Thu May 21 22:05:17 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 21 May 2009 22:05:17 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com> References: <200905212222.n4LMMIls028194@portal.open-bio.org> <7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com> Message-ID: <8C0BF1E3-15DF-4F89-AB57-7AE09B86BCCE@gmx.net> On May 21, 2009, at 6:52 PM, Cymon Cox wrote: > [...] > > Hi Andrea, > > The problem appears to be related to the BioSQL schema/PostGreSQL. > > As you indicated, adding a duplicate entry to bioentry returns a > "INSERT 0 > 0" and doesnt throw an IntegrityError which is what the code is > looking from > and presumably what MySQL throws. > > The reason it doesnt throw an error is because of one (or both) of > the RULES > in the schema: Indeed, I'd almost forgotten. The rules are there mostly as a remnant from earlier versions of PostgreSQL to support transactional loading the way bioperl-db (the object-relational mapping for BioPerl) is optimized. You probably don't need them anywhere else. -hilmar Bioperl-db is optimized such that entities that very likely don't exist yet in the database are attempted for insert right away. If the insert fails due to a unique key violation, the record is looked up (and then expected to be found). In Oracle and MySQL you can do this and the transaction remains healthy; i.e., you can commit the transaction later and all statements except those that failed will be committed. In PostgreSQL any failed statement dooms the entire transaction, and the only way out is a rollback. In this case, if you want the loading of one sequence record as one transaction, failing to insert a single feature record will doom the entire sequence load and you would need to start over with the sequence. 
To fix this, I wrote the rules, which in essence do do the lookups for PostgreSQL that the bioperl-db code would otherwise avoid, and on insert do nothing if the record is found, which results in zero rows affected when you would expect one (which is what bioperl-db cues off of and then triggers a lookup). The right way to do this meanwhile is to use nested transactions, which PostgreSQL supports since v8.0.x, but I haven't gotten around to implement support for that in Bioperl-db. -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bugzilla-daemon at portal.open-bio.org Thu May 21 23:56:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 23:56:13 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905220356.n4M3uDfM021127@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #12 from cymon.cox at gmail.com 2009-05-21 23:56 EST ------- After deleting the RULES in the BioSQL schema, all the new unittests pass. (All the RULES can be deleted as they are all there to circumvent the problem in Bioperl-db described by Hilmar Lapp on the biopython-dev list: http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html See also the comment in the schema.) C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
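The nested-transaction approach Hilmar mentions above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, the table is a cut-down hypothetical version of bioentry rather than the real BioSQL schema, and the names make_demo_db and insert_or_lookup are invented for the example. It is not code from bioperl-db or Biopython.

```python
import sqlite3

# Cut-down, hypothetical stand-in for the BioSQL bioentry table.
SCHEMA = """
CREATE TABLE bioentry (
    bioentry_id    INTEGER PRIMARY KEY,
    biodatabase_id INTEGER NOT NULL,
    name           TEXT NOT NULL,
    accession      TEXT NOT NULL,
    version        INTEGER NOT NULL,
    UNIQUE (accession, biodatabase_id, version)
)
"""


def make_demo_db():
    # isolation_level=None: BEGIN/COMMIT/SAVEPOINT are issued by hand.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def insert_or_lookup(conn, biodatabase_id, name, accession, version):
    """Try the insert first; on a unique-key violation look the row up.

    Returns (bioentry_id, created). The insert runs inside a savepoint so
    that a failure can be undone without abandoning the outer transaction.
    """
    cur = conn.cursor()
    cur.execute("SAVEPOINT try_insert")
    try:
        cur.execute(
            "INSERT INTO bioentry (biodatabase_id, name, accession, version)"
            " VALUES (?, ?, ?, ?)",
            (biodatabase_id, name, accession, version),
        )
        new_id = cur.lastrowid
        cur.execute("RELEASE SAVEPOINT try_insert")
        return new_id, True
    except sqlite3.IntegrityError:
        # Undo only the failed statement. SQLite would carry on without
        # this step; PostgreSQL needs it to keep the transaction usable.
        cur.execute("ROLLBACK TO SAVEPOINT try_insert")
        cur.execute("RELEASE SAVEPOINT try_insert")
        cur.execute(
            "SELECT bioentry_id FROM bioentry"
            " WHERE accession = ? AND biodatabase_id = ? AND version = ?",
            (accession, biodatabase_id, version),
        )
        return cur.fetchone()[0], False
```

With this pattern the duplicate is detected through the exception rather than through a zero row count, so the schema RULES would no longer be needed to keep a long-running load transaction alive.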
From bugzilla-daemon at portal.open-bio.org Fri May 22 04:41:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 04:41:39 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905220841.n4M8fd3w015716@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #13 from andrea at biodec.com 2009-05-22 04:41 EST ------- (In reply to comment #12) > After deleting the RULES in the BioSQL schema, all the new unittests pass. > > (All the RULES can be deleted as they are all there to circumvent the problem > in Bioperl-db described by Hilmar Lapp on the biopython-dev list: > > http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html > > See also the comment in the schema.) > > C. I've deleted the two rules, rule_bioentry_i1 rule_bioentry_i2 and then i run the tests: Make sure can't import records with same ID (in one go). ... ok Make sure can't import records with same ID (in steps). ... ok Make sure can't import records with same ID (in steps with commit). ... ok Make sure can't import a single record twice (in one go). ... ok Make sure can't import a single record twice (in steps). ... ok Make sure can't import a single record twice (in steps with commit). ... ok Make sure all records are correctly loaded. ... ok Make sure can't reimport existing records. ... ok Indepth check that SeqFeatures are transmitted through the db. ... ok Load SeqRecord objects into a BioSQL database. ... ok Get a list of all items in the database. ... ok Test retrieval of items using various ids. ... ok Check can add DBSeq objects together. ... ok Check can turn a DBSeq object into a Seq or MutableSeq. ... ok Make sure Seqs from BioSQL implement the right interface. ... ok Check SeqFeatures of a sequence. ... ok Make sure SeqRecords from BioSQL implement the right interface. ... 
ok
Check that slices of sequences are retrieved properly. ... ok

----------------------------------------------------------------------
Ran 18 tests in 58.371s

OK

with python2.4, python2.5, psycopg, psycopg2.
Everything seems to be ok.

I don't know which other possible effects could be triggered by this deletion.
But I think it should be inserted as soon as possible into the BioSQL
Schema/PostGreSQL (updating also the Test BioSQL schema/PostGreSQL).

After removing the rules I've run my own tests:
.....
>>> ## LOAD INTO DB
>>> db.load([s1])
1
>>> db.load([s2])
1
>>> db.load([s3])
Traceback (most recent call last):
  File "", line 1, in ?
  File "../BioSQL/BioSeqDatabase.py", line 442, in load
  File "../BioSQL/Loader.py", line 50, in load_seqrecord
  File "../BioSQL/Loader.py", line 550, in _load_bioentry_table
  File "../BioSQL/BioSeqDatabase.py", line 301, in execute
IntegrityError: duplicate key value violates unique constraint
"bioentry_accession_key"

And I've got the error, which is the expected normal behaviour.
So now I only have to trap the exception or pre-check for duplicates.

Many Thanks

Andrea

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri May 22 08:06:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 08:06:36 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200905221206.n4MC6aWo000368@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-22 08:06 EST -------
(In reply to comment #13)
> (In reply to comment #12)
> > After deleting the RULES in the BioSQL schema, all the new unittests pass.
> > > > (All the RULES can be deleted as they are all there to circumvent the > > problem in Bioperl-db described by Hilmar Lapp on the biopython-dev list: > > > > http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html > > > > See also the comment in the schema.) > > > > C. Well spotted Cymon - I'd missed that. > I've deleted the two rules, > rule_bioentry_i1 > rule_bioentry_i2 > > ... > with pythhon2.4, python2.5, psycopg, psycopg2. > Everything seems to be ok. > ... > After removing the rules i've run my own tests: > ..... > >>> ## LOAD INTO DB > >>> db.load([s1]) > 1 > >>> db.load([s2]) > 1 > >>> db.load([s3]) > Traceback (most recent call last): > File "", line 1, in ? > File "../BioSQL/BioSeqDatabase.py", line 442, in load > File "../BioSQL/Loader.py", line 50, in load_seqrecord > File "../BioSQL/Loader.py", line 550, in _load_bioentry_table > File "../BioSQL/BioSeqDatabase.py", line 301, in execute > IntegrityError: duplicate key value violates unique constraint > "bioentry_accession_key" > > And i've got the error, that is what it is expected as a normal behaviour. > So now i've only to trap the exception or pre-check duplications. Great. It will be down to BioSQL to change the schema (in conjunction with BioPerl), but Hilmar seems to be looking into this: http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html I suppose in the short term we could change our local copy of the schema used in the Biopython unit tests... Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
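Andrea's "trap the exception" option from comment #13 might look something like the sketch below. The helper name load_skipping_duplicates and its four callables are hypothetical and chosen only to keep the example self-contained; with BioSQL one would presumably pass a wrapper around db.load([record]), the adaptor's commit and rollback, and the driver's IntegrityError class, but that wiring is an untested assumption.

```python
def load_skipping_duplicates(records, load_one, commit, rollback, duplicate_error):
    """Load records one at a time, skipping those rejected as duplicates.

    load_one(record) performs the insert, commit() and rollback() act on
    the current transaction, and duplicate_error is the exception class the
    database driver raises on a unique-key violation. All four are supplied
    by the caller (hypothetical stand-ins in this sketch).
    """
    loaded, skipped = [], []
    for record in records:
        try:
            load_one(record)
        except duplicate_error:
            # On PostgreSQL the failed statement dooms the transaction,
            # so roll back before attempting the next record.
            rollback()
            skipped.append(record)
        else:
            # Commit per record so a later rollback cannot undo this one.
            commit()
            loaded.append(record)
    return loaded, skipped
```

Committing after every record is a design choice made here so that a rollback only ever discards the failed record; it trades some speed for simplicity.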
From biopython at maubp.freeserve.co.uk Fri May 22 08:27:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 13:27:06 +0100
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
Message-ID: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>

Hi all,

This is a continuation of a thread / bug report from Biopython (Bug 2833)
where attempting to import duplicate entries into BioSQL did not raise an
error on PostgreSQL (but does on MySQL). Cymon traced this to the RULES
present in the schema to help bioperl-db.

On Fri, May 22, 2009 at 3:05 AM, Hilmar Lapp wrote:
>
> On May 21, 2009, at 6:52 PM, Cymon Cox wrote:
>
>> [...]
>>
>> Hi Andrea,
>>
>> The problem appears to be related to the BioSQL schema/PostGreSQL.
>>
>> As you indicated, adding a duplicate entry to bioentry returns an "INSERT 0
>> 0" and doesn't throw an IntegrityError, which is what the code is looking
>> for and presumably what MySQL throws.
>>
>> The reason it doesn't throw an error is because of one (or both) of the
>> RULES in the schema:
>
> Indeed, I'd almost forgotten. The rules are there mostly as a remnant from
> earlier versions of PostgreSQL to support transactional loading the way
> bioperl-db (the object-relational mapping for BioPerl) is optimized. You
> probably don't need them anywhere else.
>
>         -hilmar
>
>
> Bioperl-db is optimized such that entities that very likely don't exist yet
> in the database are attempted for insert right away. If the insert fails due
> to a unique key violation, the record is looked up (and then expected to be
> found). In Oracle and MySQL you can do this and the transaction remains
> healthy; i.e., you can commit the transaction later and all statements
> except those that failed will be committed. In PostgreSQL any failed
> statement dooms the entire transaction, and the only way out is a rollback.
> In this case, if you want the loading of one sequence record as one > transaction, failing to insert a single feature record will doom the entire > sequence load and you would need to start over with the sequence. To fix > this, I wrote the rules, which in essence do do the lookups for PostgreSQL > that the bioperl-db code would otherwise avoid, and on insert do nothing if > the record is found, which results in zero rows affected when you would > expect one (which is what bioperl-db cues off of and then triggers a > lookup). > The right way to do this meanwhile is to use nested transactions, which > PostgreSQL supports since v8.0.x, but I haven't gotten around to implement > support for that in Bioperl-db. > Hilmar, It seems for Biopython to work properly with BioSQL on PostgreSQL these bioentry rules should be removed from the schema (as the comments in the schema do suggest). Obviously doing this would break any installation also using the current version of bioperl-db. Do the RULES affect BioJava or BioRuby using BioSQL on PostgreSQL? Are you happy to remove these RULES in BioSQL v1.0.x (after making the outlined transactional changes in bioperl-db)? Thanks, Peter From hlapp at gmx.net Fri May 22 11:03:11 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 22 May 2009 11:03:11 -0400 Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> Message-ID: On May 22, 2009, at 8:27 AM, Peter wrote: > Are you happy to remove these RULES in BioSQL v1.0.x (after > making the outlined transactional changes in bioperl-db)? In principle yes. It would also mean dropping support for PostgreSQL v7.x, but I would hope that that's a non-issue. But if anyone here is still using and relying on PostgreSQL v7.x (or earlier?) do let us know, please. 
-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Fri May 22 11:57:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 16:57:38 +0100
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To:
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
Message-ID: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>

On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp wrote:
>
> On May 22, 2009, at 8:27 AM, Peter wrote:
>
>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>> making the outlined transactional changes in bioperl-db)?
>
> In principle yes. It would also mean dropping support for PostgreSQL v7.x,
> but I would hope that that's a non-issue.
>
> But if anyone here is still using and relying on PostgreSQL v7.x (or
> earlier?) do let us know, please.

Great.

In the meantime could you add a big warning about this issue to the
INSTALL notes for PostgreSQL (i.e. recommend removing the RULES
section if not using bioperl-db)?
http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL

Peter

From biopython at maubp.freeserve.co.uk Fri May 22 12:06:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 17:06:21 +0100
Subject: [Biopython-dev] Peter at a conference next week
Message-ID: <320fb6e00905220906l2446afbfk9804599db74a4d66@mail.gmail.com>

Hi all,

Just to let you know I will be at a conference next week, so don't
expect (Biopython) email replies as promptly as usual.
I may even leave my laptop at home ;) Peter From hlapp at gmx.net Fri May 22 14:20:58 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 22 May 2009 14:20:58 -0400 Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> Message-ID: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar On May 22, 2009, at 11:57 AM, Peter wrote: > On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp wrote: >> >> On May 22, 2009, at 8:27 AM, Peter wrote: >> >>> Are you happy to remove these RULES in BioSQL v1.0.x (after >>> making the outlined transactional changes in bioperl-db)? >> >> In principle yes. It would also mean dropping support for >> PostgreSQL v7.x, >> but I would hope that that's a non-issue. >> >> But if anyone here is still using and relying on PostgreSQL v7.x (or >> earlier?) do let us know, please. > > Great. > > In the meantime could you add a big warning about this issue to the > INSTALL notes for PostgreSQL (i.e. recommend removing the RULES > section if not using bioper-db)? 
> http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL > > Peter -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bugzilla-daemon at portal.open-bio.org Fri May 22 14:37:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 14:37:21 -0400 Subject: [Biopython-dev] [Bug 2837] New: Reading Roche 454 SFF sequence read files in Bio.SeqIO Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2837 Summary: Reading Roche 454 SFF sequence read files in Bio.SeqIO Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Roche 454 sequencing returns the read data in SFF files, a documented binary format, capturing the sequence letters and qualities together with trimming information. It would be nice to support reading (and in the longer term also writing) these files directly with Bio.SeqIO. See this thread for background: http://lists.open-bio.org/pipermail/biopython/2009-April/005083.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri May 22 14:39:26 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 14:39:26 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200905221839.n4MIdQU5008555@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-22 14:39 EST ------- Created an attachment (id=1303) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1303&action=view) Bio/SeqIO/RocheSffIO.py This is a rough SeqIO parser constructing SeqRecord objects using a parser contributed by Jose Blanca. Additional work would be required for paired end reads - and even more work to be able to write out these files. Potentially Jose's parser could be exposed as a public module under Bio.Sequencing, but here is it just two private classes. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri May 22 14:40:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 19:40:45 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> Message-ID: <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> On Fri, Apr 17, 2009 at 12:08 PM, Peter wrote: > On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca wrote: >> Hi Peter: >> Here you have some code to read the sff files. > > Thanks - I'm not sure when I'll get to look at this, maybe next week. 
>
>> For the time being it creates a dict for the sequences. I'm not sure about
>> how to integrate the generated data in BioPython. The sequence and
>> qualities should go to a SeqRecord, but there is also the information
>> about the clipping.
>
> For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to
> be able to read and write SFF files, and to do that we'll have to record all
> the essential annotation (i.e. clipping) somehow.

I've had a look at your code this evening, and written a rough SeqIO
module using it, available here on enhancement Bug 2837,
http://bugzilla.open-bio.org/show_bug.cgi?id=2837

> Can you write SFF files?
>
>> For my work I use a kind of SeqRecord with a mask property and the
>> mask is a Location that shows which part of the sequence is ok. I don't
>> know if that's a valid model for BioPython.
>
> A mask could be done as a list of booleans, and we can treat it as
> another per-letter-annotation in the SeqRecord. I'm not sure if this
> is helpful or not.
>
> The Roche tools let you choose to extract trimmed reads as FASTA
> and QUAL, or untrimmed. Perhaps for reading SFF files with
> Bio.SeqIO we should get the user to choose between these
> options (e.g. format names "roche-sff" and "roche-sff-notrim")?

This would work...

> Roche's FASTA files use upper case for the trimmed region, and
> lower case for the start/end which would get trimmed off. This is
> simple and we could do this for Biopython too - meaning you'd get
> the same data if you read the SFF file directly, or used Roche's
> FASTA+QUAL files with SeqIO. Note that when reading an SFF
> file directly, we should probably record the real trim data as well.

In my current code, I decided to use the same quality trimming
representation that Roche use if converting the SFF file into FASTA
format (the leading and trailing trim regions are in lower case).
We may want to record the trim positions in the SeqRecord's
annotation as well.
>> There's also a couple of more tricks with the clipping. >> In theory there's clip_qual and clip_adapter, but in the files >> we've seen clip_adapter is always zero and clip_quality is used >> instead for both quality and adapter. I think we could generate >> one clipping combining both. Let me know what do you think. >> Also take into account that in some cases the generated clipping >> from the 454 software are just wrong. > > I'll need to learn more about the details before coming to any > conclusions about how to deal with this information in Biopython. Right now I have not looked at the left/right adaptor clipping information, as you found, in the example file I have looked at these fields are zero. Note I will be away for the next week, so am unlikely to respond to any emails on this. Peter From bugzilla-daemon at portal.open-bio.org Fri May 22 15:23:44 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 15:23:44 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200905221923.n4MJNiAe013574@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 spenthil at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |spenthil at gmail.com -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
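The case-based trim representation discussed in the sff reader thread, together with Jose's suggestion of combining the quality and adapter clips into one, could be sketched as below. The function names combined_clip and mask_by_case are invented for this example, and the convention that clip values are 1-based, inclusive, and 0 when unset reflects one reading of the SFF format description; treat both as assumptions rather than as the behaviour of the attached RocheSffIO.py.

```python
def combined_clip(num_bases, qual_left, qual_right, adapter_left, adapter_right):
    """Merge quality and adapter clip points into a single range.

    Values are assumed to be 1-based and inclusive, with 0 meaning
    "no clip set". Returns (left, right), also 1-based and inclusive.
    """
    left = max(1, qual_left, adapter_left)
    right = min(qual_right or num_bases, adapter_right or num_bases)
    return left, right


def mask_by_case(seq, left, right):
    """Upper case the kept bases left..right, lower case everything else."""
    start = left - 1            # convert to a 0-based slice start
    end = max(right, start)     # an empty kept region if right < left
    return seq[:start].lower() + seq[start:end].upper() + seq[end:].lower()
```

For instance, an eight-base read with a quality clip of 3 to 6 and no adapter clip gives the range (3, 6), and masking "ACGTACGT" with that range yields "acGTACgt". The numeric clip positions could still be stored in the record's annotation alongside the masked sequence.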
From bugzilla-daemon at portal.open-bio.org Fri May 22 17:16:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 17:16:07 -0400 Subject: [Biopython-dev] [Bug 2838] New: If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2838 Summary: If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database Product: Biopython Version: 1.49 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk I've been trying to annotate some microbial sequences; some are from genbank. So the proposed series of events was: 1) get sequences from genbank 2) store in BioSQL database called One 3) recover them from BioSql 4) annotate the recovered SeqRecords [this works, but isn't necessary for this problem to be reproduced - here, I'm making no changes at all to the SeqRecord] 5) store the annotated SeqRecords in a different BioSQL database called Two. The problem is that Step 5 fails when the original record was recovered from Genbank. The traceback (below) indicates a problem with the BioSQL loader in _load_bioentry_date Here is the screen output, including traceback. The program (attached) first loads a record from Genbank, writes it to One, recovers it from One; at this point it has changed, in particular in the way date fields are represented. the entrez load has a /date feature which is not a list /date=26-MAY-2005 while the reloaded version has two date fields /dates=['26-MAY-2005'] /date=['26-MAY-2005'] Whether this is relevant I'm not sure. The subsequent write of the recovered version to Two fails. As a control, I've checked that the original version can be written to Two successfully. 
I'm a novice with Python and Biopython so please accept my apologies if there is something obvious and very stupid responsible for this. --------------------------------------------------------------------------- dwyllie at dwyllie:~/programs/Project/src$ python dbtestcase.py OK, going to recover record 28804743 from genbank.... Record loaded looks like this: ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Load from Entrez completed, records= 1 Here is the loaded record: ======================================================================== ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. 
Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Now loading these records into a BioSQL database One. /var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet Creating a new database One ======================================================================== Load from database One completed, records= 1 ======================================================================== Here is the record recovered from database One: ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. 
Number of features: 5 /dates=['26-MAY-2005'] /ncbi_taxid=3225 /date=['26-MAY-2005'] /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon', 'Ceratodon purpureus'] /source=['chloroplast Ceratodon purpureus'] /references=[, , , ] /gi=28804743 /data_file_division=PLN /keywords=[''] /organism=Ceratodon purpureus /sequence_version=['1'] /accessions=['AB098727'] DBSeq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', DNAAlphabet()) ======================================================================== Creating a new database Two Traceback (most recent call last): File "dbtestcase.py", line 206, in from dbtestcase import AuthDetails File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 225, in DemonstrateProblem(problemgi,ad) File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 199, in DemonstrateProblem db2.load(listtoload) File "/var/lib/python-support/python2.6/BioSQL/BioSeqDatabase.py", line 430, in load db_loader.load_seqrecord(cur_record) File "/var/lib/python-support/python2.6/BioSQL/Loader.py", line 50, in load_seqrecord self._load_bioentry_date(record, bioentry_id) File "/var/lib/python-support/python2.6/BioSQL/Loader.py", line 577, in _load_bioentry_date self.adaptor.execute(sql, (bioentry_id, date_id, date)) File "/var/lib/python-support/python2.6/BioSQL/BioSeqDatabase.py", line 289, in execute self.cursor.execute(sql, args or ()) File "/var/lib/python-support/python2.6/MySQLdb/cursors.py", line 166, in execute self.errorhandler(self, exc, value) File "/var/lib/python-support/python2.6/MySQLdb/connections.py", line 35, in defaulterrorhandler raise errorclass, errorvalue _mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 1)' at line 1") dwyllie at dwyllie:~/programs/Project/src$ -- Configure bugmail: 
http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 22 17:19:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 17:19:03 -0400 Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database In-Reply-To: Message-ID: <200905222119.n4MLJ3d3026350@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2838 ------- Comment #1 from david.wyllie at ndm.ox.ac.uk 2009-05-22 17:19 EST ------- Created an attachment (id=1304) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1304&action=view) A python script which reproduces the error. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 22 18:46:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 18:46:04 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905222246.n4MMk4QO000548@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2839 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
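Regarding Bug 2838: the report shows the date annotation coming back from BioSQL as a list (['26-MAY-2005']) where the freshly parsed record had a plain string, but it does not establish that this is what breaks the second load. If it does turn out to be the cause, one possible workaround on the user side is to flatten such values before reloading. The helper below is a hypothetical sketch of that idea, not the fix adopted in Biopython.

```python
def scalar_annotation(annotations, key, default=None):
    """Return an annotation as a single value.

    If the stored value is a list or tuple (as seen for 'date' after a
    BioSQL round trip in the report), return its first element; an empty
    list gives the default. Other values are returned unchanged.
    """
    value = annotations.get(key, default)
    if isinstance(value, (list, tuple)):
        return value[0] if value else default
    return value
```

Applied to the two records in the report, both {'date': '26-MAY-2005'} and {'date': ['26-MAY-2005']} would give the string '26-MAY-2005'.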
From biopython at maubp.freeserve.co.uk Fri May 22 18:46:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 23:46:54 +0100 Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> Message-ID: <320fb6e00905221546i26edc7a2u2a02fb0d01c374ea@mail.gmail.com> On 5/22/09, Hilmar Lapp wrote: > Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar I've filed Bug 2839, hopefully this is what you had in mind: http://bugzilla.open-bio.org/show_bug.cgi?id=2839 Peter From chapmanb at 50mail.com Fri May 22 18:54:32 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 22 May 2009 18:54:32 -0400 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> Message-ID: <20090522225432.GU84112@sobchak.mgh.harvard.edu> Peter and Jose; I haven't used SFF files myself as we don't have a 454 machine, but do know of a couple of implementations of SFF TO Fastq/Fasta. Flower is a Haskell implementation: http://blog.malde.org/index.php/flower/ And PyroBayes is a 454 base caller: http://bioinformatics.bc.edu/marthlab/PyroBayes Depending on what you all end up doing, these might be useful as comparison points, or for wrapping with Application command lines. Brad > On Fri, Apr 17, 2009 at 12:08 PM, Peter wrote: > > On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca wrote: > >> Hi Peter: > >> Here you have some code to read the sff files. 
> > > > Thanks - I'm not sure when I'll get to look at this, maybe next week. > > > >> For the time being it creates a dict for the sequences. I'm not sure about > >> how to integrate the generated data in BioPython. The sequence and > >> qualities should go to a SeqRecord, but there is also the information > >> about the clipping. > > > > For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to > > be able to read and write SFF files, and to do that we'll have to record all > > the essential annotation (i.e. clipping) somehow. > > I've had a look at your code this evening, and written a rough SeqIO > module using it, available here on enhancement Bug 2837, > http://bugzilla.open-bio.org/show_bug.cgi?id=2837 > > > Can you write SFF files? > > > >> For my work I use a kind of SeqRecord with a mask property and the > >> mask is a Location that shows which part of the sequence is ok. I don't > >> know if that's a valid model for BioPython. > > > > A mask could be done as a list of booleans, and we can treat it as > > another per-letter-annotation in the SeqRecord. I'm not sure if this > > is helpful or not. > > > > The Roche tools let you choose to extract trimmed reads as FASTA > > and QUAL, or untrimmed. Perhaps for reading SFF files with > > Bio.SeqIO we should get the user to choose between these > > options (e.g. format names "roche-sff" and "roche-sff-notrim")? > > This would work... > > > Roche's FASTA files use upper case for the trimmed region, and > > lower case for the start/end which would get trimmed off. This is > > simple and we could do this for Biopython too - meaning you'd get > > the same data if you read the SFF file directly, or used Roche's > > FASTA+QUAL files with SeqIO. Note that when reading an SFF > > file directly, we should probably record the real trim data as well.
> > In my current code, I decided to use the same quality trimming > representation that Roche use if converting the SFF file into FASTA > format (the leading and trailing trim regions are in lower case). We > may want to record the trim positions in the SeqRecord's annotation > as well. > > >> There's also a couple of more tricks with the clipping. > >> In theory there's clip_qual and clip_adapter, but in the files > >> we've seen clip_adapter is always zero and clip_quality is used > >> instead for both quality and adapter. I think we could generate > >> one clipping combining both. Let me know what do you think. > >> Also take into account that in some cases the generated clipping > >> from the 454 software are just wrong. > > > > I'll need to learn more about the details before coming to any > > conclusions about how to deal with this information in Biopython. > > Right now I have not looked at the left/right adaptor clipping information, > as you found, in the example file I have looked at these fields are zero. > > Note I will be away for the next week, so am unlikely to respond to > any emails on this. > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From bugzilla-daemon at portal.open-bio.org Fri May 22 18:58:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 18:58:24 -0400 Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database In-Reply-To: Message-ID: <200905222258.n4MMwOXA001311@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2838 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-22 18:58 EST ------- (In reply to comment #0) > I've been trying to annotate some microbial sequences; some are from genbank. 
> So the proposed series of events was: > 1) get sequences from genbank > 2) store in BioSQL database called One > 3) recover them from BioSql > 4) annotate the recovered SeqRecords [this works, but isn't > necessary for this problem to be reproduced - here, I'm > making no changes at all to the SeqRecord] > 5) store the annotated SeqRecords in a different BioSQL database called Two. > > The problem is that Step 5 fails when the original record was recovered from > Genbank. > > The traceback (below) indicates a problem with the BioSQL loader in > _load_bioentry_date > ... > I'm a novice with Python and Biopython so please accept my apologies if > there is something obvious and very stupid responsible for this. What you are trying to do sounds very reasonable (although I have never actually needed to or tried to do this myself). You were right about the date thing, the loader code only expected a string, not a list. Fixed in CVS revision 1.40 of BioSQL/Loader.py, and I have also added a unit test for this use case in Tests/test_BioSQL.py revision 1.36. Note there is a known minor discrepancy with dates (see Bug 2681) when comparing the original SeqRecord to the DBSeqRecord after loading/retrieving from BioSQL. If you could confirm this solves your problem, I think we can close this bug. Thank you! Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
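The string-versus-list problem described in the comment above can be sketched in isolation. This is only an illustration, not the BioSQL loader itself: `normalize_date` is a hypothetical helper name, and the fallback for an empty list is an assumption that does not come from the bug report.

```python
from time import gmtime, strftime


def normalize_date(annotations):
    """Return one GenBank-style date string from an annotations dict.

    Hypothetical helper (not part of BioSQL): the "date" entry may be a
    plain string, or a list of strings as seen on records read back from
    a BioSQL database.
    """
    # Default to today's date in GenBank style, e.g. 14-SEP-2000.
    default = strftime("%d-%b-%Y", gmtime()).upper()
    date = annotations.get("date", default)
    if isinstance(date, list):
        # Assumption: take the first entry, or the default if the list is empty.
        date = date[0] if date else default
    return date


print(normalize_date({"date": "26-MAY-2005"}))
print(normalize_date({"date": ["26-MAY-2005"]}))
```

Both calls give the same single string, which is the form a parameterised SQL INSERT needs for its value column.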
From biopython at maubp.freeserve.co.uk Fri May 22 19:09:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 23 May 2009 00:09:56 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <20090522225432.GU84112@sobchak.mgh.harvard.edu> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> On 5/22/09, Brad Chapman wrote: > Peter and Jose; > I haven't used SFF files myself as we don't have a 454 machine, We don't have one in house either, and have instead out-sourced to a couple of sequencing centres in the UK with 454 machines. > but do know of a couple of implementations of SFF TO > Fastq/Fasta. > Flower is a Haskell implementation: > > http://blog.malde.org/index.php/flower/ > > And PyroBayes is a 454 base caller: > > http://bioinformatics.bc.edu/marthlab/PyroBayes > > Depending on what you all end up doing, these might be useful as > comparison points, or for wrapping with Application command lines. I would say Roche's own tools are the best reference, but these only output FASTA and QUAL, not FASTQ files (at the moment at least). So yes, being able to compare a Biopython SFF to FASTQ conversion with that by Flower (or anything else) would be handy.
Peter From spenthil at gmail.com Fri May 22 19:52:30 2009 From: spenthil at gmail.com (Senthil Palanisami) Date: Fri, 22 May 2009 16:52:30 -0700 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> Message-ID: <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> I have been working with SFF files for the past month, and can say it's definitely frustrating working with custom binary formats. Take a look at sff_extract which is written in python. It converts sff files into fasta and xml or caf files: http://bioinf.comav.upv.es/sff_extract/index.html You can find detailed specs of the format @ http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global -- Senthil Palanisami http://spenthil.com On Fri, May 22, 2009 at 4:09 PM, Peter wrote: > On 5/22/09, Brad Chapman wrote: > > Peter and Jose; > > I haven't used SFF files myself as we don't have a 454 machine, > > We don't have one in house either, and have instead out-sourced to a > couple of sequencing centres in the UK with 454 machines. > > > but do know of a couple of implementations of SFF TO > > Fastq/Fasta. > > Flower is a Haskell implementation: > > > > http://blog.malde.org/index.php/flower/ > > > > And PyroBayes is a 454 base caller: > > > > http://bioinformatics.bc.edu/marthlab/PyroBayes > > > > Depending on what you all end up doing, these might be useful as > > comparison points, or for wrapping with Application command lines. 
> > I would say Roche's own tools are the best reference, but these only > output FASTA and QUAL, not FASTQ files (at the moment at least). So > yes, being able to compare a Biopython SFF to FASTQ conversion with > that by Flower (or anything else) would be handy. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Fri May 22 20:10:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 23 May 2009 01:10:57 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> Message-ID: <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> On 5/23/09, Senthil Palanisami wrote: > I have been working with SFF files for the past month, and can say it's > definitely frustrating working with custom binary formats. At least in this case it is publicly documented. Have you needed to write out (or edit) an SFF file yet? Have you used any paired end reads in SFF format? > Take a look at sff_extract which is written in python. It converts sff files > into fasta and xml or caf files: > http://bioinf.comav.upv.es/sff_extract/index.html That is what this code is based on - Jose Blanca is one of the authors of sff_extract. 
> You can find detailed specs of the format @ > http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global I think you must have missed this thread last month ;) http://lists.open-bio.org/pipermail/biopython/2009-April/005084.html Peter From bugzilla-daemon at portal.open-bio.org Fri May 22 21:16:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 21:16:54 -0400 Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database In-Reply-To: Message-ID: <200905230116.n4N1GsRl010917@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2838 ------- Comment #3 from david.wyllie at ndm.ox.ac.uk 2009-05-22 21:16 EST ------- Thank you! Unfortunately I'm not sure it's fixed, or maybe there is another problem: I have uninstalled the BioPython package using Synaptic package manager (previously I was using 1.49), downloaded from cvs checkout. Thanks for your message http://osdir.com/ml/python.bio.general/2008-07/msg00035.html I can confirm that the default ubuntu 9.0 install lacks the python-dev package, with the necessary Python.h headers. After python-dev is installed, build is OK, Tests pass running test test_Ace ... ok test_AlignIO ... ok test_BioSQL ... /var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet /home/dwyllie/biopython/build/lib.linux-x86_64-2.6/BioSQL/BioSeqDatabase.py:144: Warning: 'TYPE=storage_engine' is deprecated; use 'ENGINE=storage_engine' instead self.adaptor.cursor.execute(sql_line) ok test_BioSQL_SeqIO ... ok test_CAPS ... ok test_Clustalw ... ok .. and install is OK too. This is all new to me but it seems to work OK. 
I have checked the source code and I think your modification is correctly in place I think I have your patch in place: def _load_bioentry_date(self, record, bioentry_id): """Add the effective date of the entry into the database. record - a SeqRecord object with an annotated date bioentry_id - corresponding database identifier """ # dates are GenBank style, like: # 14-SEP-2000 date = record.annotations.get("date", strftime("%d-%b-%Y", gmtime()).upper()) if isinstance(date, list) : date = date[0] annotation_tags_id = self._get_ontology_id("Annotation Tags") date_id = self._get_term_id("date_changed", annotation_tags_id) sql = r"INSERT INTO bioentry_qualifier_value" \ r" (bioentry_id, term_id, value, rank)" \ r" VALUES (%s, %s, %s, 1)" self.adaptor.execute(sql, (bioentry_id, date_id, date)) Now when I re-run dbtestcase.py (attached previously) I get a different error message. dwyllie at dwyllie:~/programs/CheckleyProject/src$ python dbtestcase.py OK, going to recover record 28804743 from genbank.... Record loaded looks like this: ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. 
Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Load from Entrez completed, records= 1 Here is the loaded record: ======================================================================== ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Now loading these records into a BioSQL database One. 
/var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet Creating a new database One ======================================================================== Load from database One completed, records= 1 ======================================================================== Here is the record recovered from database One: Traceback (most recent call last): File "dbtestcase.py", line 165, in from dbtestcase import AuthDetails File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 182, in DemonstrateProblem(problemgi,ad) File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 138, in DemonstrateProblem print recordrecovered File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 489, in __str__ if self.letter_annotations : File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 165, in fget=lambda self : self._per_letter_annotations, AttributeError: 'DBSeqRecord' object has no attribute '_per_letter_annotations' dwyllie at dwyllie:~/programs/CheckleyProject/src$ Have I failed to install something? Unfortunately, I wasn't running off CVS before your change. Best wishes d -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
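The AttributeError in the traceback above is the kind Python raises when a property defined on a base class reads a private attribute that a subclass instance never set. A minimal sketch of that pattern follows; the class names are hypothetical stand-ins, not Biopython's actual SeqRecord or DBSeqRecord code.

```python
class Record:
    """Stand-in for a record class with a property backed by a private attribute."""

    def __init__(self):
        self._per_letter_annotations = {}

    @property
    def letter_annotations(self):
        # Reads the private attribute; fails if it was never assigned.
        return self._per_letter_annotations


class BrokenDBRecord(Record):
    def __init__(self):
        # Does not call Record.__init__, so the private attribute is never set.
        pass


class FixedDBRecord(Record):
    def __init__(self):
        # Nothing is stored per letter, so start with an empty dictionary.
        self._per_letter_annotations = {}


try:
    BrokenDBRecord().letter_annotations
except AttributeError as err:
    print("AttributeError:", err)

print(FixedDBRecord().letter_annotations)
```

Giving the subclass an empty dictionary keeps any code that inspects `letter_annotations`, such as a `__str__` method, working even when the backing store holds no per-letter data.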
From spenthil at gmail.com Fri May 22 21:48:24 2009 From: spenthil at gmail.com (Senthil Palanisami) Date: Fri, 22 May 2009 18:48:24 -0700 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> Message-ID: <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> Sorry, I only recently joined this list - should have gone through the archives first. I have done some minimal SFF tweaking, but only by first converting them to CA format. No paired end reads yet, but I do know my PI wants me to start looking at some in the next month or two. -- Senthil Palanisami http://spenthil.com On Fri, May 22, 2009 at 5:10 PM, Peter wrote: > On 5/23/09, Senthil Palanisami wrote: > > I have been working with SFF files for the past month, and can say it's > > definitely frustrating working with custom binary formats. > > At least in this case it is publicly documented. Have you needed to > write out (or edit) an SFF file yet? Have you used any paired end > reads in SFF format? > > > Take a look at sff_extract which is written in python. It converts sff > files > > into fasta and xml or caf files: > > http://bioinf.comav.upv.es/sff_extract/index.html > > That is what this code is based on - Jose Blanca is one of the authors > of sff_extract. 
> > > You can find detailed specs of the format @ > > > http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global > > I think you must have missed this thread last month ;) > http://lists.open-bio.org/pipermail/biopython/2009-April/005084.html > > Peter > From biopython at maubp.freeserve.co.uk Sat May 23 07:28:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 23 May 2009 12:28:36 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> Message-ID: <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> On Sat, May 23, 2009 at 2:48 AM, Senthil Palanisami wrote: > Sorry, I only recently joined this list - should have gone through the > archives first. Don't worry - and if I sounded grumpy, sorry - I was up late last night. > I have done some minimal SFF tweaking, but only by first converting them > to CA format. What do you mean by CA format? I don't recall seeing that abbreviation before. > No paired end reads yet, but I do know my PI wants me to start looking > at some in the next month or two. I haven't had any paired end 454 reads to work with personally, but I'm sure there are some examples available online somewhere. 
Peter From bugzilla-daemon at portal.open-bio.org Sat May 23 07:49:18 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 23 May 2009 07:49:18 -0400 Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database In-Reply-To: Message-ID: <200905231149.n4NBnIEQ023192@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2838 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-23 07:49 EST ------- (In reply to comment #3) > Thank you! > > Unfortunately I'm not sure it's fixed, or maybe there is another problem: > ... > Now when I re-run dbtestcase.py (attached previously) I get a different error > message. > ... > Traceback (most recent call last): > ... > File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 489, in > __str__ > if self.letter_annotations : > File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 165, in > > fget=lambda self : self._per_letter_annotations, > AttributeError: 'DBSeqRecord' object has no attribute '_per_letter_annotations' > dwyllie at dwyllie:~/programs/CheckleyProject/src$ > > > Have I failed to install something? No - everything looks OK, and the deprecation warnings are known about and not in Biopython anyway. > Unfortunately, I wasn't running off CVS before your change. The original problem is fixed. However, you've found a new bug in the __str__ method for the DBSeqRecord related to the fact there is no per-letter-annotation (this would have been introduced in Biopython 1.50 when I added the letter_annotations dictionary to the SeqRecord class). 
I'm a little surprised that our unit tests didn't catch this - but it's fixed now: Tests/test_BioSQL.py CVS revision 1.37 BioSQL/BioSeq.py CVS revision 1.36 Note BioSQL doesn't yet support recording anything more complicated than strings, although we've started talking about using XML or JSON for this. As a result, Biopython does not attempt to record any per-letter-annotation in the BioSQL database. With the fix the DBSeqRecord now has an empty per-letter-annotation dictionary. Before it didn't, hence the AttributeError. Hopefully you won't find any more issues, but if you do, please file another bug - I'm marking this one as fixed. Thanks for your report and time David, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From spenthil at gmail.com Sat May 23 12:11:22 2009 From: spenthil at gmail.com (Senthil Palanisami) Date: Sat, 23 May 2009 09:11:22 -0700 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> Message-ID: <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> You didn't sound particularly grumpy, I am just aware of the annoyances related to people too lazy to do a quick search through a mailing list before spamming.
I pulled 'CA' straight out of a wgs assembler program: http://apps.sourceforge.net/mediawiki/wgs-assembler/index.php?title=Formatting_Inputs#sffToCA I think 'frg' is the real file format name. -- Senthil Palanisami http://spenthil.com On Sat, May 23, 2009 at 4:28 AM, Peter wrote: > On Sat, May 23, 2009 at 2:48 AM, Senthil Palanisami > wrote: > > Sorry, I only recently joined this list - should have gone through the > > archives first. > > Don't worry - and if I sounded grumpy, sorry - I was up late last night. > > > I have done some minimal SFF tweaking, but only by first converting them > > to CA format. > > What do you mean by CA format? I don't recall seeing that abbreviation > before. > > > No paired end reads yet, but I do know my PI wants me to start looking > > at some in the next month or two. > > I haven't had any paired end 454 reads to work with personally, but I'm > sure there are some examples available online somewhere. > > Peter > From mjldehoon at yahoo.com Sun May 24 00:10:28 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 23 May 2009 21:10:28 -0700 (PDT) Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL Message-ID: <867081.50034.qm@web62404.mail.re1.yahoo.com> I suggest that for the short term, we store the DE lines as one string in the same way as Bioperl 1.5 and 1.6, until we decide on a more advanced way to treat these lines. Currently Bio.SeqIO and Bio.SwissProt use different ways to handle the DE lines, and neither of them agrees with Bioperl. --Michiel. --- On Mon, 5/18/09, Peter wrote: > From: Peter > Subject: Re: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL > To: "Hilmar Lapp" > Cc: "Chris Fields" , "BioPerl List" , "biosql-l" , biopython-dev at biopython.org > Date: Monday, May 18, 2009, 9:38 AM > On Sun, May 17, 2009 at 4:21 PM, > Hilmar Lapp > wrote: > > > > On May 17, 2009, at 8:40 AM, Peter wrote: > >> > >> [...] 
Here you have mapped RecName and AltName > fields in the DE lines to > >> Name and Synonyms (shouldn't that be Synonym > singular?). > > > > The example is for the GN lines in SwissProt, not the > DE lines. > > Ah, that probably explains some of my confusion. > > >> In this example, searching the database using one > of the SwissProt > >> AltNames (synonyms), or filtering on the Flags > sounds like a > >> reasonable request - but this would be very > difficult if the data is > >> stored inside XML strings. > > > > Actually no. Modern full-text indexers (inside or > outside the database) can > > index XML text columns right away and very well. In > fact, for the last > > project that I built a full-text search for (on top of > a BioSQL database) I > > did that by writing custom XML documents to a separate > table for each > > record I wanted indexed. Oracle's full text indexer > did the rest. I also built a > > separate identifier/name/accession index that pulled > all the gene names, > > symbols, accession numbers, identifiers etc into a > single table for > > indexing. > > OK, when I said searching "would be very difficult if the > data is > stored inside XML strings", maybe it wasn't so difficult > for you - but > that still sounds complicated! > > Sticking with the GN lines and the synonym, if this was > stored as a > simple tag/value as usual in BioSQL, I would write my SQL > statement to > search the annotation table where the term id was that > associated with > a GN synonym, and the annotation value was "HABP1". > Simple. > > Using the XML approach, are you suggesting you could do a > full text > search on the annotation value field, looking for any rows > where the > field contains "HABP1", > where the term id matches > the GN lines' XML string? This sounds simplistic and > probably rather > slow - presumably why you resorted to the more complicated > indexing > scheme described above?
> > > What I mean is, a fully normalized relational > representation, especially if > > nested, is often not the most efficient data structure > for efficient > > searching and filtering. > > OK. But do we really need to worry about complex > nested structures > for the SwissProt annotation (or in general)? > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Sun May 24 06:42:14 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 24 May 2009 11:42:14 +0100 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <867081.50034.qm@web62404.mail.re1.yahoo.com> References: <867081.50034.qm@web62404.mail.re1.yahoo.com> Message-ID: <320fb6e00905240342t7d59f783t8203cce581256f88@mail.gmail.com> On Sun, May 24, 2009 at 5:10 AM, Michiel de Hoon wrote: > > I suggest that for the short term, we store the DE lines as one > string in the same way as Bioperl 1.5 and 1.6, until we decide > on a more advanced way to treat these lines. Agreed. > Currently Bio.SeqIO and Bio.SwissProt use different ways to > handle the DE lines, and neither of them agrees with Bioperl. Well, Bio.SeqIO agrees with BioPerl modulo the white space - but we might as well agree with the current BioPerl behaviour until something is settled for storing more complex objects than strings in BioSQL. As I mentioned earlier, I'll be away for this week, so feel free to press ahead with this.
Peter From bugzilla-daemon at portal.open-bio.org Mon May 25 14:21:26 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 May 2009 14:21:26 -0400 Subject: [Biopython-dev] [Bug 2840] New: When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord fails in _load_reference Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2840 Summary: When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord fails in _load_reference Product: Biopython Version: Not Applicable Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk Hi I have been trying to load SeqRecords from BioSQL, annotate them, and then write them to a different BioSQL database. Reloading the record to the second database fails. This isn't to do with annotation - none is performed. This issue is different from #2838, which has been addressed (thank you). The sequence of events is 1) eFetch a SeqRecord from Genbank (succeeds) 2) write to BioSQL (succeeds) 3) recover from BioSQL (succeeds) 4) write to BioSQL (fails, although no modifications have been made). The current problem seems related to references: Loader.load_seqrecord._load_reference. Error says: _load_reference start = 1 + int(str(reference.location[0].start)) ValueError: invalid literal for int() with base 10: 'None' Testing has been done on Ubuntu 9 x64 with Python 2.6 (debian package), python-dev (debian package), load from CVS as of 24.5.09, and a testcase program, dbtestcase.py, attached to the now fixed bug #2838. To run dbtestcase.py, the mysql details will have to be altered on line beginning ad=AuthDetails(... but otherwise it should I think run. Traceback and program output from dbtestcase.py follow. 
dwyllie at dwyllie:~/programs/CheckleyProject/src$ python dbtestcase.py OK, going to recover record 28804743 from genbank.... Record loaded looks like this: ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Load from Entrez completed, records= 1 Here is the loaded record: ======================================================================== ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Now loading these records into a BioSQL database One. 
/var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet Creating a new database One ======================================================================== Load from database One completed, records= 1 ======================================================================== Here is the record recovered from database One: ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. Number of features: 5 /dates=['26-MAY-2005'] /ncbi_taxid=3225 /date=['26-MAY-2005'] /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon', 'Ceratodon purpureus'] /source=['chloroplast Ceratodon purpureus'] /references=[, , , ] /gi=28804743 /data_file_division=PLN /keywords=[''] /organism=Ceratodon purpureus /sequence_version=['1'] /accessions=['AB098727'] DBSeq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', DNAAlphabet()) ======================================================================== Creating a new database Two Traceback (most recent call last): File "dbtestcase.py", line 165, in from dbtestcase import AuthDetails File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 182, in DemonstrateProblem(problemgi,ad) File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 158, in DemonstrateProblem db2.load(listtoload) File "/usr/local/lib/python2.6/dist-packages/BioSQL/BioSeqDatabase.py", line 442, in load db_loader.load_seqrecord(cur_record) File "/usr/local/lib/python2.6/dist-packages/BioSQL/Loader.py", line 57, in load_seqrecord self._load_reference(reference, rank, bioentry_id) File "/usr/local/lib/python2.6/dist-packages/BioSQL/Loader.py", line 733, in _load_reference start = 1 + int(str(reference.location[0].start)) ValueError: invalid literal for int() with base 10: 'None' 
dwyllie at dwyllie:~/programs/CheckleyProject/src$ -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon May 25 14:23:52 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 May 2009 14:23:52 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200905251823.n4PINq60005295@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 david.wyllie at ndm.ox.ac.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|When a record has been |When a record has been |loaded from BioSQL, trying |loaded from BioSQL, trying |to save it to another |to save it to another |database fails with loader |database fails with loader |db_loader.load_seqrecord |db_loader.load_seqrecord in |fails in _load_reference |_load_reference -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon May 25 18:23:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 May 2009 18:23:20 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200905252223.n4PMNKL7023601@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #1 from david.wyllie at ndm.ox.ac.uk 2009-05-25 18:23 EST ------- I have modified the dbtestcase.py script to show the contents of the reference of the record downloaded from genbank, and from the record recovered from BioSQL. Here is a print out of the last two references before saving to BioSQL: authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M. title: Molecular evidence of an rpoA gene in the basal moss chloroplast genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses journal: Hikobia 14, 171-175 (2004) medline id: pubmed id: comment: location: [0:789] authors: Sugita,M. title: Direct Submission journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan (E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080), Fax:81-52-789-3080) medline id: pubmed id: comment: --- note: no location in the first one; only a location in the last reference (why? - should references have a location? I suppose they might, if they referred to a part of a chromosome?) Now, after saving to BioSQL and recovering, all the records have a location, but in some cases, it is [None:None]; here are the same two records. location: [None:None] authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M. 
title: Molecular evidence of an rpoA gene in the basal moss chloroplast genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses journal: Hikobia 14, 171-175 (2004) medline id: pubmed id: comment: location: [0:789] authors: Sugita,M. title: Direct Submission journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan (E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080), Fax:81-52-789-3080) medline id: pubmed id: comment: After this, the db.load method calls _load_reference. I think the problem is because the last line doesn't cope with none values. If one edits _load_reference to put the last reference inside a test for the null condition if (start is not None and end is not None): sql = "INSERT INTO bioentry_reference (bioentry_id, reference_id," \ " start_pos, end_pos, rank)" \ " VALUES (%s, %s, %s, %s, %s)" self.adaptor.execute(sql, (bioentry_id, reference_id, start, end, rank + 1)) Then the problem is solved, but I'm not sure how this fits in the bigger scheme of things. d -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
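The guard proposed in comment #1 can be sketched on its own, outside BioSQL. This is only a minimal illustration, not the actual BioSQL/Loader.py code: the `Location` namedtuple and the `reference_span` helper are hypothetical stand-ins, and the 0-based to 1-based conversion simply mirrors the `1 + int(...)` line shown in the traceback.

```python
from collections import namedtuple

# Hypothetical stand-in for a reference location; not Biopython's class.
Location = namedtuple("Location", ["start", "end"])


def reference_span(locations):
    """Return (start, end) to store for a reference, or (None, None).

    start is shifted from 0-based to 1-based, mirroring the
    `1 + int(...)` line shown in the traceback. A missing or
    [None:None] location yields (None, None) instead of raising.
    """
    if not locations:
        return None, None
    first = locations[0]
    if first.start is None or first.end is None:
        return None, None
    return 1 + int(first.start), int(first.end)


print(reference_span([Location(0, 789)]))      # (1, 789)
print(reference_span([Location(None, None)]))  # (None, None)
print(reference_span([]))                      # (None, None)
```

With a helper of this shape the caller can decide separately whether to skip the INSERT or to store NULL positions, which is the open question raised at the end of the comment.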
From bugzilla-daemon at portal.open-bio.org Mon May 25 18:26:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 May 2009 18:26:21 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200905252226.n4PMQK9o023893@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #2 from david.wyllie at ndm.ox.ac.uk 2009-05-25 18:26 EST ------- Created an attachment (id=1305) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1305&action=view) A program which tests for the problem. Alter the ad=AuthDetails line to include MySQl passwords for your system; using root and no password in the script as is. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon May 25 20:14:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 May 2009 20:14:40 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200905260014.n4Q0EeBh030704@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #3 from cymon.cox at gmail.com 2009-05-25 20:14 EST ------- (In reply to comment #1) > I have modified the dbtestcase.py script to show the contents of the reference > of the record downloaded from genbank, and from the record recovered from > BioSQL. > > Here is a print out of the last two references before saving to BioSQL: > > authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M. 
> title: Molecular evidence of an rpoA gene in the basal moss chloroplast > genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses > journal: Hikobia 14, 171-175 (2004) > medline id: > pubmed id: > comment: > > location: [0:789] > authors: Sugita,M. > title: Direct Submission > journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for > Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan > (E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080), > Fax:81-52-789-3080) > medline id: > pubmed id: > comment: > > --- note: no location in the first one; only a location in the last reference > (why? - should references have a location? I suppose they might, if they > referred to a part of a chromosome?) > > Now, after saving to BioSQL and recovering, all the records have a location, > but in some cases, it is [None:None]; here are the same two records. > > location: [None:None] > authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M. > title: Molecular evidence of an rpoA gene in the basal moss chloroplast > genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses > journal: Hikobia 14, 171-175 (2004) > medline id: > pubmed id: > comment: > > location: [0:789] > authors: Sugita,M. > title: Direct Submission > journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for > Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan > (E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080), > Fax:81-52-789-3080) > medline id: > pubmed id: > comment: > > > After this, the db.load method calls _load_reference. > > I think the problem is because the last line doesn't cope with none values. 
> If one edits > _load_reference to put the last reference inside a test for the null condition > > if (start is not None and end is not None): > sql = "INSERT INTO bioentry_reference (bioentry_id, reference_id," > \ > " start_pos, end_pos, rank)" \ > " VALUES (%s, %s, %s, %s, %s)" > self.adaptor.execute(sql, (bioentry_id, reference_id, > start, end, rank + 1)) > > Then the problem is solved, but I'm not sure how this fits in the bigger scheme > of things. > > d > The BioSQL loader uses None for "start" and "end" if a reference doesn't have a location. When the reference is retrieved the location remains set to ["None","None"] Try this alteration to BioSeq.py, it should solve your problem: cymon at gyra:~/git/github-master/BioSQL$ git diff BioSeq.py diff --git a/BioSQL/BioSeq.py b/BioSQL/BioSeq.py index cc47cf4..8d1e02a 100644 --- a/BioSQL/BioSeq.py +++ b/BioSQL/BioSeq.py @@ -351,8 +351,11 @@ def _retrieve_reference(adaptor, primary_id): references = [] for start, end, location, title, authors, dbname, accession in refs: reference = SeqFeature.Reference() - if start: start -= 1 - reference.location = [SeqFeature.FeatureLocation(start, end)] + if start: + start -= 1 + reference.location = [SeqFeature.FeatureLocation(start, end)] + else: + reference.location = [] #Don't replace the default "" with None. 
if authors : reference.authors = authors if title : reference.title = title Here's a patch for the unittest to compare locations of injected and retrieved records: diff --git a/Tests/test_BioSQL_SeqIO.py b/Tests/test_BioSQL_SeqIO.py index 2d8caf8..9479e02 100644 --- a/Tests/test_BioSQL_SeqIO.py +++ b/Tests/test_BioSQL_SeqIO.py @@ -360,6 +360,19 @@ def compare_records(old, new) : assert len(old.annotations[key]) == len(new.annotations[key]) for old_r, new_r in zip(old.annotations[key], new.annotations[key]) : compare_references(old_r, new_r) + for old_ref, new_ref in zip(old.annotations[key], + new.annotations[key]): + if old_ref.location == []: + assert new_ref.location == [], "old_reference.location %s !=" \ + "new_reference location %s" % (old_ref.location, + new_ref.location) + else: + assert old_ref.location[0].start == new_ref.location[0].start, \ + "old ref.location[0].start %s != new ref.location[0].start %s" % \ + (old_ref.location[0].start, new_ref.location[0].start) + assert old_ref.location[0].end == new_ref.location[0].end, \ + "old ref.location[0].end %s != new ref.location[0].end %s" % \ + (old_ref.location[0].end, new_ref.location[0].end) elif key == "comment": if isinstance(old.annotations[key], list): old_comment = [comm.replace("\n", " ") for comm in \ Cheers, Cymon -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
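The retrieval side of the fix in comment #3 can be illustrated without Biopython as well. The sketch below is an assumption-laden stand-in, not the BioSeq.py code: `row_to_location` is a hypothetical helper and plain tuples replace `FeatureLocation`. It also tests `is not None` where the patch uses a truthiness check; that is a deliberate substitution here, so that a stored position of 0 would not be read as "missing" (positions written by the loader are 1-based, so this should not normally arise).

```python
def row_to_location(start_pos, end_pos):
    """Map a (start_pos, end_pos) database row to a list of (start, end).

    Returns an empty list when the row holds no position, so a
    reference without a location comes back the way it went in.
    start_pos is shifted from 1-based back to 0-based.
    """
    if start_pos is None or end_pos is None:
        return []
    return [(start_pos - 1, end_pos)]


print(row_to_location(1, 789))        # [(0, 789)]
print(row_to_location(None, None))    # []
```

Returning an empty list for the missing case is what lets the comparison in the unit test patch (`old_ref.location == []`) hold for both the original and the retrieved record.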
From bugzilla-daemon at portal.open-bio.org Tue May 26 10:17:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 26 May 2009 10:17:48 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200905261417.n4QEHmf9007821@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #4 from cymon.cox at gmail.com 2009-05-26 10:17 EST ------- (In reply to comment #3) > (In reply to comment #1) The functions in old Tests/BioSQL_Seq.py have moved to seq_tests_common.py. So I've updated seq_tests_common: diff --git a/Tests/seq_tests_common.py b/Tests/seq_tests_common.py index d3b7fb4..392a96c 100644 --- a/Tests/seq_tests_common.py +++ b/Tests/seq_tests_common.py @@ -40,10 +40,17 @@ def compare_references(old_r, new_r) : #allow us to store a consortium. assert new_r.consrtm == "" - #TODO - reference location? - #The parser seems to give a location object (i.e. which - #nucleotides from the file is the reference for), while the - #we seem to use the database to hold the journal details (!) + # Reference location + if old_r.location == []: + assert new_r.location == [], "old_r.location %s != " \ + "new_r.location %s" % (old_r.location, new_r.location) + else: + assert old_r.location[0].start == new_r.location[0].start, \ + "old_r.location[0].start %s != new_r.location[0].start %s" % \ + (old_r.location[0].start, new_r.location[0].start) + assert old_r.location[0].end == new_r.location[0].end, \ + "old_r.location[0].end %s != new_r.location[0].end %s" % \ + (old_r.location[0].end, new_r.location[0].end) return True Pushed to http://github.com/cymon/biopython-github-master/tree/bug2840 C.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 26 13:32:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 26 May 2009 13:32:34 -0400 Subject: [Biopython-dev] [Bug 2841] New: SeqFeature constructor ignores qualifiers and sub_features arguments Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2841 Summary: SeqFeature constructor ignores qualifiers and sub_features arguments Product: Biopython Version: 1.50 Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: n.j.loman at bham.ac.uk The constructor to Bio.SeqFeature.SeqFeature ignores qualifiers and sub_features, although the prototype to the constructor allows these keyword arguments to be specified. I see in the code there is a reason for it to be ignored: # XXX right now sub_features and qualifiers cannot be set # from the initializer because this causes all kinds # of recursive import problems. I can't understand why this is # at all :-< self.qualifiers = {} self.sub_features = [] However, would it not be better to get rid of the keyword arguments from the constructor prototype to stop people getting confused? I keep stumbling over this problem myself and forgetting about it. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
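For illustration, here is one way a constructor could honour `qualifiers` and `sub_features` instead of silently discarding them. This is a sketch under stated assumptions: `Feature` is a hypothetical class, not Bio.SeqFeature.SeqFeature, and the sketch says nothing about the recursive import problem mentioned in the source comment.

```python
class Feature:
    """Hypothetical stand-in for a sequence feature (not Bio.SeqFeature)."""

    def __init__(self, location=None, type="", qualifiers=None,
                 sub_features=None):
        self.location = location
        self.type = type
        # None defaults avoid sharing one mutable dict or list between
        # instances; passed-in containers are copied, not aliased.
        self.qualifiers = {} if qualifiers is None else dict(qualifiers)
        self.sub_features = [] if sub_features is None else list(sub_features)


gene = Feature(type="gene", qualifiers={"gene": ["rps11"]})
plain = Feature(type="CDS")
print(gene.qualifiers)   # {'gene': ['rps11']}
print(plain.qualifiers)  # {}
```

The alternative raised in the report, dropping the two keyword arguments from the signature, would make the current behaviour explicit rather than changing it.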
From bugzilla-daemon at portal.open-bio.org Wed May 27 03:57:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 27 May 2009 03:57:05 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200905270757.n4R7v5iv004300@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #5 from david.wyllie at ndm.ox.ac.uk 2009-05-27 03:57 EST ------- Thank you very much! I haven't tested the unit tests but the patch in #3 resolves the problem. With best wishes -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Sat May 30 05:37:35 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 30 May 2009 02:37:35 -0700 (PDT) Subject: [Biopython-dev] More SwissProt inconsistencies Message-ID: <880385.97797.qm@web62401.mail.re1.yahoo.com> Looking some more at how Bio.SeqIO and Bio.SwissProt store the information in a SwissProt file, I found the following two inconsistencies: 1) A multi-line author list such as the following: RA Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W., RA Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M., RA Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N., RA Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F., RA Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F., RA Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E., RA Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R., RA Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E., RA Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A., RA Barrell B.G., Hall N.; is stored without newlines by 
Bio.SeqIO: >>> seq_record.annotations['references'][0].authors "Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,Barrell B.G., Hall N.;" but with newlines by Bio.SwissProt: >>> swiss_record.references[0].authors "Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,\nKerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,\nCoulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,\nGardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,\nLarke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,\nNene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,\nRawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,\nSquares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,\nLangsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,\nBarrell B.G., Hall N.;" To me, the Bio.SeqIO approach seems more reasonable. I think we should add a space though at places where there is a newline in the file. The same happens for multiline RL such as RL (In) Baker M.J., Crush J.R., Humphreys L.R. (eds.); RL Proceedings of the XVII international grassland congress, RL pp.2:1033-1034, Dunmore Press, Palmerston North (1993). and for multiline RT lines such as RT "Genome of the host-cell transforming parasite Theileria annulata RT compared with T. parva."; This is stored by Bio.SeqIO as '"Genome of the host-cell transforming parasite Theileria annulatacompared with T. 
parva.";' and by Bio.SwissProt as '"Genome of the host-cell transforming parasite Theileria annulata\ncompared with T. parva.";' whereas I think that both should be stored as '"Genome of the host-cell transforming parasite Theileria annulata compared with T. parva.";' 2) Comments in a references such as the following: RC STRAIN=cv. VF36; TISSUE=Anther; are stored as a single string by Bio.SeqIO: >>> seq_record.annotations['references'][i].comment 'STRAIN=cv. VF36; TISSUE=Anther;' but as a list of (key, value) pairs by Bio.SwissProt: [('STRAIN', 'cv. VF36'), ('TISSUE', 'Anther')] Whereas I think both are reasonable, Bio.SeqIO drops the space between two (key, value) pairs if they are on two separate lines: RC STRAIN=C57BL/6J; RC TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex; is stored as >>> seq_record.annotations['references'][i].comment 'STRAIN=C57BL/6J;TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;' I think we should add a space here, or just store these as (key, value) pairs as Bio.SwissProt is doing. Any objections or comments? --Michiel From chapmanb at 50mail.com Fri May 1 12:11:25 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 1 May 2009 08:11:25 -0400 Subject: [Biopython-dev] MUMmer In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> Message-ID: <20090501121125.GD50777@sobchak.mgh.harvard.edu> Marcin; > I guess I should start with a nice 'hi' to everybody, now that I am > sending my first message to this group. So: Hi, Everybody! Welcome. We are happy to have you. > Now, that we have the formality out of the way, I will get to the point. > Recently, I have written some Python code for parsing and processing the > output of MUMmer tool (http://mummer.sourceforge.net/). 
More > specifically, the code I have manages invocations and handles outputs of > the nucmer pipeline (alignment of multiple closely related nucleotide > sequences) and of mummer itself (short exact matches). Obviously, the > results are ultimately rendered as pairs of biopython's Seq objects. This is great -- we don't have support for MUMmer alignments so this is very welcome. > I use this stuff only myself, in work on bacterial genomes, but I would > be more than willing to contribute it to the project. It may be rough > around the edges at the moment, but I think I could easily give it the > necessary polish if there is interest in having it included. As Bartek mentioned, the first step is to organize the code you have and start it as a branch on GitHub. Being able to see the code will help us make specific suggestions. Generally, based on what you've written it sounds like this will fit into the alignment interfaces. Peter and Cymon have been working on organizing this. Support for command lines and running programs lives in: http://github.com/biopython/biopython/tree/master/Bio/Align/Applications Parsing output and returning alignment objects is organized in the AlignIO module: http://github.com/biopython/biopython/tree/master/Bio/AlignIO http://www.biopython.org/wiki/AlignIO Tests are an important part of the submission process and many examples are found here: http://github.com/biopython/biopython/tree/master/Tests test_Clustalw.py is an example of a print and compare style test, and test_Mafft_tool.py is a unittest style test. We are more concerned with good testing coverage than how exactly the tests get written. We can definitely help with more specific feedback but hopefully this gives you a general idea to get started.
Looking forward to seeing the code, Brad From chapmanb at 50mail.com Fri May 1 12:28:06 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 1 May 2009 08:28:06 -0400 Subject: [Biopython-dev] XML parsing library for new modules In-Reply-To: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com> References: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com> Message-ID: <20090501122806.GE50777@sobchak.mgh.harvard.edu> Eric; Thanks for summarizing the issues. I know Peter is taking a few well deserved days off but I suspect he will have some thoughts when he returns. We'd love to hear the experience of others who have used different python XML parsers. My lean is towards ElementTree for reasons of code clarity. SAX parsers require a lot of boilerplate style code. They also can be tricky with nested elements; I always find myself using a lot of "if in_tag; else if in_tag" style code. ElementTree eliminates a lot of these issues which should result in easier to maintain code. Brad > I'm writing a parser for the PhyloXML format for Google Summer of Code this > year, and as the name would imply, it requires parsing some large XML files. > The existing modules in Biopython for parsing XML formats seem to use > xml.sax in the standard library. In Python 2.5, a faster and more Pythonic > parser was added to the standard lib: ElementTree (xml.etree), in > pure-Python and C-enhanced flavors. How do you feel about each of these > libraries as the basis for a new Biopython module? > > Here are some interesting benchmarks: > http://effbot.org/zone/celementtree.htm#benchmarks > > The ElementTree library is also available as a standalone package, > compatible back to Python 2.1, and the lxml package also offers an > independent implementation. 
So maintaining compatibility with Python 2.4 > would require the availability of one of these third-party packages, and my > code would try each of these imports in order: > > from xml.etree import cElementTree as ElementTree > from xml.etree import ElementTree > # Separate lxml package > from lxml.etree import ElementTree > # Standalone elementtree package > import cElementTree as ElementTree > from elementtree import ElementTree > > Then one day, when Python 2.4 is no longer supported, only the first two > lines would be needed. (The second line is for sites that disable C > extensions, like Google App Engine, or alternate Python implementations like > Jython.) > > Another option is xml.parsers.expat, but just Googling around, it appears > that the Python zeitgeist is strongly in favor of xml.etree for new code. > > Thoughts? > > Thanks, > Eric > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From marcin.swiatek at mail.mcgill.ca Fri May 1 18:17:14 2009 From: marcin.swiatek at mail.mcgill.ca (Marcin Swiatek) Date: Fri, 1 May 2009 14:17:14 -0400 Subject: [Biopython-dev] MUMmer In-Reply-To: <8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com> References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> <8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com> Message-ID: <176A06E658ED0745965C072C5F2C116A037F084C@EXCHANGE2VS2.campus.mcgill.ca> Bartek, Brad, Thank you for the suggestions. I will set myself up as proposed and see what I can do to align my code with local customs and traditions. If questions arise, I will post again. As for the use of alignment object, I have actually chosen to represent 'candidate' matches by my own simplistic class. Nucmer, the way I use it, generates lots of spurious matches, which I always need to somehow filter. 
Thus, it seemed perfectly reasonable at the time to create the proper representation of alignment later on, in a separate function call. Following your suggestion I will probably change it to return an alignment object, rather than a pair of sequences. But details are best discussed once the code is available, so I think we will return to this matter later. Regards, Marcin -----Original Message----- From: barwil at gmail.com [mailto:barwil at gmail.com] On Behalf Of Bartek Wilczynski Sent: Thursday, April 30, 2009 12:51 PM To: Marcin Swiatek Cc: biopython-dev at biopython.org Subject: Re: [Biopython-dev] MUMmer Hi Marcin, On Thu, Apr 30, 2009 at 5:23 PM, Marcin Swiatek wrote: > Hello, > > > > I use this stuff only myself, in work on bacterial genomes, but I would > be more than willing to contribute it to the project. It may be rough > around the edges at the moment, but I think I could easily give it the > necessary polish if there is interest in having it included. > Contributions are always welcome. > > > Should that be the case, could one of the project leads point me in the > right direction, please? How should I go about the submission? > > I don't think I qualify as a lead, but nonetheless I think I can help here. I think that the best way to submit your code currently is to create a branch (fork) of biopython on github and submit your changes there and then notify people on biopython-dev that there is new code to review. You can also submit an enhancement bug to bugzilla. There are a couple of wiki pages which might be of interest to you: - http://biopython.org/wiki/Contributing - http://biopython.org/wiki/GitUsage If you have any questions or problems during the process, ask on the list. As for the code, I'm not sure, but maybe instead of returning a pair of sequences, an alignment object might be a better choice?
You might want to also check out a recent code on application wrappers: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html cheers Bartek From bugzilla-daemon at portal.open-bio.org Fri May 1 18:16:57 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 1 May 2009 14:16:57 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200905011816.n41IGvXO012709@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #8 from eric.talevich at gmail.com 2009-05-01 14:16 EST ------- (In reply to comment #7) > (In reply to comment #2) > > Python 2.6 includes a context manager that makes all these problems > > *completely* go away, by catching all of the warnings raised within a > > context and optionally storing them as a list of warning objects that > > can be inspected. > > That sounds much better :) > > > Would you be interested in having a unit test that does a more thorough > > check of the warnings system, but only runs on Py2.6? I'm guessing no, > > but hey, worth a shot. > > Yes - other than using the old print-and-compare test, this seems worth doing > in order to actually test the warnings we expect are being issued. It could be > a whole new file, test_PDB_warnings.py which required Python 2.6+, but as its > just one or two tests, maybe just use conditional method(s) within the > test_PDB_unit.py file. > > Peter > I have something that works on both Py2.5 and Py2.6 now: http://github.com/etal/biopython/tree/pdbtidy I added a new file called _PDB_extra.py which test_PDB_unit.py imports if an attribute called 'catch_warnings' is available in the current warnings module. If so, the method test_warnings is added to the class, otherwise nothing happens. So Py2.6 runs 9 tests in test_PDB_unit.py, while Py2.5 only runs 8. 
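The conditional-test idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the pdbtidy branch: DemoTest and _check_warnings are hypothetical names, and the real change keeps the extra test in a separate file (_PDB_extra.py) rather than inline.

```python
import unittest
import warnings


class DemoTest(unittest.TestCase):
    """Hypothetical stand-in for a TestCase in test_PDB_unit.py."""

    def test_basic(self):
        # A test that runs on every supported Python version.
        self.assertEqual(1 + 1, 2)


def _check_warnings(self):
    # Record every warning raised inside the block instead of printing it.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        warnings.warn("example warning", UserWarning)
    self.assertEqual(len(caught), 1)
    self.assertTrue(issubclass(caught[0].category, UserWarning))


# Only attach the extra test when the interpreter offers catch_warnings
# (Python 2.6 and later); older versions simply run one test fewer.
if hasattr(warnings, "catch_warnings"):
    DemoTest.test_warnings = _check_warnings
```

With this arrangement the test count differs between interpreter versions, which matches the 9 versus 8 tests mentioned above.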
This seemed easier than creating a whole separate unittest suite for one tricky test, but I defer to you on the organization and naming. I think I'll need to do a similar separation of tests for PhyloXML, so I'd like to have a consistent pattern to follow here. Also, apparently tests are run in alphabetical order, and Exposure was jumping ahead of PDBExceptionTest. I renamed PDBExceptionTest to ExceptionTest to restore the natural order of things and stop setting off the warnings prematurely. Maybe test suites with multiple TestCase classes should be arranged alphabetically in the code to avoid confusion in the future. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon May 4 10:57:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 06:57:33 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041057.n44AvXil006684@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1288 is|0 |1 obsolete| | ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 06:57 EST ------- Created an attachment (id=1289) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1289&action=view) Patch to add keyword arguments and properties to command line wrappers Brad likes the idea, and as the Bio.Application module owner that's good :) http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005963.html This patch makes a very slight difference to reduce the changes needed to old code (i.e. in the __init__ method use self.parameters = [...] 
as before) with the bonus that the base class and subclasses have the same __init__ signature (argument list). This patch also now covers Bio.Align.Applications, Bio.Motif.Applications and Bio.AlignAce.Applications as well as Bio.Emboss.Applications (i.e. all affected files). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Mon May 4 12:02:59 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 May 2009 13:02:59 +0100 Subject: [Biopython-dev] MUMmer In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> Message-ID: <320fb6e00905040502y4785a0f9t4475ab0868a791c@mail.gmail.com> On Thu, Apr 30, 2009 at 4:23 PM, Marcin Swiatek wrote: > Hello, > > I guess I should start with a nice 'hi' to everybody, now that I am > sending my first message to this group. So: Hi, Everybody! Hi! > Now, that we have the formality out of the way, I will get to the point. > Recently, I have written some Python code for parsing and processing the > output of MUMmer tool (http://mummer.sourceforge.net/). More > specifically, the code I have manages invocations and handles outputs of > the nucmer pipeline (alignment of multiple closely related nucleotide > sequences) and of mummer itself (short exact matches). Obviously, the > results are ultimately rendered as pairs of biopython's Seq objects. > > I use this stuff only myself, in work on bacterial genomes, but I would > be more than willing to contribute it to the project. It may be rough > around the edges at the moment, but I think I could easily give it the > necessary polish if there is interest in having it included. Great! 
I assume you're OK with our licence, and there are no problems from your employer/University with a contribution like this? > Should that be the case, could one of the project leads point me in the > right direction, please? How should I go about the submission? In terms of showing us the code, how do you feel about trying out github (see Bartek's email)? Alternatively file an enhancement bug on our bugzilla and upload your current python file (or a zip file if this is split up into several modules). From your description above it sounds like you have two main lumps of code: a pairwise alignment parser, and some command line tool wrappers. Brad and Bartek have already mentioned returning Alignment objects; that would let us integrate MUMmer as an input format for Bio.AlignIO, http://biopython.org/wiki/AlignIO It may be helpful to have a look at how we parse FASTA output into pairwise alignments, and also the EMBOSS "pairs" files from needle and water. Although (as Brad mentioned), this is currently undergoing a little flux, for the command line wrappers I'd like this to use our Bio.Application framework to represent the command line object, giving a string the user can then invoke as they prefer. Having the MUMmer wrapper under Bio.Align.Applications seems sensible at this point. If you have been lurking on the dev mailing list for a while, these topics may be familiar already.
If not, have a look over the last month or so in the archives here: http://lists.open-bio.org/pipermail/biopython-dev/ Thanks, Peter From p.j.a.cock at googlemail.com Mon May 4 12:15:04 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 May 2009 13:15:04 +0100 Subject: [Biopython-dev] XML parsing library for new modules In-Reply-To: <20090501122806.GE50777@sobchak.mgh.harvard.edu> References: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com> <20090501122806.GE50777@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com> On Fri, May 1, 2009 at 1:28 PM, Brad Chapman wrote: > Eric; > Thanks for summarizing the issues. I know Peter is taking a few well > deserved days off but I suspect he will have some thoughts when he > returns. We'd love to hear the experience of others who have used > different python XML parsers. I would be interested to hear Michiel's views on this, as he knows more about the specifics of the existing XML parsers in Biopython (e.g. Bio.Entrez). > My lean is towards ElementTree for reasons of code clarity. SAX > parsers require a lot of boilerplate style code. They also can be > tricky with nested elements; I always find myself using a lot of "if > in_tag; else if in_tag" style code. ElementTree eliminates a lot of > these issues which should result in easier to maintain code. We have been trying to avoid external library dependencies where possible (moving away from Martel for parsing has really helped here). Given ElementTree and cElementTree are included with Python 2.5+, this is only an issue for Biopython running on Python 2.4. Both ElementTree and cElementTree are available as separate downloads (with Windows installers). I think under their licence we could even bundle it with Biopython if need be. So, while it is a shame ElementTree isn't part of Python 2.4, if it is the best technical solution, that shouldn't stop us from using it. 
Note we should ONLY use those core features which are included with Python 2.5+ itself. Peter P.S. I wonder if our BLAST XML parser would get a big speed boost if we switched it to ElementTree instead of xml.sax? From bugzilla-daemon at portal.open-bio.org Mon May 4 13:47:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 09:47:25 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041347.n44DlPQD018238@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1289 is|0 |1 obsolete| | ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 09:47 EST ------- Created an attachment (id=1290) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1290&action=view) Patch to add keyword arguments, properties and __repr__ to command line wrappers Extended to include __repr__ support (using the new keyword arguments support). Note that the Muscle wrapper will need an alternative python valid identifier for the -in argument, e.g. "input", because we can't use just "in" as a property or keyword argument. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
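To illustrate the "-in" naming problem from comment #5, here is a minimal sketch of how an option name could be mapped to a name usable as a property or keyword argument. It is not taken from the patch: ALIASES and property_name are hypothetical names, and the only alias actually proposed in the thread is "input" for "-in".

```python
import keyword

# Hypothetical alias table; "in" -> "input" is the one case raised in the bug.
ALIASES = {"in": "input"}


def property_name(option):
    """Return a Python-safe property name for a command line option."""
    name = option.lstrip("-")
    if name in ALIASES:
        return ALIASES[name]
    # Reserved words such as "in" pass str.isidentifier(), so they need
    # the separate keyword check.
    if keyword.iskeyword(name) or not name.isidentifier():
        raise ValueError("%r needs an explicit alias" % option)
    return name


print(property_name("-in"))
print(property_name("-gapopen"))
```

Raising an error for unmapped reserved words, rather than guessing a name, keeps every alias an explicit and documented decision.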
From bugzilla-daemon at portal.open-bio.org Mon May 4 14:07:57 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 10:07:57 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041407.n44E7vI9020041@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1290 is|0 |1 obsolete| | ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 10:07 EST ------- Created an attachment (id=1291) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1291&action=view) Patch to add keyword arguments, properties and __repr__ to command line wrappers As in previous patch but with support for clearing parameters by "deleting" the property, and some basic doctests in Bio.Application. Still need to co-ordinate with Cymon to give the Muscle wrapper a valid python identifier as an alias for the -in argument, e.g. "input", because we can't use just "in" as a property or keyword argument. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon May 4 14:48:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 4 May 2009 15:48:53 +0100 Subject: [Biopython-dev] Properties in Bio.Application interface? 
In-Reply-To: <20090430120532.GA50777@sobchak.mgh.harvard.edu> References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com> <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com> <320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com> <20090430120532.GA50777@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00905040748w7a0b940aub82220b9c78e7dc3@mail.gmail.com> On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman wrote: > I love what you are doing here. The keywords and properties make > it much more Pythonic; the old way reeks of Java-style get/sets. My > vote is to put them both in. Cool - I was hoping people would agree it is more pythonic. I have some follow up thoughts, or points for discussion ... Peter From biopython at maubp.freeserve.co.uk Mon May 4 14:53:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 4 May 2009 15:53:37 +0100 Subject: [Biopython-dev] Properties names in command line wrappers Message-ID: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> On Mon, May 4, 2009 at 3:48 PM, Peter wrote: > On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman wrote: >> I love what you are doing here. The keywords and properties make >> it much more Pythonic; the old way reeks of Java-style get/sets. My >> vote is to put them both in. > > Cool - I was hoping people would agree it is more pythonic. > > I have some follow up thoughts, or points for discussion ... > I updated the patch on Bug 2822 to cover all the Bio.Application command line wrapper subclasses, and included __repr__ support. However, that has raised a real example of a parameter where the current "human readable" name is not a valid python identifier ("in", for "-in" in Muscle). I think the pragmatic solution is to add a sensible alternative which we can use for the property and keyword argument name (e.g. "input" in this case) while in general keeping these names as close as possible to the actual parameter name as used at the command line. 
On the other hand, some might argue for giving all the options meaningful names. The (hardly used) existing blastall wrapper in Bio/Blast/Applications.py gives the "-a" argument a human readable name of "nprocessors", and "-A" gets "window_size". With the old set_parameter call either alias could be used. However, with a python property we need to pick one as a preferred name - and I'm not 100% sure being helpful and using "nprocessors" (e.g. cline.nprocessors=4) is actually better than using the actual argument name (e.g. cline.a = 4). My instinct is that these are low level wrappers, which don't try to second guess the user. To take full advantage of any command line tool you will need to read the tool's documentation to know what the arguments are - and having Biopython making up its own aliases just makes things more complicated. Therefore I think the property names in the command line wrapper objects should be as close as possible to the actual command line arguments. In this case, for blastall use "a" for number of processors and "A" for window size. However, I see the existing "helper functions" in Bio/Blast/NCBIStandalone.py as a higher level wrapper, which tries to insulate the user from the precise details of the command line string, and here using an argument name "nprocessors" makes more sense (although again, it differs from the actual command line making cross referencing to the NCBI documentation more difficult). What are your thoughts Brad? Peter From biopython at maubp.freeserve.co.uk Mon May 4 15:03:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 4 May 2009 16:03:17 +0100 Subject: [Biopython-dev] Switches in the Bio.Application interface Message-ID: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com> On Mon, May 4, 2009 at 3:48 PM, Peter wrote: > On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman wrote: >> I love what you are doing here. 
The keywords and properties make >> it much more Pythonic; the old way reeks of Java-style get/sets. My >> vote is to put them both in. > > Cool - I was hoping people would agree it is more pythonic. > > I have some follow up thoughts, or points for discussion ... > > Peter > It seems sensible to me to allow "deleting" a property to clear it. There is an example in the proposed Bio/Application/__init__.py docstring of how this would work:

>>> from Bio.Emboss.Applications import WaterCommandline
>>> cline = WaterCommandline(gapopen=10, gapextend=0.5)
>>> cline
WaterCommandline(cmd='water', gapopen=10, gapextend=0.5)

You can also manipulate the parameters via their properties, e.g.

>>> cline.gapopen
10
>>> cline.gapopen = 20
>>> cline
WaterCommandline(cmd='water', gapopen=20, gapextend=0.5)

You can clear a parameter you have already added by 'deleting' the corresponding property:

>>> del cline.gapopen
>>> cline.gapopen
>>> cline
WaterCommandline(cmd='water', gapextend=0.5)

That does seem to work and covers most situations, however there is a special case of command line "switches" (arguments which don't take a value, like -kimura in ClustalW, or -l in ls). There are a lot of these cases in Cymon's new alignment wrappers. These worked OK when used with set_parameter("kimura"), the value is omitted and defaults to None. Using the current patch, to set this via the keyword argument or property, it must explicitly be set to None, which is ugly:

>>> from Bio.Align.Applications import ClustalwCommandline
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=None, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura

For these "switch" arguments, perhaps the value should be interpreted as a boolean (should the switch be added or not?).
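A minimal sketch of that boolean interpretation, written independently of the actual Bio.Application code: render_option and build_command are hypothetical names, and treating None the same as False (omit the option) is an assumption of this sketch rather than the behaviour of the current patch.

```python
def render_option(name, value):
    """Return the command line fragment for one option (hypothetical helper)."""
    if value is True:
        return "-%s" % name              # switch requested: emit the bare flag
    if value is False or value is None:
        return ""                        # switch off, or option never set
    return "-%s=%s" % (name, value)      # ordinary option carrying a value


def build_command(cmd, options):
    """Join the program name and all non-empty option fragments."""
    parts = [cmd]
    for name, value in options:
        fragment = render_option(name, value)
        if fragment:
            parts.append(fragment)
    return " ".join(parts)


print(build_command("clustalw", [("infile", "demo.fasta"), ("gapopen", 2),
                                 ("gapext", 0.5), ("kimura", True)]))
```

The identity checks (is True, is False) are deliberate: a numeric value of 0, such as a gap open penalty of 0, is still written out as an ordinary option instead of being mistaken for a disabled switch.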
This would be a change to the current API, but I don't think any of the existing wrappers actually have this kind of parameter, so there shouldn't be a backwards compatibility issue here. Instead I want to do this:

>>> from Bio.Align.Applications import ClustalwCommandline
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=True, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=False, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5

An example use case is to allow parameter searches e.g.

from Bio.Align.Applications import ClustalwCommandline
for gap_open in [0, 1, 2, 10] :
    for gap_extend in [0, 0.25, 0.5] :
        for use_kimura in [True, False] :
            #Won't work yet!:
            cline = ClustalwCommandline(gapopen=gap_open, gapext=gap_extend,
                                        kimura=use_kimura, infile="demo.fasta")
            print cline

Or, modifying and reusing a single command line wrapper object:

from Bio.Align.Applications import ClustalwCommandline
#Set standard options:
cline = ClustalwCommandline(infile="demo.fasta")
#Do parameter sweep:
for gap_open in [0, 1, 2, 10] :
    cline.gapopen = gap_open
    for gap_extend in [0, 0.25, 0.5] :
        cline.gapext = gap_extend
        for use_kimura in [True, False] :
            cline.kimura = use_kimura #Won't work yet!
            print cline

Peter

From bugzilla-daemon at portal.open-bio.org Mon May 4 15:29:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 11:29:33 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041529.n44FTXr9025530@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 ------- Comment #7 from cymon.cox at gmail.com 2009-05-04 11:29 EST ------- (In reply to comment #6) > Created an attachment (id=1291) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1291&action=view) [details] > Patch to add keyword arguments, properties and __repr__ to command line > wrappers > > As in previous patch but with support for clearing parameters by "deleting" the > property, and some basic doctests in Bio.Application. > > Still need to co-ordinate with Cymon to give the Muscle wrapper a valid python > identifier as an alias for the -in argument, e.g. "input", because we can't use > just "in" as a property or keyword argument. "input" for -in and maybe also "input1" "input2" as alternatives for -in1 -in2, might be the way to go, and document it. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Mon May 4 15:25:17 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 4 May 2009 08:25:17 -0700 (PDT) Subject: [Biopython-dev] XML parsing library for new modules In-Reply-To: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com> Message-ID: <3493.66471.qm@web62406.mail.re1.yahoo.com> --- On Mon, 5/4/09, Peter Cock wrote: > > My lean is towards ElementTree for reasons of code > clarity. SAX > > parsers require a lot of boilerplate style code.
They > also can be > > tricky with nested elements; I always find myself > using a lot of "if > > in_tag; else if in_tag" style code. ElementTree > eliminates a lot of > > these issues which should result in easier to maintain > code. This is partially true. SAX parsers can be complicated, but with some dedication reasonably clear code is also possible. The SAX parser in Bio.Entrez is not all that bad, and it can handle all kinds of different XML pages as long as a DTD is available. The prime motivation for ElementTree is that it's mutable; I don't know if that is really needed in this case. Another thing to consider is what to do with the result returned by ElementTree. Whereas it will contain all the information in the XML file, it may not represent it in a user-friendly way. You may want to take the output from ElementTree and store it in a more biopython-like object. Also keep in mind memory usage: ElementTree will keep the complete XML file in memory, whereas the SAX parser gives you more flexibility here (see below). That said, I don't have any fundamental objections against using ElementTree. > > We have been trying to avoid external library dependencies > where > possible (moving away from Martel for parsing has really > helped here). > Given ElementTree and cElementTree are included with Python > 2.5+, > this is only an issue for Biopython running on Python 2.4. I think it's OK to require Python 2.5 or later for Biopython. > P.S. I wonder if our BLAST XML parser would get a big speed > boost if we switched it to ElementTree instead of xml.sax? I doubt it, since the SAX parser is pretty straightforward -- the hard part is to go through the DTD and find out how to interpret each element in the XML (this is not time-consuming though). The key point though is memory usage. With the SAX parser, you can parse the XML file in chunks, and use an iterator to return individual Blast records -- you don't need to keep the full XML file in memory. 
The Blast parser NCBIXML.parse does exactly that. With ElementTree, as far as I understand you read in the full XML file and keep it in memory. --Michiel. From cy at cymon.org Mon May 4 15:34:52 2009 From: cy at cymon.org (Cymon Cox) Date: Mon, 4 May 2009 16:34:52 +0100 Subject: [Biopython-dev] Switches in the Bio.Application interface In-Reply-To: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com> References: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com> Message-ID: <7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com> 2009/5/4 Peter > On Mon, May 4, 2009 at 3:48 PM, Peter > wrote: > > That does seem to work and covers most situation, however there is a > special case of command line "switches" (arguments which don't take an > argument, like -kimura in ClustalW, or -l in ls). There are a lot of > these cases in Cymon's new alignment wrappers. These worked OK when > used with set_parameter("kimura"), the value is omitted and defaults > to None. Using the current patch, to set this via the keyword > argument or property, it must explicitly be set to None, which is > ugly: > > >>> from Bio.Align.Applications import ClustalwCommandline > >>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta") > clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 > >>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=None, > infile="demo.fasta") > clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura Ugly, and very confusing. > For these "switch" arguments, perhaps the value should be interpreted > as a boolean (should the switch be added or not?). This is what i did in my Muscle helper functions - so makes sense to me... C. -- ____________________________________________________________________ Cymon J. 
Cox Centro de Ciencias do Mar Faculdade de Ciencias do Mar e Ambiente (FCMA) Universidade do Algarve Campus de Gambelas 8005-139 Faro Portugal Phone: +0351 289800909 ext 7909 Fax: +0351 289800051 Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com HomePage : http://biology.duke.edu/bryology/cymon.html -8.63/-6.77 From p.j.a.cock at googlemail.com Mon May 4 15:45:12 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 May 2009 16:45:12 +0100 Subject: [Biopython-dev] XML parsing library for new modules In-Reply-To: <3493.66471.qm@web62406.mail.re1.yahoo.com> References: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com> <3493.66471.qm@web62406.mail.re1.yahoo.com> Message-ID: <320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com> Brad wrote: >>> My lean[ing] is towards ElementTree for reasons of code >>> clarity. SAX parsers require a lot of boilerplate style code. >>> They also can be tricky with nested elements; I always >>> find myself using a lot of "if in_tag; else if in_tag" style >>> code. ElementTree eliminates a lot of these issues >>> which should result in easier to maintain code. Michiel wrote: > This is partially true. SAX parsers can be complicated, but > with some dedication reasonably clear code is also possible. > The SAX parser in Bio.Entrez is not all that bad, and it can > handle all kinds of different XML pages as long as a DTD > is available. The prime motivation for ElementTree is that > it's mutable; I don't know if that is really needed in this case. Eric will have to answer that regarding PhyloXML, but if the aim is to turn it into one of our existing tree objects, then having the XML structure mutable is irrelevant. > Another thing to consider is what to do with the result > returned by ElementTree. Whereas it will contain all the > information in the XML file, it may not represent it in a > user-friendly way. 
You may want to take the output from > ElementTree and store it in a more biopython-like object. > Also keep in mind memory usage: ElementTree will keep > the complete XML file in memory, whereas the SAX > parser gives you more flexibility here (see below). Something for Eric to consider. Michiel wrote: > That said, I don't have any fundamental objections > against using ElementTree. Peter wrote: >> We have been trying to avoid external library dependencies >> where possible (moving away from Martel for parsing has >> really helped here). Given ElementTree and cElementTree >> are included with Python 2.5+, this is only an issue for >> Biopython running on Python 2.4. > > I think it's OK to require Python 2.5 or later for Biopython. At this stage I disagree: Python 2.4 would still be widely used on production servers running stable distributions. Also we'd have to give a couple of releases notice about dropping Python 2.4 support. In any case, if we want to use ElementTree with Python 2.4 this is possible. Peter wrote: >> P.S. I wonder if our BLAST XML parser would get a big speed >> boost if we switched it to ElementTree instead of xml.sax? > > I doubt it, since the SAX parser is pretty straightforward -- > the hard part is to go through the DTD and find out how to > interpret each element in the XML (this is not > time-consuming though). The key point though is memory > usage. With the SAX parser, you can parse the XML file in > chunks, and use an iterator to return individual Blast records > -- you don't need to keep the full XML file in memory. The > Blast parser NCBIXML.parse does exactly that. With > ElementTree, as far as I understand you read in the full > XML file and keep it in memory. Keeping a full BLAST XML file in memory would be a bad idea, and would spoil the memory savings of the iterator approach to parsing it.
So ElementTree isn't suitable for everything ;) Peter From biopython at maubp.freeserve.co.uk Mon May 4 15:47:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 4 May 2009 16:47:58 +0100 Subject: [Biopython-dev] Switches in the Bio.Application interface In-Reply-To: <7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com> References: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com> <7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com> Message-ID: <320fb6e00905040847s32bc9e4fr3f7fb045b2d3429b@mail.gmail.com> On Mon, May 4, 2009 at 4:34 PM, Cymon Cox wrote: > >> For these "switch" arguments, perhaps the value should be interpreted >> as a boolean (should the switch be added or not?). > > This is what i did in my Muscle helper functions - so makes sense to me... > Good :) Peter From bugzilla-daemon at portal.open-bio.org Mon May 4 16:29:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 12:29:10 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041629.n44GTAeq030521@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1291 is|0 |1 obsolete| | ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 12:29 EST ------- (From update of attachment 1291) Checked into CVS: Checking in Tests/test_Prank_tool.py; /home/repository/biopython/biopython/Tests/test_Prank_tool.py,v <-- test_Prank_tool.py new revision: 1.5; previous revision: 1.4 done Checking in Tests/test_Muscle_tool.py; /home/repository/biopython/biopython/Tests/test_Muscle_tool.py,v <-- test_Muscle_tool.py new revision: 1.7; previous revision: 1.6 done Checking in Tests/test_Emboss.py; 
/home/repository/biopython/biopython/Tests/test_Emboss.py,v <-- test_Emboss.py new revision: 1.20; previous revision: 1.19 done Checking in Tests/test_Clustalw_tool.py; /home/repository/biopython/biopython/Tests/test_Clustalw_tool.py,v <-- test_Clustalw_tool.py new revision: 1.13; previous revision: 1.12 done Checking in Bio/Application/__init__.py; /home/repository/biopython/biopython/Bio/Application/__init__.py,v <-- __init__.py new revision: 1.15; previous revision: 1.14 done Checking in Bio/Emboss/Applications.py; /home/repository/biopython/biopython/Bio/Emboss/Applications.py,v <-- Applications.py new revision: 1.23; previous revision: 1.22 done Checking in Bio/AlignAce/Applications.py; /home/repository/biopython/biopython/Bio/AlignAce/Applications.py,v <-- Applications.py new revision: 1.5; previous revision: 1.4 done Checking in Bio/Motif/Applications/_AlignAce.py; /home/repository/biopython/biopython/Bio/Motif/Applications/_AlignAce.py,v <-- _AlignAce.py new revision: 1.3; previous revision: 1.2 done Checking in Bio/Align/Applications/_Clustalw.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Clustalw.py,v <-- _Clustalw.py new revision: 1.5; previous revision: 1.4 done Checking in Bio/Align/Applications/_Mafft.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Mafft.py,v <-- _Mafft.py new revision: 1.4; previous revision: 1.3 done Checking in Bio/Align/Applications/_Muscle.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Muscle.py,v <-- _Muscle.py new revision: 1.6; previous revision: 1.5 done Checking in Bio/Align/Applications/_Prank.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Prank.py,v <-- _Prank.py new revision: 1.4; previous revision: 1.3 done (In reply to comment #7) > (In reply to comment #6) > > Still need to co-ordinate with Cymon to give the Muscle wrapper a valid > > python identifier as an alias for the -in argument, e.g. 
"input", because > > we can't use just "in" as a property or keyword argument. > > "input" for -in and maybe also "input1" "input2" as alternatives for -in1 > -in2, might the the way to go, and document it. I've used "input" as the preferred alias for "-in". Leaving this bug open to cover dealing with "switch" arguments like -kimura in clustalw, where it makes sense to treat the value as a boolean (see dev mailing list). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon May 4 17:48:28 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 13:48:28 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905041748.n44HmSaN003712@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #23 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 13:48 EST ------- In Prank, should realbranches take no arguments? i.e. use the new _Switch class? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon May 4 17:49:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 4 May 2009 13:49:20 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200905041749.n44HnK8j003766@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 ------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-04 13:49 EST ------- (In reply to comment #8) > Leaving this bug open to cover dealing with "switch" arguments like -kimura in > clustalw, where it makes sense to treat the value as a boolean (see dev mailing > list). Done in CVS, I think. Next, more test and documentation... Checking in Bio/Application/__init__.py; /home/repository/biopython/biopython/Bio/Application/__init__.py,v <-- __init__.py new revision: 1.16; previous revision: 1.15 done Checking in Bio/Align/Applications/_Clustalw.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Clustalw.py,v <-- _Clustalw.py new revision: 1.6; previous revision: 1.5 done Checking in Bio/Align/Applications/_Mafft.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Mafft.py,v <-- _Mafft.py new revision: 1.5; previous revision: 1.4 done Checking in Bio/Align/Applications/_Muscle.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Muscle.py,v <-- _Muscle.py new revision: 1.7; previous revision: 1.6 done Checking in Bio/Align/Applications/_Prank.py; /home/repository/biopython/biopython/Bio/Align/Applications/_Prank.py,v <-- _Prank.py new revision: 1.5; previous revision: 1.4 done Checking in Tests/test_Clustalw_tool.py; /home/repository/biopython/biopython/Tests/test_Clustalw_tool.py,v <-- test_Clustalw_tool.py new revision: 1.14; previous revision: 1.13 done Checking in Tests/test_Muscle_tool.py; /home/repository/biopython/biopython/Tests/test_Muscle_tool.py,v <-- test_Muscle_tool.py new revision: 1.8; 
previous revision: 1.7 done Checking in Tests/test_Prank_tool.py; /home/repository/biopython/biopython/Tests/test_Prank_tool.py,v <-- test_Prank_tool.py new revision: 1.6; previous revision: 1.5 done -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 5 12:04:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 5 May 2009 08:04:09 -0400 Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO In-Reply-To: Message-ID: <200905051204.n45C4987022142@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2294 ------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-05 08:04 EST ------- Created an attachment (id=1292) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1292&action=view) Patch to Bio/SeqIO/InsdcIO.py to write GenBank features This patch adds basic support for writing features in GenBank files. There is still plenty to do: * Full testing, both manual and with extended unit test coverage * Wrapping long feature locations * Writing references * Extending to cover writing EMBL files Note that this requires the latest Bio.GenBank code from CVS, as during this work I found and fixed two small issues with the location parsing. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From chapmanb at 50mail.com Tue May 5 12:36:57 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 5 May 2009 08:36:57 -0400 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> Message-ID: <20090505123656.GB15113@sobchak.mgh.harvard.edu> Hi Peter; Nice to have you back. Hope you had a relaxing few days away. > I updated the patch on Bug 2822 to cover all the Bio.Application > command line wrapper subclasses, and included __repr__ support. > However, that has raised a real example of a parameter where the > current "human readable" name is not a valid python identifier ("in", > for "-in" in Muscle). I think the pragmatic solution is to add a > sensible alternative which we can use for the property and keyword > argument name (e.g. "input" in this case) while in general keeping > these names as close as possible to the actual parameter name as used > at the command line. Agreed. This is the best solution for these few conflicting cases. > On the other hand, some might argue for giving all the options > meaningful names. The (hardly used) existing blastall wrapper in > Bio/Blast/Applications.py gives the "-a" argument a human readable > name of "nprocessors", and "-A" gets "window_size". With the old > set_parameter call either alias could be used. However, with a python > property we need to pick one as a preferred name - and I'm not 100% > sure being helpful and using "nprocessors" (e.g. cline.nprocessors=4) > is actually better than using the actual argument name (e.g. cline.a = > 4). Could we support both the original argument and optional human readable arguments? I know the code in Application is a bit hard coded for the first argument as the real name and the last argument as the readable name; the cleanest solution would be to generalize this to have multiple names where it makes sense. 
More practically, it always makes sense to have the low level standard arguments from the program itself. Even if they are non-intuitive, like BLAST's switches, people who already understand the program can just use their existing knowledge without any Biopython-specific knowledge. Where someone wants to support more useful names, they can add those in. You have been digging around in this so probably have a good idea how hard this is to implement practically. If it's a pain, I'd argue to just have the original arguments now, and the useful names can go on a todo list. Brad From chapmanb at 50mail.com Tue May 5 12:50:59 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 5 May 2009 08:50:59 -0400 Subject: [Biopython-dev] XML parsing library for new modules In-Reply-To: <320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com> References: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com> <3493.66471.qm@web62406.mail.re1.yahoo.com> <320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com> Message-ID: <20090505125058.GC15113@sobchak.mgh.harvard.edu> Peter, Michiel and Eric; > > Another thing to consider is what to do with the result > > returned by ElementTree. Whereas it will contain all the > > information in the XML file, it may not represent it in a > > user-friendly way. You may want to take the output from > > ElementTree and store it in a more biopython-like object. Agreed. Most of the fun creative parts of the project, as opposed to the parsing nuts and bolts, will be in developing the object representations. > > Also keep in mind memory usage: ElementTree will keep > > the complete XML file in memory, whereas the SAX > > parser gives you more flexibility here (see below).
ElementTree can do incremental parsing, so you can also deal with large files using it: http://effbot.org/zone/element-iterparse.htm Brad From biopython at maubp.freeserve.co.uk Tue May 5 13:58:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 May 2009 14:58:04 +0100 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <20090505123656.GB15113@sobchak.mgh.harvard.edu> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> <20090505123656.GB15113@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00905050658h2cabf55dhfbb467042135843a@mail.gmail.com> On Tue, May 5, 2009 at 1:36 PM, Brad Chapman wrote: > > Could we support both the original argument and optional human > readable arguments? I know the code in Application is a bit > hard coded for the first argument as the real name and the last > argument as the readable name; the cleanest solution would be to > generalize this to have multiple names where it makes sense. You mean for these BLAST examples, create two properties "a" and "nprocessors", both controlling the "-a" parameter, and also two properties "A" and "window_size" both controlling "-A"? From a code point of view, this would be moderately straight forward - but I'm not convinced about this. > More practically, it always makes sense to have the low level > standard arguments from the program itself. Even if it is > non-intuitive like BLASTs switches, people who already understand > the program can just use their existing knowledge without any > specific knowledge of how Biopython. Yes :) Personally I initially found it very frustrating when using the Bio.Blast.NCBIStandalone.blastall wrapper because the NCBI switches had all been given friendly names, and it wasn't clear without looking at the source code what mapped to what. As a minor change, I think the Bio.Blast.NCBIStandalone.blastall docstring should actually include the real NCBI switch used by each Biopython keyword. 
> Where someone wants to support more useful names, they can > add those in. So that we cater to those familiar with the NCBI command line arguments, but also give a more human alternative? On the downside, it means there are two ways to set these parameters. Also, if we go down this route for consistency for all command line wrappers we may want to invent more human readable aliases (if the tool arguments are too cryptic). We are also opening up a potential problem if the tool later adds a new argument whose name clashes with one of our inventions. Also would we care about the lack of consistency between tools (e.g. infile versus input?), and should we try and be consistent in our new names? I favour using only a single property for each parameter, with the name as similar as possible to the actual command line switch (i.e. property name "a" for "-a", not "nprocessors"). Note each property would have a docstring which will say what it is for ("Number of processors to use."). In the case of the existing blastall wrapper in Bio.Blast.Applications, I would change names=["-a", "nprocessors"] to ["-a", "nprocessors", "a"], meaning "a" (last entry) would be the property name used, "-a" (first entry) would be used for the actual command line string. I would keep the "nprocessors" alias for backwards compatibility only - all three aliases would be available to the (legacy) method set_parameter. > You have been digging around in this so probably have a good idea > how hard this is to implement practically. If it's a pain, I'd argue > to just have the original arguments now, and the useful names can go > on a todo list. It is certainly possible, although probably a bit tedious due to changing the "boilerplate" code.
Peter From bugzilla-daemon at portal.open-bio.org Tue May 5 14:37:56 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 5 May 2009 10:37:56 -0400 Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO In-Reply-To: Message-ID: <200905051437.n45EbuNA006427@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2294 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1292 is|0 |1 obsolete| | ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-05 10:37 EST ------- (From update of attachment 1292) Checked into CVS now. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Tue May 5 15:26:20 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 5 May 2009 16:26:20 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com> Message-ID: <320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com> On Tue, Apr 21, 2009 at 2:55 PM, Peter Cock wrote: >> I have also been thinking about how I would (re)design the SeqFeature >> and FeatureLocation objects. In particular I would want to put the >> strand as part of the same object as the location, and also any >> join-locations. I would still want to cope with fuzzy locations, but >> make the non-fuzzy approximations more prominent in comparison.
Also, >> I really don't like the way joins are currently stored as more >> SeqFeatures in the sub_features list (plus this kind of blocks >> alternative usage for child/parent nesting that might be nice for GFF >> files). >> >> The prime use case to keep in mind is taking a feature location (even >> a join), and using this to extract that region of nucleotides from the >> parent sequence (i.e. a Seq object or a SeqRecord object, as now both >> can be sliced). I've written code to do this in test_SeqIO_features.py, which cross checks the nucleotides pulled out from a GenBank file based on the SeqFeature, against what the NCBI provide in FASTA format. This seems to work OK, but has not been tested extensively (e.g. running it on drosophila or arabidopsis would be good). It could make sense to expose this functionality directly in Biopython, maybe as a method of the SeqRecord taking a SeqFeature (or the index of a feature in that record), returning a Seq object (or perhaps a SeqRecord using the feature's annotation). e.g.
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"),"genbank")
>>> record.extract_feature_seq(6)
Seq('GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAG...TAA', IUPACAmbiguousDNA())
>>> feature = record.features[6]
>>> record.extract_feature_seq(feature)
Seq('GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAG...TAA', IUPACAmbiguousDNA())
Alternatively, rather than introducing a new method (e.g. "extract_feature_seq" as in the above example) we could overload the __getitem__ method of the SeqRecord, i.e. overloading the slice mechanism so a SeqFeature can alternatively be given, e.g. record[feature]. Note that passing the index of a feature wouldn't work as record[6] currently means the seventh letter, rather than the seventh feature. Note that just passing a SeqFeature's FeatureLocation is not enough, as this lacks the strand information, and also any sub-features and associated location operator (i.e. join).
> I forgot to mention the second major use case I'm concerned about, > which is recovering the GenBank/EMBL style location string. I have > looked at this in the past, by adding methods to the FeatureLocation > and all the Position objects, but it is complicated by the fact the > Position objects don't know if they are at the start or end (and for > the start locations we need to add one to convert from Python > counting). This is the main block on having Bio.SeqIO support writing > GenBank (or EMBL) files with their features included. See Bug 2294 for writing GenBank files: http://bugzilla.open-bio.org/show_bug.cgi?id=2294 I've just checked in some code to record the features when writing GenBank files with Bio.SeqIO. I solved the feature location issue by introducing a private function which knows about all the currently used AbstractPosition objects - the code is actually pretty short. Peter From p.j.a.cock at googlemail.com Tue May 5 16:41:31 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 5 May 2009 17:41:31 +0100 Subject: [Biopython-dev] Dropping Python 2.3 support in Biopython Message-ID: <320fb6e00905050941m37725eb5ibba02ca99236212e@mail.gmail.com> Hello all, This is a final warning that the next release of Biopython will not support Python 2.3. As far as we are aware, no-one has come forward with a need for continued support for Python 2.3, so we will soon begin removing the special case code needed to keep Biopython working on Python 2.3. This will give us a simpler code base, fewer platforms to test on, and we can also take advantage of various language features only available in Python 2.4+ (e.g. generator expressions and decorators). Any last minute requests to postpone this should be made to the main Biopython mailing list by Friday 8 May.
Thank you, Peter From sbassi at clubdelarazon.org Tue May 5 22:49:11 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Tue, 5 May 2009 19:49:11 -0300 Subject: [Biopython-dev] Missing directories with easy_install? Message-ID: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com> When I install Biopython 1.5 (and previous versions too) using easy_install, it seems that docs, test and scripts directories are not installed (see here for a screenshot, panel at left is easy_install product while right panel is when I manually uncompress biopython tarball: http://www.genesdigitales.com/bioinfo/biopy.jpg). Is this expected or an oversight? From biopython at maubp.freeserve.co.uk Tue May 5 22:56:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 May 2009 23:56:00 +0100 Subject: [Biopython-dev] Missing directories with easy_install? In-Reply-To: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com> References: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com> Message-ID: <320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com> On Tue, May 5, 2009 at 11:49 PM, Sebastian Bassi wrote: > When I install Biopython 1.5 (and previous versions too) using > easy_install, it seems that docs, test and scripts directories are not > installed (see here for a screenshot, panel at left is easy_install > product while right panel is when I manually uncompress biopython > tarball: http://www.genesdigitales.com/bioinfo/biopy.jpg). > Is this expected or an oversight? You'd have to ask Brad for an expert opinion, but I think this is probably to be expected. If you install from source, the only folders copied to site-packages are Bio, BioSQL, and Martel. See also this thread: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005924.html Peter P.S. 
I assume you meant Biopython 1.50 and not 1.5 ;) From sbassi at clubdelarazon.org Tue May 5 23:05:46 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Tue, 5 May 2009 20:05:46 -0300 Subject: [Biopython-dev] Missing directories with easy_install? In-Reply-To: <320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com> References: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com> <320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com> Message-ID: <9e2f512b0905051605k663035d7td84372847675c7d4@mail.gmail.com> On Tue, May 5, 2009 at 7:56 PM, Peter wrote: > You'd have to ask Brad for an expert opinion, but I think this is > probably to be expected. If you install from source, the only folders > copied to site-packages are Bio, BioSQL, and Martel. > See also this thread: > http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005924.html OK, so that is. > P.S. I assume you meant Biopython 1.50 and not 1.5 ;) Yes! From biopython at maubp.freeserve.co.uk Tue May 5 23:33:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 May 2009 00:33:16 +0100 Subject: [Biopython-dev] SeqRecord per-letter-annotation : avoid lists? Message-ID: <320fb6e00905051633i70604746i332b3bfaf3476876@mail.gmail.com> Hi all, I was thinking about the SeqRecord object's letter_annotations, and that perhaps we should only allow strings and tuples (which are immutable), but not lists. Because lists are mutable, the user can (accidentally) alter the list such that its length doesn't match that of the associated sequence (which would be bad). Currently we do use lists in the SeqRecord's letter_annotations, e.g. for qualities. I don't recall having any particular reason for using a list rather than a tuple. Any thoughts on this?
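[Illustrative note, not part of the original message.] The concern about mutable per-letter annotation can be sketched in plain Python. This is only a rough illustration under stated assumptions: it does not use SeqRecord at all, and the helper name lengths_match is hypothetical, made up for this example.

```python
def lengths_match(sequence, annotation):
    """Hypothetical helper: True if there is one annotation value per letter."""
    return len(sequence) == len(annotation)


sequence = "ACGT"

# A list can be altered in place after it has been attached to a sequence,
# so the one-value-per-letter invariant can be broken by accident.
qualities_list = [40, 38, 35, 30]
assert lengths_match(sequence, qualities_list)
qualities_list.append(20)
list_still_matches = lengths_match(sequence, qualities_list)

# A tuple has no append method, so the same mistake fails immediately.
qualities_tuple = (40, 38, 35, 30)
try:
    qualities_tuple.append(20)
    tuple_was_modified = True
except AttributeError:
    tuple_was_modified = False
tuple_still_matches = lengths_match(sequence, qualities_tuple)
```

With a tuple the only way to change the values is to build a new tuple and assign it, which gives the owning object a single place to re-check the length.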
Peter From p.j.a.cock at googlemail.com Wed May 6 10:32:01 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 6 May 2009 11:32:01 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com> <320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com> Message-ID: <320fb6e00905060332t2b9d9595pca68b83db8cef28f@mail.gmail.com> On Tue, May 5, 2009 at 4:26 PM, Peter Cock wrote: > On Tue, Apr 21, 2009 at 2:55 PM, Peter Cock wrote: >>> The prime use case to keep in mind is taking a feature location (even >>> a join), and using this to extract that region of nucleotides from the >>> parent sequence (i.e. a Seq object or a SeqRecord object, as now both >>> can be sliced). > > I've written code to do this in test_SeqIO_features.py, which cross > checks the nucleotides pulled out from a GenBank file based on the > SeqFeature, against what the NCBI provide in FASTA format. This seems > to work OK, but has not been tested extensively (e.g. running it on > drosophila or arabidopsis would be good). Yep - found a corner case my code can't yet cope with, from the Arabidopsis thaliana chloroplast (NC_000932). This has some pathological mixed strand locations, like join(complement(69611..69724),139856..140650) which is for a trans-spliced ribosomal protein. > It could make sense to expose this functionality directly in > Biopython, ... Given this code is non-trivial to implement, this seems worth doing.
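[Illustrative note, not part of the original message.] A minimal sketch of how a mixed-strand join such as join(complement(a..b),c..d) might be extracted, working on plain strings rather than Seq or SeqFeature objects. It is not the code in test_SeqIO_features.py. The function names and the (start, end, strand) tuple layout are assumptions made for this example, and it assumes the parts are concatenated in the order written, with each complemented part reverse-complemented on its own.

```python
# Translation table for unambiguous DNA letters only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def extract_join(parent, parts):
    """Concatenate sub-regions of parent in the order given.

    parts is a list of (start, end, strand) tuples using GenBank style
    1-based inclusive coordinates; strand is +1 or -1 for each part.
    """
    pieces = []
    for start, end, strand in parts:
        piece = parent[start - 1:end]  # convert to Python's 0-based slicing
        if strand == -1:
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


# join(complement(1..3),6..8) on a toy parent sequence
parent = "ATGCCGTA"
result = extract_join(parent, [(1, 3, -1), (6, 8, +1)])
```

Carrying the strand on every part, rather than once per feature, is what lets a trans-spliced location like the one above be represented at all.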
Peter From bugzilla-daemon at portal.open-bio.org Wed May 6 22:50:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 6 May 2009 18:50:08 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200905062250.n46Mo8EM023616@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #9 from eric.talevich at gmail.com 2009-05-06 18:50 EST ------- Created an attachment (id=1293) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1293&action=view) Additional warnings test for Py2.6+ This is the file that test_PDB_unit.py can import to plug in an additional test for specific warnings. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 6 22:54:06 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 6 May 2009 18:54:06 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200905062254.n46Ms6YP023831@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #10 from eric.talevich at gmail.com 2009-05-06 18:54 EST ------- Created an attachment (id=1294) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1294&action=view) test_PDB_unit.py, with conditional import This is a modified test_PDB_unit.py that checks whether the necessary context manager is available (it will be for Py2.6+), and if so, imports the additional unit test from _PDB_extra.py into the current class. (Sorry it's a whole file, I was having trouble diffing between git branches.) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 7 08:51:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 04:51:35 -0400 Subject: [Biopython-dev] [Bug 2824] New: Bio.Entrez.epost is using an HTTP GET not an HTTP POST Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2824 Summary: Bio.Entrez.epost is using an HTTP GET not an HTTP POST Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Following from a query on our mailing list suggesting Bio.Entrez.epost is failing with long ID lists, I looked a little more closely at the code and it is actually using an HTTP GET instead of an HTTP POST (which would avoid the long URL problem). See: http://lists.open-bio.org/pipermail/biopython/2009-May/005149.html We can still use urllib to do this with its data argument... http://docs.python.org/library/urllib.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
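[Illustrative note, not part of the original message.] The GET versus POST distinction described in the bug report can be sketched with the standard library alone. This is a hedged sketch in present-day Python 3 (urllib.request), not the Bio.Entrez patch itself; the function names are hypothetical and the URL constant is an assumption for illustration. No request is actually sent here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint, used only to build the request objects below.
EPOST_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"


def build_get_request(db, ids):
    """Parameters go in the URL, so a long ID list makes a long URL."""
    query = urlencode({"db": db, "id": ",".join(str(i) for i in ids)})
    return Request(EPOST_URL + "?" + query)


def build_post_request(db, ids):
    """Parameters go in the request body; supplying data makes urllib use POST."""
    body = urlencode({"db": db, "id": ",".join(str(i) for i in ids)})
    return Request(EPOST_URL, data=body.encode("ascii"))


ids = range(1, 10000)
get_request = build_get_request("pubmed", ids)
post_request = build_post_request("pubmed", ids)
```

The POST request keeps the URL at a fixed, short length however many IDs are supplied, which is what avoids the "414 Request-URI Too Large" response; passing either object to urllib.request.urlopen would perform the actual call.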
From bugzilla-daemon at portal.open-bio.org Thu May 7 09:18:58 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 05:18:58 -0400 Subject: [Biopython-dev] [Bug 2824] Bio.Entrez.epost is using an HTTP GET not an HTTP POST In-Reply-To: Message-ID: <200905070918.n479IwHQ031195@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2824 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-07 05:18 EST ------- Created an attachment (id=1295) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1295&action=view) Patch for Bio/Entrez/__init__.py This patch does two things, (1) Makes Bio.Entrez.epost do an HTTP POST (2) Catches the too long URL error 414 messages and raises an IOError Without the patch: >>> print Entrez.epost("pubmed", id=",".join(str(i) for i in range(1,10000))).read() 414 Request-URI Too Large

Request-URI Too Large

The requested URL's length exceeds the capacity limit for this server.

>>> print Entrez.efetch("pubmed", id=",".join(str(i) for i in range(1,10000)), retmode="text", rettype="uilist").read() 414 Request-URI Too Large

Request-URI Too Large

The requested URL's length exceeds the capacity limit for this server.

Note both the above trigger the Error 414 message, but it does not get caught. With the patch: >>> print Entrez.epost("pubmed", id=",".join(str(i) for i in range(1,10000))).read() 1 NCID_01_264798363_130.14.18.47_9001_1241687667 >>> print Entrez.efetch("pubmed", id=",".join(str(i) for i in range(1,10000)), retmode="text", rettype="uilist").read() Traceback (most recent call last): File "", line 1, in File "Bio/Entrez/__init__.py", line 126, in efetch return _open(cgi, variables) File "Bio/Entrez/__init__.py", line 370, in _open raise IOError("Requested URL too long (try using EPost?)") IOError: Requested URL too long (try using EPost?) Now epost works with long arguments, and using the other tools with too long a URL will trigger an IOError. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu May 7 10:20:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 06:20:10 -0400 Subject: [Biopython-dev] [Bug 2824] Bio.Entrez.epost is using an HTTP GET not an HTTP POST In-Reply-To: Message-ID: <200905071020.n47AKAGD002826@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2824 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-07 06:20 EST ------- Patch checked in (OK'd with Michiel), marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 7 13:56:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 09:56:09 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905071356.n47Du9iQ018532@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #24 from cymon.cox at gmail.com 2009-05-07 09:56 EST ------- (In reply to comment #23) > In Prank, should realbranches take no arguments? i.e. use the new _Switch > class? Yes, verified and done; pushed to applic-int branch. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu May 7 14:07:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 10:07:23 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905071407.n47E7Nn7019531@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #25 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-07 10:07 EST ------- (In reply to comment #24) > (In reply to comment #23) > > In Prank, should realbranches take no arguments? i.e. use the new _Switch > > class? > > Yes, verified and done; pushed to applic-int branch. > C. Thanks for checking - that's done in CVS now. I think the final bit of new code is _Dialign.py which still needs to be updated for the new style __init__ method. Then there are your unit tests... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 7 14:39:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 10:39:40 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905071439.n47Edeaj022126@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #26 from cymon.cox at gmail.com 2009-05-07 10:39 EST ------- (In reply to comment #25) > (In reply to comment #24) > > (In reply to comment #23) > > > In Prank, should realbranches take no arguments? i.e. use the new _Switch > > > class? > > > > Yes, verified and done; pushed to applic-int branch. > > C. > > Thanks for checking - that's done in CVS now. > > I think the final bit of new code is _Dialign.py which still needs to be > updated for the new style __init__ method. Done - pushed to applic-int (Note windows path stuff absent from _Dialign) > Then there are your unit tests... As they are at present, unittests for Muscle, Mafft, Dialign and Prank all pass. They could of course be made arbitrarily more complex... they should probably have at least one test that uses the properties style parameter setting rather than just set_paramter() C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 7 15:22:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 May 2009 11:22:35 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905071522.n47FMZ16025500@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #27 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-07 11:22 EST ------- (In reply to comment #26) > > I think the final bit of new code is _Dialign.py which still needs to be > > updated for the new style __init__ method. > > Done - pushed to applic-int (Note windows path stuff absent from _Dialign) > OK, that is in CVS now. > > Then there are your unit tests... > > As they are at present, unittests for Muscle, Mafft, Dialign and Prank all > pass. They could of course be made arbitrarily more complex... they should > probably have at least one test that uses the properties style parameter > setting rather than just set_paramter() > C. I've added test_Dialign_tool.py to CVS, and then switched a few to using keyword arguments and properties. As far as I can see from here, the tool isn't expected to work on Windows (although it might still be possible with cygwin): http://bibiserv.techfak.uni-bielefeld.de/download/tools/DIALIGN_221.html Is that everything? You'd mentioned a more general test which just builds the strings, but doesn't actually need to run any of the tools themselves. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri May 8 12:07:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 08:07:03 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905081207.n48C73cT012732@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #28 from cymon.cox at gmail.com 2009-05-08 08:07 EST ------- (In reply to comment #27) > (In reply to comment #26) > > > I think the final bit of new code is _Dialign.py which still needs to be > > > updated for the new style __init__ method. > > > > Done - pushed to applic-int (Note windows path stuff absent from _Dialign) > > > > OK, that is in CVS now. > > > > Then there are your unit tests... > > > > As they are at present, unittests for Muscle, Mafft, Dialign and Prank all > > pass. They could of course be made arbitrarily more complex... they should > > probably have at least one test that uses the properties style parameter > > setting rather than just set_paramter() > > C. > > I've added test_Dialign_tool.py to CVS, and then switched a few to using > keyword arguments and properties. As far as I can see from here, the tool > isn't expected to work on Windows (although it might still be possible with > cygwin): > http://bibiserv.techfak.uni-bielefeld.de/download/tools/DIALIGN_221.html > > Is that everything? That's everything currently written. I still want to add interfaces to ProbCons and T-Coffee. You'd mentioned a more general test which just builds the > strings, but doesn't actually need to run any of the tools themselves. Yes, I'll do that. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri May 8 12:23:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 08:23:03 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905081223.n48CN3nV013977@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #29 from chapmanb at 50mail.com 2009-05-08 08:23 EST ------- Created an attachment (id=1296) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1296&action=view) Start of TCoffee command line Cymon; Here is the start of a TCoffee command line object. It's not up to date with the latest changes y'all have been making and doesn't have all the options, but should save some typing. Brad -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 8 19:14:27 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 15:14:27 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200905081914.n48JERYx012798@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1293 is|0 |1 obsolete| | Attachment #1294 is|0 |1 obsolete| | ------- Comment #11 from eric.talevich at gmail.com 2009-05-08 15:14 EST ------- Created an attachment (id=1297) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1297&action=view) Py2.6-only unit test of PDB warnings I pushed a branch called bug2820 to github containing just this commit, if that's easier: http://github.com/etal/biopython/tree/bug2820 Any suggestions for naming the new file? 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 8 21:45:53 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 17:45:53 -0400 Subject: [Biopython-dev] [Bug 2817] Meta-bug for cleanup once we drop Python 2.3 support In-Reply-To: Message-ID: <200905082145.n48Ljr4L023802@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2817 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-08 17:45 EST ------- I've started removing support for Python 2.3 in CVS, including removing all the sets and subprocess special case code. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 8 22:14:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 May 2009 18:14:36 -0400 Subject: [Biopython-dev] [Bug 2825] New: SeqIO does not successfully parse Genbank records related to whole genome sequencing deposits, as Did not recognise the LOCUS line layout Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2825 Summary: SeqIO does not successfully parse Genbank records related to whole genome sequencing deposits, as Did not recognise the LOCUS line layout Product: Biopython Version: 1.49 Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk Hi I'm using the BioPython distribution 1.49 obtained as a Package using the Ubuntu 9 synaptic package manager. 
The below describes the problem:

NCBI has a record type which describes the contents of whole-genome sequencing
projects. The record doesn't itself contain sequence, by contrast to most
GenBank records. This URL gives an example:

http://www.ncbi.nlm.nih.gov/nuccore/162285818

Should the SeqIO parser be able to read this? It cannot. Here is an example:

# import modules
from Bio import Entrez
from Bio import SeqIO

# read the record from NCBI, print out the contents.
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
masterrecord=handle.readlines()
for line in masterrecord:
    print line
handle.close()

# let's read it again, and try to parse it with SeqIO.
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
# this line causes the crash
seq_record = SeqIO.read(handle, "genbank")
handle.close()

# fails. the traceback reads
"""
Traceback (most recent call last):
  File "bugreport.py", line 25, in
    seq_record = SeqIO.read(handle, "genbank")
  File "/var/lib/python-support/python2.6/Bio/SeqIO/__init__.py", line 435, in read
    first = iterator.next()
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 360, in feed
    self._feed_first_line(consumer, self.line)
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 907, in _feed_first_line
    raise ValueError('Did not recognise the LOCUS line layout:\n' + line)
ValueError: Did not recognise the LOCUS line layout:
LOCUS ABIN01000000 353 rc DNA linear BCT 10-DEC-2007
"""

# by contrast, reading one of the constituent genbank records, like this one
# http://www.ncbi.nlm.nih.gov/nuccore/162285817
# works correctly;
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285817")
seq_record = SeqIO.read(handle, "genbank")
handle.close()
print "Successfully loaded record GI=162285817"
print seq_record.description

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri May 8 22:37:47 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 18:37:47 -0400
Subject: [Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS) Genbank records
In-Reply-To:
Message-ID: <200905082237.n48MbleU027475@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2825

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
            Summary|SeqIO does not successfully |Parsing whole genome
                   |parse Genbank records       |sequencing (WGS) Genbank
                   |related to whole genome     |records
                   |sequencing deposits, as Did |
                   |not recognise the LOCUS line|
                   |layout                      |

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-08 18:37 EST -------
Hi David,

This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment. For
the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
nucleotides. Here you have "353 rc" (rc for record count), which as our error
message says, is unexpected. At the end of the record, there are also WGS
and/or WGS_SCAFLD lines to worry about:

http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html

Given these WGS files have no sequence, and no real sequence associated
features either, it strikes me that supporting this in Bio.SeqIO is a stretch
(these records are not really sequences, nor are they about a sequence).

However, Bio.GenBank should perhaps be updated to cope... so I'll leave this
bug open for that as a possible enhancement. Note I have changed the bug title
from "SeqIO does not successfully parse Genbank records related to whole genome
sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing
whole genome sequencing (WGS) Genbank records", and changed the bug priority to
an enhancement.

What information do you want from this file? In the meantime, I suggest you
fetch the record as XML, which you can parse using Bio.Entrez.read() or your
XML parser of choice.

Peter

P.S. This is a shorter way to dump the file to screen in python:

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
>>> print handle.read()
LOCUS       ABIN01000000 353 rc DNA linear BCT 10-DEC-2007
DEFINITION  Mycobacterium intracellulare ATCC 13950, whole genome shotgun
            sequencing project.
ACCESSION   ABIN00000000
VERSION     ABIN00000000.1 GI:162285818
DBLINK      Project:27955
KEYWORDS    WGS.
SOURCE      Mycobacterium intracellulare ATCC 13950
  ORGANISM  Mycobacterium intracellulare ATCC 13950
            Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
            Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium
            avium complex (MAC).
REFERENCE   1 (bases 1 to 353)
  AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
            Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
  TITLE     Mycobacterium intracellulare Genome Project
  JOURNAL   Unpublished
REFERENCE   2 (bases 1 to 353)
  AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
            Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-NOV-2007) McGill University and Genome Quebec
            Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec
            H3A 1A4, Canada
COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
            (WGS) project has the project accession ABIN00000000. This version
            of the project (01) has the accession number ABIN01000000, and
            consists of sequences ABIN01000001-ABIN01000353.
            The whole genome shotgun sequence was generated by the McGill
            University and Genome Quebec Innovation Centre using the GS De Novo
            Assembler from GS-FLX reads. This strain is available from the
            American Type Culture Collection (www.atcc.org).
FEATURES             Location/Qualifiers
     source          1..353
                     /organism="Mycobacterium intracellulare ATCC 13950"
                     /mol_type="genomic DNA"
                     /strain="ATCC 13950"
                     /serovar="16"
                     /isolation_source="human lymph node"
                     /db_xref="taxon:487521"
                     /note="type strain of Mycobacterium intracellulare ATCC 13950 associated with disease"
WGS         ABIN01000001-ABIN01000353
//

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri May 8 23:12:43 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 19:12:43 -0400
Subject: [Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS) Genbank records
In-Reply-To:
Message-ID: <200905082312.n48NChKL030485@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2825

------- Comment #2 from david.wyllie at ndm.ox.ac.uk 2009-05-08 19:12 EST -------
Thank you for your help. I just wanted to extract the WGS line, which I'm able
to do.

(In reply to comment #1)
> Hi David,
>
> This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment. For
> the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
> nucleotides. Here you have "353 rc" (rc for record count), which as our error
> message says, is unexpected.
At the end of the record, there are also WGS > and/or WGS_SCAFLD lines to worry about: > > http://www.ncbi.nlm.nih.gov/Genbank/wgs.html > http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html > > Given these WGS files have no sequence, and no real sequence associated > features either, it stikes me that supporting this in Bio.SeqIO is a stretch > (these records are not really sequences, nor are they about a sequence). > > However, Bio.GenBank should perhaps be updated to cope... so I'll leave this > bug open for that as a possible enhancement. Note I have changed the bug title > from "SeqIO does not successfully parse Genbank records related to whole genome > sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing > whole genome sequencing (WGS) Genbank records", and changed the bug priority to > an enhancement. > > What information do you want from this file? In the meantime, I suggest you > fetch the record as XML, which you can parse using Bio.Entrez.read() or your > XML parser of choice. > > Peter > > P.S. This is a shorter way to dump the file to screen in python: > > >>> from Bio import Entrez > >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818") > >>> print handle.read() > LOCUS ABIN01000000 353 rc DNA linear BCT 10-DEC-2007 > DEFINITION Mycobacterium intracellulare ATCC 13950, whole genome shotgun > sequencing project. > ACCESSION ABIN00000000 > VERSION ABIN00000000.1 GI:162285818 > DBLINK Project:27955 > KEYWORDS WGS. > SOURCE Mycobacterium intracellulare ATCC 13950 > ORGANISM Mycobacterium intracellulare ATCC 13950 > Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales; > Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium > avium complex (MAC). > REFERENCE 1 (bases 1 to 353) > AUTHORS Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J., > Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M. 
> TITLE Mycobacterium intracellulare Genome Project > JOURNAL Unpublished > REFERENCE 2 (bases 1 to 353) > AUTHORS Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J., > Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M. > TITLE Direct Submission > JOURNAL Submitted (30-NOV-2007) McGill University and Genome Quebec > Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec > H3A 1A4, Canada > COMMENT The Mycobacterium intracellulare ATCC 13950 whole genome shotgun > (WGS) project has the project accession ABIN00000000. This version > of the project (01) has the accession number ABIN01000000, and > consists of sequences ABIN01000001-ABIN01000353. > The whole genome shotgun sequence was generated by the McGill > University and Genome Quebec Innovation Centre using the GS De Novo > Assembler from GS-FLX reads. This strain is available from the > American Type Culture Collection (www.atcc.org). > FEATURES Location/Qualifiers > source 1..353 > /organism="Mycobacterium intracellulare ATCC 13950" > /mol_type="genomic DNA" > /strain="ATCC 13950" > /serovar="16" > /isolation_source="human lymph node" > /db_xref="taxon:487521" > /note="type strain of Mycobacterium intracellulare ATCC > 13950 > associated with disease" > WGS ABIN01000001-ABIN01000353 > // > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
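
[Editorial sketch] David mentions above that he only needed the WGS line from the master record. One way to pull it out of the flat-file text with plain string handling could be the following. This is an illustration only, not part of Bio.GenBank or Bio.SeqIO; the function name wgs_ranges is hypothetical, and the sample record is abbreviated from the dump quoted in the thread.

```python
def wgs_ranges(genbank_text):
    """Return a list of (keyword, first, last) tuples for WGS lines.

    Assumes each WGS or WGS_SCAFLD line holds the keyword followed by a
    single accession or an accession range written as FIRST-LAST.
    """
    ranges = []
    for line in genbank_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("WGS", "WGS_SCAFLD"):
            first, _, last = parts[1].partition("-")
            # A single accession (no dash) is treated as a range of one.
            ranges.append((parts[0], first, last or first))
    return ranges


# Abbreviated copy of the record shown in the thread.
record = """\
LOCUS       ABIN01000000 353 rc DNA linear BCT 10-DEC-2007
KEYWORDS    WGS.
WGS         ABIN01000001-ABIN01000353
//
"""

print(wgs_ranges(record))
```

With a live record, the text passed in would be whatever handle.read() returns from the Entrez.efetch call shown earlier in the thread.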
From bugzilla-daemon at portal.open-bio.org Sat May 9 11:59:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 9 May 2009 07:59:32 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200905091159.n49BxWpM015484@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815

------- Comment #30 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-09 07:59 EST -------
I've got test_Mafft_tool.py working on one Linux machine using MAFFT v6.626b
(2009/03/16) installed from source.

However, test_Mafft_tool.py fails on another Linux machine using MAFFT v6.240
(2007/04/04) installed using the distribution's package, in this case Ubuntu
Jaunty:
http://packages.ubuntu.com/jaunty/mafft

Note that the next version of Ubuntu currently also uses the same old package:
http://packages.ubuntu.com/karmic/mafft

As does Debian unstable:
http://packages.debian.org/unstable/science/mafft

From trying mafft v6.240 by hand at the command line, it never seems to
actually print anything to the console. Either the MAFFT API changed (which
doesn't seem to be the case), or the version Ubuntu installed on this machine
is broken. This could be due to something else like the version of awk or gcc
(guesses based on the MAFFT change log):
http://align.bmr.kyushu-u.ac.jp/mafft/software/

Note that the latest version is now MAFFT 6.704, so we should try that too. If
I am right about the current Ubuntu/Debian package being broken, we should get
in touch with them about updating it... otherwise we can look forward to bug
reports about our wrapper and/or test_Mafft_tool.py failing.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sat May 9 12:31:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 9 May 2009 08:31:55 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200905091231.n49CVtUj017919@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-09 08:31 EST ------- (In reply to comment #8) > I have something that works on both Py2.5 and Py2.6 now: > http://github.com/etal/biopython/tree/pdbtidy Would it be easy for you to test your code on Python 2.4? I can probably do that but not right now... I would prefer to avoid the extra file by writing this test as part of test_PDB_unit.py - but the "with" statement isn't valid syntax on Python 2.4, although it can be used on Python 2.5 via: from __future__ import with_statement Could you re-write this to avoid the with statement? > Also, apparently tests are run in alphabetical order, ... Yes, that is expected. > ... and Exposure was jumping ahead of PDBExceptionTest. I renamed > PDBExceptionTest to ExceptionTest to restore the natural order of > things and stop setting off the warnings prematurely. Maybe test > suites with multiple TestCase classes should be arranged alphabetically > in the code to avoid confusion in the future. Ideally the unit tests should work in any order - and this is generally a reasonable assumption, as they should be independent. Having some carefully named unit tests will only hide the ordering problem (which is due to the global state information in the warnings module). At the very least, we should probably have comments in the code about this (to avoid issues in the future) and maybe use an eye-catching name like AAAAA which should always come first. 
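
[Editorial sketch] The two points raised above (avoiding the "with" statement, and not depending on test order because of the global state in the warnings module) could be handled by saving and restoring that state around each test. The following is only a rough illustration, written so it also runs on current Python; the class name and the warning text are hypothetical and it is not the code from the bug2820 branch.

```python
import unittest
import warnings


class WarningCaptureTest(unittest.TestCase):
    """Record warnings without using the 'with' statement."""

    def setUp(self):
        # Save the global warnings state so this test neither depends on
        # nor disturbs whichever tests happen to run before or after it.
        self._saved_filters = warnings.filters[:]
        self._saved_showwarning = warnings.showwarning
        self.caught = []
        warnings.simplefilter("always")
        warnings.showwarning = self._record

    def tearDown(self):
        # Restore the state saved in setUp.
        warnings.filters[:] = self._saved_filters
        warnings.showwarning = self._saved_showwarning

    def _record(self, message, category, filename, lineno, file=None, line=None):
        # Replacement for warnings.showwarning: store instead of printing.
        self.caught.append((category, str(message)))

    def test_warning_is_recorded(self):
        warnings.warn("hypothetical parser warning", UserWarning)
        self.assertEqual(self.caught,
                         [(UserWarning, "hypothetical parser warning")])
```

On Python 2.6 and later, warnings.catch_warnings(record=True) offers the same save-and-restore behaviour as a context manager, which is where the "with" statement question in this thread comes from.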
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Sat May 9 13:06:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 9 May 2009 14:06:15 +0100
Subject: [Biopython-dev] PhyloXML read/parse functions and handles
Message-ID: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com>

Hi Eric,

Are you happy to have feedback on your PhyloXML code in public? In
this case I wanted to make a fairly general observation about parsing
files using handles, so I have cc'd the dev list.

I just had a look at the stub in Bio/PhyloXML/__init__.py and
Bio/PhyloXML/Parser.py on your github branch,
http://github.com/etal/biopython/tree/phyloxml

The convention we are following in Biopython for parsing functions is
as follows:
read(handle, ...) - returns a single object (e.g. a tree in your case)
parse(handle, ...) - returns an iterator (e.g. returning multiple trees)

[This naming convention is arbitrary, but we should try to stick to it
in all new parsers for consistency.]

In Bio/PhyloXML/Parser.py you have a parse() sub function which
according to the comment appears to return a single tree. If so, this
should be a read() function instead of a parse() function.

You seem to have a read() stub function in Bio/PhyloXML/__init__.py
which returns a single tree (good), but takes a (zip) filename (not a
handle - bad). Taking just a filename prevents using a whole range of
handle objects as input - e.g. StringIO handles, URL handles, piped
output from a command line tool etc. This flexibility is why we focus
on dealing with handles for parsers.

On a related point, you should leave unzipping the file to the user -
this is not specific to dealing with XML tree files. Plus, in addition
to zip files (i.e. pkzip/winzip format), there are other compressed
file formats to consider, such as tarballs. They too can be opened and
decompressed on the fly as a handle (e.g. see the gzip python library).
By taking a handle as the input your parser can then be used with any
of these input sources.

Peter

P.S. Finally, a more general note about a possible "Bio.TreeIO" module.

For simple Newick trees, a single file can contain one or more trees
(e.g. from bootstrapping). A tree can be split over multiple lines
(but may be one long line), but multiple trees can be split up because
they should all have a semicolon terminator.

For Nexus files, I'm not sure off hand if there can be more than one
tree.

If you are going to use the Tree objects from Bio.Nexus, then we could
provide a "Bio.TreeIO" module with read/parse/write methods coping
with "newick", "nexus", "phyloxml" formats, all using the same tree
objects.

From bugzilla-daemon at portal.open-bio.org Sat May 9 16:40:27 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 9 May 2009 12:40:27 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200905091640.n49GeRvY002521@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815

------- Comment #31 from cymon.cox at gmail.com 2009-05-09 12:40 EST -------
(In reply to comment #30)
> I've got test_Mafft_tool.py working on one Linux machine using MAFFT v6.626b
> (2009/03/16) installed from source.

That was my reference installation when writing the command line tool (on
Jaunty/RHE 5.3).
> However, test_Mafft_tool.py fails on another Linux machine using MAFFT v6.240 > (2007/04/04) installed using the distribution's package, in this case Ubuntu > Jaunty: > http://packages.ubuntu.com/jaunty/mafft > > Note that the next version of Ubuntu currently also uses the same old package: > http://packages.ubuntu.com/karmic/mafft > > As does Debian unstable: > http://packages.debian.org/unstable/science/mafft > > From trying mafft v6.240 by hand at the command line, it never seems to > actually print anything to the console. Either the MAFFT API changed (which > doesn't seem to be the case), or the version Ubuntu installed on this machine > is broken. This could be due to something else like the version of awk or gcc > (guesses based on the MAFFT change log): > http://align.bmr.kyushu-u.ac.jp/mafft/software/ Hadn't tried the Ubuntu package... On the upside, the Muscle3.7 package installed from Ubuntu passes our tests, whereas the source compiles but core-dumps. Similarly, ProbCons1.2 won't compile but the Ubuntu package looks good (havent written the tests yet). > Note that the latest version is now MAFFT 6.704, so we should try that too. If > I am right about the current Ubuntu/Debian package being broken, we should get > in touch with them about updating it... otherwise we can look forward to bug > reports about our wrapper and/or test_Mafft_tool.py failing. Built from source on Jaunty; it passes our tests. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From eric.talevich at gmail.com Sun May 10 05:22:46 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 9 May 2009 22:22:46 -0700 Subject: [Biopython-dev] PhyloXML read/parse functions and handles In-Reply-To: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com> References: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com> Message-ID: <3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com> On Sat, May 9, 2009 at 6:06 AM, Peter wrote: > Hi Eric, > > Are you happy to have feedback on your PhyloXML code in public? Sure am! I was just getting around to drafting up some questions for biopython-dev, but I'm glad to receive some preemptive advice. I just had a look at the stub in Bio/PhyloXML/__init__.py and > Bio/PhyloXML/Parser.py on your github branch, > http://github.com/etal/biopython/tree/phyloxml > > The convention we are following in Biopython for parsing functions is > as follows: > read(handle, ...) - returns a single object (e.g. a tree in your case) > parse(handle, ...) - returns an iterator (e.g. returning multiple trees) > > I noticed that; I'll change the Bio.PhyloXML.Parser.parse() stub to read() and have it behave as expected. The function currently allows either filenames or file handles as the source because ElementTree.iterparse() also accepts either object as a source. The read() function could "assert not isinstance(infile, str)", I guess... The existing Java implementation in Forester/ATV has even more magic, automatically performing Zip extraction if the given filename ends with '.zip'. Since this looks like it will be a pretty common use case, at least for big files, I thought it would be nice to also offer a wrapper function that takes a filename and does the Right Thing -- that's what __init__.read() does currently. Is there a precedent for this in Biopython? 
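The read()/parse() convention quoted above can be sketched as follows. This is a minimal illustration only, written for current Python 3 and using plain Newick strings instead of real tree objects; the function bodies are assumptions for the example and are not the code on the phyloxml branch. It also ignores quoted Newick labels that might contain a semicolon.

```python
import io


def parse(handle):
    """Yield each semicolon-terminated tree string from a text handle."""
    buffer = ""
    for line in handle:
        buffer += line.strip()
        # A semicolon ends a tree, so one tree may span several lines
        # and several trees may share a line.
        while ";" in buffer:
            tree, buffer = buffer.split(";", 1)
            yield tree + ";"
    if buffer:
        raise ValueError("Trailing text without a semicolon terminator")


def read(handle):
    """Return the single tree in the handle, or raise ValueError."""
    trees = parse(handle)
    try:
        first = next(trees)
    except StopIteration:
        raise ValueError("No trees found in handle")
    for _ in trees:
        raise ValueError("More than one tree found in handle")
    return first


# Any handle works: here an in-memory StringIO instead of a file on disk.
for tree in parse(io.StringIO("(A,B);\n(A,\n(B,C));\n")):
    print(tree)
print(read(io.StringIO("(A,(B,C));\n")))
```

Because both functions only iterate over the handle, the same code would accept an open file, a URL handle, or piped output without changes.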
The name should probably be something different; in the pdbtidy branch I used load(), to match the Pickle module, since the wrapper function does more than just parse or read a file. So how about: from Bio import PhyloXML handle = open('somefile', 'r') # file-like object from any source tree = PhyloXML.read(handle) Equivalent to: from Bio import PhyloXML tree = PhyloXML.load('somefile') # DTRT for xml, zip, gz, ...? Or, to be explicit, offer a read_zip or load_zip function. I'd leave well enough alone, but the incantation to extract a character stream from a single zipped file is kind of unintuitive, and one of the three example files on phyloxml.org is already zipped. (I should really ask Christian Zmasek about this to see if that's a real convention or not.) P.S. Finally, a more general note about a possible "Bio.TreeIO" > module. For simple Newick trees, a single file can contain one or more > trees (e.g. from bootstrapping). A tree can be split over multiple > lines (but may be one long line), but multiple trees can be split up > because they should all have a semicolon terminator. For Nexus files, > I'm not sure off hand if there can be more than one tree. If you are > going to use the Tree objects from Bio.Nexus, then we could provide a > "Bio.TreeIO" module with read/parse/write methods coping with > "newick", "nexus", "phyloxml" formats, all using the same tree > objects. > OK, I'll give it a try. Brad recommended that I just get a simple PhyloXML parser working first before attempting integration, but if some of Bio.Nexus can be reused in that process, great. I'm about to go dark from the end of this week until 3/31 (getting married, yaknow), but I'll fix all this code when I get back and have access to git again. 
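For the zip "incantation" mentioned above, a rough sketch of turning a compressed file into an ordinary text handle might look like this. It is written for current Python 3 with only the standard library; the helper names open_zip_member and open_gzip are hypothetical and do not exist in Biopython, and UTF-8 is an assumed encoding.

```python
import gzip
import io
import zipfile


def open_zip_member(source, member=None):
    """Return a text handle for one file inside a zip archive.

    source may be a filename or a binary file-like object. If member is
    None the archive must hold exactly one file, so no guess is made.
    """
    archive = zipfile.ZipFile(source)
    names = archive.namelist()
    if member is None:
        if len(names) != 1:
            raise ValueError("Expected one file in archive, found %i" % len(names))
        member = names[0]
    # ZipFile.open() gives a binary stream; wrap it to get text lines.
    return io.TextIOWrapper(archive.open(member), encoding="utf-8")


def open_gzip(source):
    """Return a text handle for a gzip-compressed file or file-like object."""
    return gzip.open(source, "rt", encoding="utf-8")
```

Either handle could then be passed to a handle-based read() function, which keeps decompression out of the parser itself.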
Thanks for your help,
Eric

From biopython at maubp.freeserve.co.uk Sun May 10 09:22:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 10 May 2009 10:22:21 +0100
Subject: [Biopython-dev] PhyloXML read/parse functions and handles
In-Reply-To: <3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com>
References: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com> <3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com>
Message-ID: <320fb6e00905100222n22b7670dre26f9368726fce68@mail.gmail.com>

On Sun, May 10, 2009 at 6:22 AM, Eric Talevich wrote:
>
> The function currently allows either filenames or file handles as the source
> because ElementTree.iterparse() also accepts either object as a source. The
> read() function could "assert not isinstance(infile, str)", I guess...

Interesting - ReportLab also allows filenames or handles. If this truly is a widespread or growing trend in Python libraries, maybe we should do this as well.

> The existing Java implementation in Forester/ATV has even more magic,
> automatically performing Zip extraction if the given filename ends with
> '.zip'. Since this looks like it will be a pretty common use case, at least
> for big files, I thought it would be nice to also offer a wrapper function
> that takes a filename and does the Right Thing -- that's what
> __init__.read() does currently. Is there a precedent for this in Biopython?

Note that Bio.Nexus does this already, making it a bit inconsistent with the rest of Biopython. I guess no one noticed or commented back when it was added.

> The name should probably be something different; in the pdbtidy branch I
> used load(), to match the Pickle module, since the wrapper function does
> more than just parse or read a file.
>
> So how about:
>
> from Bio import PhyloXML
> handle = open('somefile', 'r') # file-like object from any source
> tree = PhyloXML.read(handle)
>
> Equivalent to:
>
> from Bio import PhyloXML
> tree = PhyloXML.load('somefile') # DTRT for xml, zip, gz, ...?
>
> Or, to be explicit, offer a read_zip or load_zip function.

I prefer the more explicit read_zip idea, you would also have an optional argument for the filename within the zip file. However, I'm not yet convinced we need this function.

> I'd leave well enough alone, but the incantation to extract a character
> stream from a single zipped file is kind of unintuitive, and one of the
> three example files on phyloxml.org is already zipped. (I should really
> ask Christian Zmasek about this to see if that's a real convention or
> not.)

Do you want to find out if this really is a phyloxml.org convention first? If this is their convention, it surprises me they didn't go for .gz files, which in my experience are more widely used in Bioinformatics (e.g. at the NCBI and PDB). These are supported cross platform and hold one single file (often a tarred file containing multiple files). A zip file can hold multiple files, which means you have to make extra assumptions (e.g. you are using the first file in your code).

>> P.S. Finally, a more general note about a possible "Bio.TreeIO"
>> module. For simple Newick trees, a single file can contain one or more
>> trees (e.g. from bootstrapping). A tree can be split over multiple
>> lines (but may be one long line), but multiple trees can be split up
>> because they should all have a semicolon terminator. For Nexus files,
>> I'm not sure off hand if there can be more than one tree. If you are
>> going to use the Tree objects from Bio.Nexus, then we could provide a
>> "Bio.TreeIO" module with read/parse/write methods coping with
>> "newick", "nexus", "phyloxml" formats, all using the same tree
>> objects.
>>
>
> OK, I'll give it a try.
Brad recommended that I just get a simple PhyloXML
> parser working first before attempting integration, but if some of Bio.Nexus
> can be reused in that process, great.

Brad is right - getting a simple PhyloXML parser working is the first step. It would be sensible to look at the Bio.Nexus tree structure though.

> I'm about to go dark from the end of this week until 3/31 (getting
> married, yaknow), but I'll fix all this code when I get back and have
> access to git again.

Congratulations - it looks like you've got a proper break scheduled as well :)

Peter

From bugzilla-daemon at portal.open-bio.org Sun May 10 13:50:50 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 10 May 2009 09:50:50 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200905101350.n4ADoo7x001186@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820

------- Comment #13 from eric.talevich at gmail.com 2009-05-10 09:50 EST -------
(In reply to comment #12)
> Would it be easy for you to test your code on Python 2.4? I can probably do
> that but not right now...

Yes, I can do that, but only on Linux. I don't think there's anything platform-specific here, though.

> I would prefer to avoid the extra file by writing this test as part of
> test_PDB_unit.py - but the "with" statement isn't valid syntax on Python 2.4,
> although it can be used on Python 2.5 via:
> from __future__ import with_statement
>
> Could you re-write this to avoid the with statement?

I think the with statement is isomorphic to a try-except-finally arrangement, calling the context manager's __enter__ method in the try block and __exit__ in the finally block. I'll look at the source code of the warnings module and maybe just copy a substantial chunk of it into this unit test (assuming it's pure Python). That might make it possible to support Py2.4, too.
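The try/finally idea can be sketched roughly as below. This is an illustration only, written for current Python 3; collect_warnings is a hypothetical helper, not code from test_PDB.py or from the warnings module. Note that in the PEP 343 expansion the set-up step (the job of __enter__) runs just before the try block and the tear-down step (the job of __exit__) runs in the finally clause.

```python
import warnings


def collect_warnings(func, *args, **kwargs):
    """Call func and return (result, list of warning messages).

    Does by hand what a 'with' block around the call would do:
    save global state, replace it, and always restore it afterwards.
    """
    caught = []

    def record(message, category, filename, lineno, file=None, line=None):
        # Same signature as warnings.showwarning; just remember the message.
        caught.append(message)

    # Set-up: save the global state that is about to be changed.
    saved_showwarning = warnings.showwarning
    saved_filters = warnings.filters[:]
    warnings.showwarning = record
    try:
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    finally:
        # Tear-down: runs even if func raises an exception.
        warnings.showwarning = saved_showwarning
        warnings.filters[:] = saved_filters
    return result, caught
```

Since the warnings module keeps its state in module-level globals, restoring it in the finally clause is what stops one test from leaking settings into the next; this sketch is not thread-safe.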
> Ideally the unit tests should work in any order - and this is generally a > reasonable assumption, as they should be independent. Having some carefully > named unit tests will only hide the ordering problem (which is due to the > global state information in the warnings module). At the very least, we should > probably have comments in the code about this (to avoid issues in the future) > and maybe use an eye-catching name like AAAAA which should always come first. > Agreed. I'll tinker with it some more to see what can be improved here. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon May 11 12:40:49 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 11 May 2009 08:40:49 -0400 Subject: [Biopython-dev] [Bug 2783] Using alternative start codons in Bio.Seq translate method/function In-Reply-To: Message-ID: <200905111240.n4BCenqD006754@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2783 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-11 08:40 EST ------- Created an attachment (id=1298) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1298&action=view) Patch for Bio/Seq.py to support complete CDS translation with non-standard start codons I've recently been doing CDS translations for viral/bacterial genes with alternative start codons - and would like to fix this limitation in Biopython, rather than having to hack around it. On Bug 2381, comment #14, I wrote: > For comparison, the following is copied from the BioPerl documentation about > their sequence object's translate method. It would be nice to follow some of > the same naming conventions for any optional arguments. 
> > http://www.bioperl.org/Core/Latest/bptutorial.html#iii_3_1_manipulating_sequence_data_with_seq_methods > > If we want to translate full coding regions (CDS) the way major nucleotide > databanks EMBL, GenBank and DDBJ do it, the translate() method has to perform > more checks. Specifically, translate() needs to confirm that the sequence has > appropriate start and terminator codons at the very beginning and the very end > of the sequence and that there are no terminator codons present within the > sequence in frame 0. In addition, if the genetic code being used has an > atypical (non-ATG) start codon, the translate() method needs to convert the > initial amino acid to methionine. These checks and conversions are triggered > by setting ``complete'' to 1: > > $prot_obj = $my_seq_object->translate(-complete => 1); > On Bug 2381, comment #51, Leighton wrote: > In terms of nomenclature: > > The default behaviour of translate() as Peter proposed: read through in-frame > and translate with the appropriate codon table - is fine in nearly all > circumstances. Most other circumstances are covered by stopping at the first > in-frame stop codon, which Peter has implemented, and is an option we all seem > to agree on. > > Biologically-speaking, this behaviour is not always correct for CDS in > prokaryotes, where alternative start codons may occur a significant minority > of the time. These will be mistranslated if no provision is made for them. I > think a useful biological sequence object should at least try to mimic actual > biology, so we should provide an option to handle this. > > We should not assume that a sequence is a CDS unless it is specified by the > user. It seems reasonable to me that the term 'cds' should occur in any such > argument from the user. 
>
> We have at least two options for how to proceed with a CDS: i) we can provide
> a strict CDS-type translation, which requires confirmation that the sequence
> is, in fact, a CDS; ii) we can provide a weak CDS-type translation, which only
> modifies the way the start codon is translated. In both cases, behaviour is
> specific to CDS, and so having 'cds' in the argument name *somewhere* seems
> obvious, and entirely reasonable.

Leighton's option (ii) is start codon only modification. This is what I implemented in the patch on comment 1 (attachment 1259). We haven't agreed on a good name for this - which is partly why I went back to revisit the alternative:

Leighton's option (i) is strict CDS-type translation. As Leighton suggests, having "cds" in the argument name here makes sense.

Regarding the BioPerl argument name for this functionality, "complete", on Bug 2381 comment 19, Martin wrote:
> The "complete" is a cryptic naming, I wouldn't be fond of it.
>

I think you are both right about the naming. Would complete_cds=True be clear? In fact, I quite like the idea of using cds=True which is short and also fairly clear.

This patch adds a complete_cds=Boolean argument to the Bio.Seq translate methods and function, which should act like the BioPerl equivalent. It includes doctests showing the new functionality.

I would like to use either of these approaches in Biopython - but not both ;)
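As a rough illustration of the strict CDS-type checks described above (valid start codon, terminal stop codon, no in-frame internal stops, first residue forced to methionine), here is a standalone sketch. It is not the attached patch and does not use Bio.Seq; translate_cds is a hypothetical name, the amino-acid string is the standard genetic code laid out in TCAG order, and the start-codon set is an assumption based on the NCBI bacterial table (table 11).

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Assumed start codons for the bacterial table (NCBI table 11).
BACTERIAL_STARTS = frozenset(["TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"])


def translate_cds(dna, start_codons=BACTERIAL_STARTS):
    """Translate a complete CDS, raising ValueError if it is not one."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("Sequence length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if len(codons) < 2:
        raise ValueError("Need at least a start and a stop codon")
    if codons[0] not in start_codons:
        raise ValueError("First codon %r is not a start codon" % codons[0])
    if CODON_TABLE.get(codons[-1]) != "*":
        raise ValueError("Final codon %r is not a stop codon" % codons[-1])
    protein = ["M"]  # an alternative start codon still encodes methionine
    for codon in codons[1:-1]:
        residue = CODON_TABLE.get(codon)
        if residue is None:
            raise ValueError("Cannot translate codon %r" % codon)
        if residue == "*":
            raise ValueError("Extra in-frame stop codon found")
        protein.append(residue)
    return "".join(protein)


# GTG would normally translate as valine; as a CDS start it becomes M.
print(translate_cds("GTGGCCTAA"))
```

The weaker option (ii) would keep only the first-residue substitution and drop the other checks.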
From bugzilla-daemon at portal.open-bio.org Mon May 11 20:00:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 11 May 2009 16:00:29 -0400 Subject: [Biopython-dev] [Bug 2826] New: when creating a de-novo SeqRecord, the dbxrefs are not written by SeqIO.write Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2826 Summary: when creating a de-novo SeqRecord, the dbxrefs are not written by SeqIO.write Product: Biopython Version: 1.49 Platform: All OS/Version: Linux Status: NEW Severity: minor Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk Hi when creating a SeqRecord de novo, the dbxrefs are not written by SeqIO.write. Is this the intended behaviour? here is an example: # example script from Bio.Seq import Seq from Bio import SeqIO from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_protein # list to hold output records outlist=[] # ofh is the output file handle ofh = open("/home/dwyllie/temporary.gbk","w") # example of de novo creation of SeqRecord object from url: # http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT", generic_protein), \ id="NP_418483.1", name="b4059", description="ssDNA-binding protein", \ dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"]) print rec outlist.append(rec) count = SeqIO.write(outlist, ofh, "genbank") ofh.close() # end of script OUTPUT: ID: NP_418483.1 Name: b4059 Description: ssDNA-binding protein Database cross-references: ASAP:13298, GI:16131885, GeneID:948570 Number of features: 0 Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT', ProteinAlphabet()) Contents of temporary.gbk: LOCUS b4059 46 bp UNK 01-JAN-1980 DEFINITION ssDNA-binding protein ACCESSION NP_418483 VERSION NP_418483.1 KEYWORDS . SOURCE . ORGANISM . . 
FEATURES             Location/Qualifiers
ORIGIN
        1 MASRGVNKVI LVGNLGQDPE VRYMPNGGAV ANITLATSES WRDKAT
//

From bugzilla-daemon at portal.open-bio.org Mon May 11 20:29:02 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 11 May 2009 16:29:02 -0400
Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank by SeqIO
In-Reply-To:
Message-ID: <200905112029.n4BKT2x0024871@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2826

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|when creating a de-novo     |SeqRecord dbxrefs not
                   |SeqRecord, the dbxrefs are  |written to GenBank by SeqIO
                   |not written by SeqIO.write  |

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-11 16:29 EST -------
Hi David,

Thank you for another interesting bug report. See here for what the NCBI uses in a GenPept file for this example protein, NP_418483.1
http://www.ncbi.nlm.nih.gov/protein/16131885

The ASAP and GeneID numbers are not recorded at the sequence level - there is nowhere in the GenBank file format to put them. They are however recorded within a CDS feature on the link above. So, if you want these recorded, you'd have to create a SeqFeature with the information (you can't use the SeqRecord's dbxrefs list).

The GI number would get written, but due to an anomaly in the GenBank parser this is currently stored in the annotations dictionary under the key "gi", so this is where the GenBank writer looks for this. We should probably switch to recording this in the dbxrefs as "gi:12345" as well/instead, and look for this GI number there instead/as well.
Currently when parsing GenBank files, the only thing stored in the SeqRecord's dbxref list is a PROJECT line cross reference (see Bug 2225). Looking at the code, we don't currently record that - we should. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon May 11 22:55:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 11 May 2009 18:55:21 -0400 Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank by SeqIO In-Reply-To: Message-ID: <200905112255.n4BMtLFc004295@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2826 ------- Comment #2 from david.wyllie at ndm.ox.ac.uk 2009-05-11 18:55 EST ------- Thank you. I'm new to BioPython. The goal was to take some whole-genome sequence (which isn't in Genbank) and attach a taxon to it, in order that it be written to a BioSQL database. Other records in the BioSQL database derive from NCBI and so have taxon_ids, so the additional WGS being in a similar format would make things simpler. Thank you very much for all your assistance. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From cy at cymon.org Tue May 12 11:07:59 2009
From: cy at cymon.org (Cymon Cox)
Date: Tue, 12 May 2009 12:07:59 +0100
Subject: [Biopython-dev] Clustal alignment format header line
Message-ID: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>

Both Muscle (-clw) and Probcons (-clustalw) output a programme specific header line for the clustal format alignment:

"MUSCLE (3.7) multiple sequence alignment


AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMAGVLEAR etc"

"PROBCONS version 1.12 multiple sequence alignment

AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMA

"

Bio.AlignIO will not read these alignments
Bio/AlignIO/ClustalIO.py:94
    if line[:7] != 'CLUSTAL':
        raise ValueError("Did not find CLUSTAL header")

Muscle does have a -clwstrict flag but ProbCons doesn't.

Would it be a good idea to relax the header parsing?

C.
--

From biopython at maubp.freeserve.co.uk Tue May 12 15:28:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 16:28:35 +0100
Subject: [Biopython-dev] Clustal alignment format header line
In-Reply-To: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>
References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>
Message-ID: <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com>

On Tue, May 12, 2009 at 12:07 PM, Cymon Cox wrote:
> Both Muscle (-clw) and Probcons (-clustalw) output a programme specific
> header line for the clustal format alignment:
>
> "MUSCLE (3.7) multiple sequence alignment
>
>
> AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMAGVLEAR etc"
>
> "PROBCONS version 1.12 multiple sequence alignment
>
> AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMA
>
> "
>
> Bio.AlignIO will not read these alignments
> Bio/AlignIO/ClustalIO.py:94
>     if line[:7] != 'CLUSTAL':
>         raise ValueError("Did not find CLUSTAL header")
>
> Muscle does have a -clwstrict flag but ProbCons doesnt.
>
> Would it be a good idea to relax the header parsing?
>
> C.

Maybe.
Up until now the only example of this I had personally come across was MUSCLE, but they helpfully provide the -clwstrict argument so the issue wasn't important. There are also of course the official variants like: CLUSTAL W (1.81) multiple sequence alignment CLUSTAL 2.0.9 multiple sequence alignment How would you code this? A flexible option would be to take anything where the first line ends with "multiple sequence alignment", but this risks letting a lot of non-clustal files though which will then (hopefully) fail, but probably with a much more cryptic error message. A white list of safe variants like "MUSCLE" and "PROBCONS" would be safest. Also I have a vague memory of some tool using something like "CLUSTAL ... from ToolX" but I don't recall the details. Peter From cy at cymon.org Tue May 12 15:43:47 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 12 May 2009 16:43:47 +0100 Subject: [Biopython-dev] Clustal alignment format header line In-Reply-To: <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com> References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com> <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com> Message-ID: <7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com> 2009/5/12 Peter > On Tue, May 12, 2009 at 12:07 PM, Cymon Cox wrote: > > Both Muscle (-clw) and Probcons (-clustalw) output a programme specific > > header line for the clustal format alignment: > > > > "MUSCLE (3.7) multiple sequence alignment > > > > > > AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMAGVLEAR etc" > > > > "PROBCONS version 1.12 multiple sequence alignment > > > > AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMA > > > > " > > > > Bio.AlignIO will not read these alignments > > Bio/AlignIO/ClustalIO.py:94 > > if line[:7] != 'CLUSTAL': > > raise ValueError("Did not find CLUSTAL header") > > > > Muscle does have a -clwstrict flag but ProbCons doesnt. > > > > Would it be a good idea to relax the header parsing? > > > > C. > > Maybe. 
Up until now the only example of this I had personally come > across was MUSCLE, but they helpfully provide the -clwstrict argument > so the issue wasn't important. > > There are also of course the official variants like: > > CLUSTAL W (1.81) multiple sequence alignment > CLUSTAL 2.0.9 multiple sequence alignment > > How would you code this? A flexible option would be to take anything > where the first line ends with "multiple sequence alignment", but this > risks letting a lot of non-clustal files though which will then > (hopefully) fail, but probably with a much more cryptic error message. > A white list of safe variants like "MUSCLE" and "PROBCONS" would be > safest. > > Also I have a vague memory of some tool using something like "CLUSTAL > ... from ToolX" but I don't recall the details. T-COFFEE for one: "CLUSTAL FORMAT for T-COFFEE Version_6.92 [http://www.tcoffee.org] [MODE: ], CPU=0.00 sec, SCORE=100, Nseq=2, Len=601" Is it so bad to let it fail on the structure of the data - effectively ignore the header? Maybe have a general "this doesnt look like clustal formatted data" error based on the data structure... C. -- From biopython at maubp.freeserve.co.uk Tue May 12 16:05:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 May 2009 17:05:15 +0100 Subject: [Biopython-dev] Loading SeqRecords into BioSQL with NCBI taxon ID Message-ID: <320fb6e00905120905l3d0e31b2p2a3d92f61096cbd5@mail.gmail.com> Over on Bug 2826, David wrote: http://bugzilla.open-bio.org/show_bug.cgi?id=2826#c2 > Thank you. I'm new to BioPython. > > The goal was to take some whole-genome sequence (which isn't in Genbank) and > attach a taxon to it, in order that it be written to a BioSQL database. You've talked about trying to parse WGS GenBank files on Bug 2825 but presumable if this new data isn't in GenBank, it is in another format. What format is your whole-genome sequence? FASTA or something simple? 
> Other records in the BioSQL database derive from NCBI and so have taxon_ids,
> so the additional WGS being in a similar format would make things simpler.

I see. Basically you need to import a SeqRecord into BioSQL with an NCBI taxon ID. You don't need to write out a GenBank file to do this.

First create the SeqRecord, e.g.

    from Bio import SeqIO
    record = SeqIO.read(handle, format, alphabet)

There are now two options - because the BioSQL loader will look for the NCBI taxon ID in two places:

(Option 1) Record the NCBI taxon ID in the SeqRecord's annotation dictionary under the "ncbi_taxid" key. This should work (untested):

    record.annotations["ncbi_taxid"] = 12345 #or single element list, [12345]

(Option 2) Mimic a SeqRecord from parsing a GenBank file with a source feature containing the taxon ID. This should work (untested):

    #Create the SeqRecord:
    record = SeqIO.read(handle, format, alphabet)
    #Create the source features:
    from Bio.SeqFeature import SeqFeature, FeatureLocation
    f = SeqFeature(FeatureLocation(0, len(record)), strand=+1, type="source")
    f.qualifiers["db_xref"] = ["taxon:12345"]
    record.features = [f] #or insert at start

If you don't really have a sequence, this second approach doesn't make so much sense.

[Arguably there could be a third option via the dbxref's list]

Then in either case, load the modified SeqRecord into the database. You may want to pre-load the NCBI taxonomy, see http://www.biopython.org/wiki/BioSQL

Alternatively, using Biopython 1.49+ you can have this fetched from Entrez on demand with the fetch_NCBI_taxonomy=True option. The BioSQL wiki page needs updating on this topic.
Peter From bugzilla-daemon at portal.open-bio.org Tue May 12 16:11:43 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 12 May 2009 12:11:43 -0400 Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank by SeqIO In-Reply-To: Message-ID: <200905121611.n4CGBhrY001864@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2826 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-12 12:11 EST ------- (In reply to comment #2) > Thank you. I'm new to BioPython. > > The goal was to take some whole-genome sequence (which isn't in Genbank) and > attach a taxon to it, in order that it be written to a BioSQL database. For this example you don't need to write out a GenBank file at all (which is what this bug was about). See my email on the mailing list for details: http://lists.open-bio.org/pipermail/biopython/2009-May/005154.html and sent in error to the dev list: http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006028.html I am leaving this bug open for relevant dbxrefs entries not currently recorded when writing GenBank files with Bio.SeqIO (GI number which goes on the VERSION line, and genome projects on the PROJECT / DBLINK line). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
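On the Clustal header question discussed in this thread, the whitelist idea could be sketched along these lines. This is an illustration only, not the code in Bio/AlignIO/ClustalIO.py; check_clustal_header and KNOWN_HEADERS are hypothetical names, and the accepted prefixes are just the ones mentioned in the thread.

```python
# Header prefixes reported so far; T-COFFEE output already starts with CLUSTAL.
KNOWN_HEADERS = ("CLUSTAL", "PROBCONS", "MUSCLE")


def check_clustal_header(line):
    """Return the stripped header line, or raise ValueError if unrecognised."""
    if not line.startswith(KNOWN_HEADERS):
        raise ValueError(
            "%r is not a known Clustal-style header; expected a first line "
            "starting with one of: %s. Are you sure this is a Clustal "
            "format file?" % (line.strip(), ", ".join(KNOWN_HEADERS))
        )
    return line.strip()


for header in (
    "CLUSTAL W (1.81) multiple sequence alignment\n",
    "MUSCLE (3.7) multiple sequence alignment\n",
    "PROBCONS version 1.12 multiple sequence alignment\n",
):
    print(check_clustal_header(header))
```

Compared with accepting any first line that ends in "multiple sequence alignment", a prefix whitelist rejects unrelated files early with a clear message, at the cost of needing an update when a new tool turns up.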
From biopython at maubp.freeserve.co.uk Tue May 12 16:16:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 May 2009 17:16:35 +0100 Subject: [Biopython-dev] Clustal alignment format header line In-Reply-To: <7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com> References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com> <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com> <7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com> Message-ID: <320fb6e00905120916p3db7c003kf6eef581cbb4c93b@mail.gmail.com> On Tue, May 12, 2009 at 4:43 PM, Cymon Cox wrote: >Peter wrote: >> Also I have a vague memory of some tool using something like "CLUSTAL >> ... from ToolX" but I don't recall the details. > > T-COFFEE for one: > "CLUSTAL FORMAT for T-COFFEE Version_6.92 [http://www.tcoffee.org] [MODE: > ], CPU=0.00 sec, SCORE=100, Nseq=2, Len=601" Yes - that is almost certainly the example I was thinking of. > Is it so bad to let it fail on the structure of the data - effectively > ignore the header? Maybe have a general "this doesnt look like clustal > formatted data" error based on the data structure... Some of the current error messages are a little cryptic to an end user, I guess they could have "Are you sure this is a Clustal format file?" appended to them. I'd be happy with a whitelist of variant headers, i.e. must start with "CLUSTAL", "MUSCLE" or "PROBCONS" (assuming these tools don't write their own file formats which also start that way!). If people find new cases and report them, it also gives us notice about another tool we may want to include in our command line wrappers, and/or obtain sample output files for the unit tests. Peter From biopython at maubp.freeserve.co.uk Tue May 12 17:14:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 May 2009 18:14:27 +0100 Subject: [Biopython-dev] history on github - where are the tags? 
In-Reply-To: <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com> Message-ID: <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com> On Tue, Apr 28, 2009 at 6:50 PM, Bartek Wilczynski wrote: > On Tue, Apr 28, 2009 at 7:45 PM, Peter wrote: >> I take that back - I added an email address of just "peterc" to my >> github account (it seems they don't do any validation, perhaps for >> this very reason?). This had no immediate effect, but one day later >> and all my CVS commits are now shown with my photo in github. Neat - > > great That seems to have stopped working now - no idea why, "peterc" is still listed as one of my email addresses on my github account, but my github account is no longer linked to commits in Biopython. Odd. Do you think it would be straightforward for your CVS to git conversion to map the CVS usernames to github usernames for future commits (so as not to alter the currently published history)?
Peter From bugzilla-daemon at portal.open-bio.org Tue May 12 17:33:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 12 May 2009 13:33:03 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905121733.n4CHX3jK009739@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #32 from cymon.cox at gmail.com 2009-05-12 13:33 EST ------- Added PROBCONS and TCOFFEE command line interfaces and unittests. The TCOFFEE command line wrapper implements a very restricted set of options (just those Brad attached). Also added a whitelist of known headers to AlignIO/ClustalwIO.py:97 - the PROBCONS unittest will fail without this alteration. On http://github.com/cymon/biopython-github-master/tree/applic-int C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bartek at rezolwenta.eu.org Tue May 12 18:23:18 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 12 May 2009 20:23:18 +0200 Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com> <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com> Message-ID: <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com> On Tue, May 12, 2009 at 7:14 PM, Peter wrote: > That seems to have stopped working now - no idea why, "peterc" is > still listed as one of my email addresses on my github account, but my > github account is no longer linked to commits in Biopython. Odd. It seems to be OK again. Maybe it was temporary? > > Do you think it would be straightforward for your CVS to git > conversion to map the CVS usernames to github usernames for future > commits (so as not to alter the currently published history)? > It would be straightforward to add a mapping to the conversion, but I think it would affect the whole history... I was thinking that the mapping was going to change when we finally switch to git. Then it would be a natural course of events... Otherwise, we would have another step in our transition. Whether it's worth doing depends on how long we expect to be in the transition between CVS and git.
cheers Bartek From bugzilla-daemon at portal.open-bio.org Tue May 12 18:44:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 12 May 2009 14:44:09 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905121844.n4CIi9sb017010@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1296 is|0 |1 obsolete| | ------- Comment #33 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-12 14:44 EST ------- (From update of attachment 1296) (In reply to comment #32) > Added PROBCONS and TCOFFEE command line interfaces and unittests. > > The TCOFFEE commadline implements a very restricted set of options > (just those Brad attached). > > Also added white list of known headers to AlignIO/ClustalwIO.py:97 - the > PROBSCONS unittest will fail without this alteration. > > On http://github.com/cymon/biopython-github-master/tree/applic-int Thank you Cymon and Brad - those are now checked in, more or less as is. I did tweak Bio/AlignIO/ClustalwIO.py a little bit. Also, TCoffee says it can be installed on Windows using Cygwin - we should try that at some point ;) Note for the TCoffee suite we could also consider adding xpresso, 3dcoffee, mcoffee and rcoffee as well - hopefully they have similar interfaces so with some subclassing we won't have to duplicate a lot of the code. One other thought - do you think the EMBOSS water and needle wrappers (and any other alignment tools in EMBOSS) be made available under Bio.Align.Applications (via an import in Bio/Align/Applications/__init__.py so no code duplication)? 
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue May 12 18:57:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 May 2009 19:57:24 +0100 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com> <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com> <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com> Message-ID: <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com> On Tue, May 12, 2009 at 7:23 PM, Bartek Wilczynski wrote: > > I was thinking that the mapping was going to change when we finally > switch to git. Then it would be a natural cause of events... > Otherwise, we would have another step in our transition. Whether it's > worth doing it, depends on how long we expect to be in the transition > between CVS and git. I'm happy that git will work, and that I personally know enough about the basics to manage. I'm not happy with the current github repository due to the history tag issue - but we know we can fix that now. Are you going to try removing the old tags and re-doing them on github? Does anyone know how the git provided "ViewCVS" equivalent shows tags in a file's history? 
I think we should now have a chat with the OBF (off list) about how we might go about installing git on their server. Commits can then be pushed out to github automatically (or pulled from github if we go the other way round). This would make several things easier: (1) Seamless continuation of existing user accounts (2) Keeping the snapshot code up to date: http://biopython.org/SRC/biopython/ (3) Having our own commit RSS feeds (not essential as this could be done on github) (4) Having automatic builds of the documentation (previously discussed as nice to have) Plus of course giving redundancy with the code mirrored on both OBF servers and GitHub :) Peter From bugzilla-daemon at portal.open-bio.org Tue May 12 19:45:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 12 May 2009 15:45:12 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905121945.n4CJjCFj023070@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #34 from cymon.cox at gmail.com 2009-05-12 15:45 EST ------- (In reply to comment #33) > Note for the TCoffee suite we could also consider adding xpresso, 3dcoffee, > mcoffee and rcoffee as well - hopefully they have similar interfaces so with > some subclassing we won't have to duplicate a lot of the code. With the latest version of t_coffee (and not the currently available Jaunty package!), these (ie the meta calls like mcoffee etc) are all covered by the "-mode" option. I just installed t_coffee from source and this appears to be the case. There are so many options and interdependencies in TCOFFEE, and its command line is clearly a moving target, that the interface may require more work before being released. 
> One other thought - do you think the EMBOSS water and needle wrappers (and any > other alignment tools in EMBOSS) be made available under Bio.Align.Applications > (via an import in Bio/Align/Applications/__init__.py so no code duplication)? Sounds good to me. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue May 12 23:04:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 May 2009 00:04:53 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> <20090413123219.GB5429@sobchak.mgh.harvard.edu> <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> <20090413134429.GE5429@sobchak.mgh.harvard.edu> <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com> Message-ID: <320fb6e00905121604q4c70d69ck35fb16210fb0efe2@mail.gmail.com> On Mon, Apr 13, 2009 at 2:49 PM, Peter wrote: > On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman wrote: >>> > ... Feel free to add away. >>> >>> I need to work on my delegation skills - that seems to have back fired ;) >> >> Oops. I honestly read that as "do I have your permission?" I can of >> course tackle this, but am a bit underwater now. > > Looking back, I was a bit ambiguous. I don't mind who does it - let's > see who has time free first. That's done in CVS now - plus a few other things like -die and -stdout. I've also done -outfile via the new base Emboss wrapper, as all the tools (so far at least) include this option. >>> Regarding adding -auto support, I have a question about the needle >>> wrapper and the gap parameters. Using the needle tool at the command >>> line will prompt for the gap parameters UNLESS the -auto argument has >>> been used. i.e. 
Without -auto, it makes sense to insist on the gap >>> parameters being included, which is what the current wrapper does. >>> However, if we add support for -auto, then these parameters can be >>> optional. We could handle this in the wrapper, but it would be messy >>> (and there may be similar questions with other EMBOSS tools). What do >>> you think - stick with the simple option of insisting the Biopython >>> user set the gap parameters, even if they are using -auto? >> >> I think we should stick with the simple option. These were meant to >> be pretty dumb specifiers that help users write more modular code than >> simply pasting in a raw string for the command line. Trying to get >> too fancy is probably overkill. > > Agreed. By putting the outfile argument on the base EMBOSS wrapper class, together with the related -filter and -stdout options, I was able to enforce a simple check that at least one of these is used, applicable to all the wrappers. This preserves the old safety check that the output file is required (unless using standard out via -filter and/or -stdout instead). Something similar could be done so that using -auto overrides the any "required" flags we have set (e.g. for gapopen in water), but this seems unnecessary to me (as discussed above). Peter From biopython at maubp.freeserve.co.uk Wed May 13 09:55:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 May 2009 10:55:06 +0100 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <20090505123656.GB15113@sobchak.mgh.harvard.edu> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> <20090505123656.GB15113@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com> On Mon, May 4, 2009, Peter wrote: >>> ... The (hardly used) existing blastall wrapper in >>> Bio/Blast/Applications.py gives the "-a" argument a human >>> readable name of "nprocessors", and "-A" gets "window_size". 
>>> With the old set_parameter call either alias could be used. >>> However, with a python property we need to pick one as a >>> preferred name - and I'm not 100% sure being helpful and >>> using "nprocessors" (e.g. cline.nprocessors=4) is actually >>> better than using the actual argument name (e.g. cline.a = 4). On Tue, May 5, 2009, Brad wrote: >> Could we support both the original argument and optional human >> readable arguments? I know the code in Application is a bit >> hard coded for the first argument as the real name and the last >> argument as the readable name; the cleanest solution would be to >> generalize this to have multiple names where it makes sense. >> ... On Tue, May 5, 2009, Peter wrote: > ... > I favour using only a single property for each parameter, with the > name as similar as possible to the actual command line switch (i.e. > property name "a" for "-a", not "nprocessors"). Note each property > would have a docstring which will say what is it for ("Number of > processors to use."). I still favour only using a single python property for each parameter, but after some work on the blastall wrapper last night, I am beginning to come round to your point of view. If a command line tool provides a long parameter name (some tools provide both short and long names for important parameters) we should use that rather than inventing our own [so no change here]. However, for tools like BLAST which *only* have cryptic single letter command line options (case sensitive), maybe we should be using a sensible human readable name for the associated property in the Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size" for "-A"). Having actually now tried using properties "a" and "A", the resulting python code is very cryptic - and only makes sense if you are familiar with the blastall arguments (and given there are so many of them, this is difficult!). 
It should be trivial to extend the documentation strings automatically to include something like "This maps onto the XXX command line argument" so that the mapping is clear to the user without having to look at our source code. Hopefully this gets the balance right between giving nice python code, and staying faithful to the actual command line tool API. Peter From biopython at maubp.freeserve.co.uk Wed May 13 11:15:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 May 2009 12:15:35 +0100 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> <20090505123656.GB15113@sobchak.mgh.harvard.edu> <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com> <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com> Message-ID: <320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com> On Wed, May 13, 2009 at 11:50 AM, Cymon Cox wrote: >> On Tue, May 5, 2009, Peter wrote: >> > ... >> > I favour using only a single property for each parameter, with the >> > name as similar as possible to the actual command line switch (i.e. >> > property name "a" for "-a", not "nprocessors"). Note each property >> > would have a docstring which will say what it is for ("Number of >> > processors to use."). >> >> I still favour only using a single python property for each parameter, > > A confusing issue arises where we have alternative names for options.
> Take the following example from _Probcons.py:
>
>     _Option(["-c", "c", "--consistency", "consistency" ], ["input"],
>             lambda x: x in range(0,6),
>             0,
>             "Use 0 <= REPS <= 5 (default: 2) passes of consistency transformation",
>             0),
>
> >>> cmd = cmdline = ProbconsCommandline("probcons", input="blah")
> >>> cmd.c = 1
> >>> str(cmd)
> 'probcons blah '
> >>> cmd.set_parameter("c", 1)
> >>> str(cmd)
> 'probcons -c 1 blah '
> >>> cmd.consistency = 2
> >>> str(cmd)
> 'probcons -c 2 blah '
> >>> cmd.c = 5
> >>> str(cmd)
> 'probcons -c 2 blah '
>
> That is, the user needs to look at the code to figure out what the correct > name is to use when assigning to the property. Is it possible to restrict > the binding of attributes to the cmdline to only valid property names? An > alternative would be to restrict all parameters to only one name and > document the alternatives it covers (don't like this idea - see below). Yes, you can use any of the defined aliases with set_parameter, and they are all equally valid, and all do exactly the same thing. e.g.
cmd = ProbconsCommandline("probcons", input="blah")
cmd.set_parameter("c", 1)
cmd.set_parameter("-c", 1)
cmd.set_parameter("--consistency", 1)
cmd.set_parameter("consistency", 1)
I would however regard set_parameter as a legacy method and push the (single) keyword argument or property alternative, for which there is only one name (here "consistency"):
cmd = ProbconsCommandline("probcons", input="blah")
cmd.consistency = 1
or,
cmd = ProbconsCommandline("probcons", input="blah", consistency=1)
[And yes, we should have some error checking code in the base class __init__ method to make sure the string used is a valid python identifier.] The user does NOT have to look at the source code to find this out - just the docstrings or properties - try help(cmd) or dir(cmd) in python. >> but after some work on the blastall wrapper last night, I am >> beginning to come round to your point of view. >> >> If a command line tool provides a long parameter name (some tools >> provide both short and long names for important parameters) we >> should use that rather than inventing our own [so no change here].
>> >> However, for tools like BLAST which *only* have cryptic single letter >> command line options (case sensitive), maybe we should be using >> a sensible human readable name for the associated property in the >> Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size" >> for "-A"). Having actually now tried using properties "a" and "A", >> the resulting python code is very cryptic - and only makes sense >> if you are familiar with the blastall arguments (and given there are >> so many of them, this is difficult!). > > I don't agree. If you want to make your python code legible to people > who are not familiar with the command line options, you can just > comment it. I think the interfaces should stick as close as possible > to the application documentation. I see these interfaces being used > mostly by people who are familiar with the applications, in which case > the command line construction should be fairly intuitive. Well, I am on the fence here. The trouble is that sometimes (e.g. BLAST) the command line parameters themselves are just so cryptic. Yes, we could just use "a" and "A", and leave it up to the user to document their code. If we use "nprocessors" and "window_size" the code becomes self documenting (although you have to know Biopython's mapping). Brad's suggestion to support both in the property and keyword arguments brings us back to having multiple choices on how to set a parameter (as in set_parameter with its aliases), which is confusing and unpythonic.
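A minimal sketch of the "one property name, many legacy aliases" idea discussed above. The Option and Commandline classes here are hypothetical stand-ins written for illustration, not the actual Bio.Application code, and the choice of "first alias is the switch, last alias is the property name" is an assumption taken from Brad's description of the existing convention. Aliases still work through set_parameter, only the single property name can be assigned, and each option's docstring names the switch it maps onto:

```python
class Option:
    """Hypothetical option definition (illustrative, not Biopython's _Option)."""

    def __init__(self, names, description):
        self.names = list(names)      # e.g. ["-c", "c", "--consistency", "consistency"]
        self.description = description
        self.value = None

    @property
    def switch(self):
        # Assumption: the first alias is the real command line switch.
        return self.names[0]

    @property
    def property_name(self):
        # Assumption: the last alias is the single Python-facing name.
        return self.names[-1]

    @property
    def doc(self):
        # Docstring text that makes the mapping explicit to the user.
        return "%s This maps onto the %s command line argument." % (
            self.description, self.switch)


class Commandline:
    """Hypothetical wrapper: aliases via set_parameter, one name via attributes."""

    def __init__(self, program, options, **kwargs):
        # Bypass our own __setattr__ for the two internal attributes.
        object.__setattr__(self, "program", program)
        object.__setattr__(self, "options", list(options))
        for name, value in kwargs.items():
            setattr(self, name, value)

    def set_parameter(self, name, value):
        # Legacy style: any alias is accepted.
        for opt in self.options:
            if name in opt.names:
                opt.value = value
                return
        raise ValueError("Option name %s was not found." % name)

    def __setattr__(self, name, value):
        # Property style: only the single preferred name is accepted,
        # so a typo or an alias fails loudly instead of being ignored.
        for opt in self.options:
            if name == opt.property_name:
                opt.value = value
                return
        raise AttributeError("%r is not a valid property name" % name)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        for opt in self.__dict__.get("options", []):
            if name == opt.property_name:
                return opt.value
        raise AttributeError(name)

    def __str__(self):
        parts = [self.program]
        for opt in self.options:
            if opt.value is not None:
                parts.append("%s %s" % (opt.switch, opt.value))
        return " ".join(parts)


consistency = Option(["-c", "c", "--consistency", "consistency"],
                     "Number of passes of consistency transformation.")
cmd = Commandline("probcons", [consistency])
cmd.consistency = 2
print(cmd)
```

With this shape, assigning to an alias such as cmd.c raises AttributeError rather than silently creating an unrelated attribute, which addresses the surprise in Cymon's example.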
Peter From cy at cymon.org Wed May 13 10:50:54 2009 From: cy at cymon.org (Cymon Cox) Date: Wed, 13 May 2009 11:50:54 +0100 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> <20090505123656.GB15113@sobchak.mgh.harvard.edu> <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com> Message-ID: <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com> 2009/5/13 Peter > On Mon, May 4, 2009, Peter wrote: > >>> ... The (hardly used) existing blastall wrapper in > >>> Bio/Blast/Applications.py gives the "-a" argument a human > >>> readable name of "nprocessors", and "-A" gets "window_size". > >>> With the old set_parameter call either alias could be used. > >>> However, with a python property we need to pick one as a > >>> preferred name - and I'm not 100% sure being helpful and > >>> using "nprocessors" (e.g. cline.nprocessors=4) is actually > >>> better than using the actual argument name (e.g. cline.a = 4). > > On Tue, May 5, 2009, Brad wrote: > >> Could we support both the original argument and optional human > >> readable arguments? I know the code in Application is a bit > >> hard coded for the first argument as the real name and the last > >> argument as the readable name; the cleanest solution would be to > >> generalize this to have multiple names where it makes sense. > >> ... > > On Tue, May 5, 2009, Peter wrote: > > ... > > I favour using only a single property for each parameter, with the > > name as similar as possible to the actual command line switch (i.e. > > property name "a" for "-a", not "nprocessors"). Note each property > > would have a docstring which will say what is it for ("Number of > > processors to use."). > > I still favour only using a single python property for each parameter, A confusing issue arises where we have alternative names for options. 
Take the following example from _Probcons.py:
    _Option(["-c", "c", "--consistency", "consistency" ], ["input"],
            lambda x: x in range(0,6),
            0,
            "Use 0 <= REPS <= 5 (default: 2) passes of consistency transformation",
            0),
>>> cmd = cmdline = ProbconsCommandline("probcons", input="blah")
>>> cmd.c = 1
>>> str(cmd)
'probcons blah '
>>> cmd.set_parameter("c", 1)
>>> str(cmd)
'probcons -c 1 blah '
>>> cmd.consistency = 2
>>> str(cmd)
'probcons -c 2 blah '
>>> cmd.c = 5
>>> str(cmd)
'probcons -c 2 blah '
That is, the user needs to look at the code to figure out what the correct name is to use when assigning to the property. Is it possible to restrict the binding of attributes to the cmdline to only valid property names? An alternative would be to restrict all parameters to only one name and document the alternatives it covers (don't like this idea - see below). > but after some work on the blastall wrapper last night, I am > beginning to come round to your point of view. > > If a command line tool provides a long parameter name (some tools > provide both short and long names for important parameters) we > should use that rather than inventing our own [so no change here]. > > However, for tools like BLAST which *only* have cryptic single letter > command line options (case sensitive), maybe we should be using > a sensible human readable name for the associated property in the > Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size" > for "-A"). Having actually now tried using properties "a" and "A", > the resulting python code is very cryptic - and only makes sense > if you are familiar with the blastall arguments (and given there are > so many of them, this is difficult!). I don't agree. If you want to make your python code legible to people who are not familiar with the command line options, you can just comment it. I think the interfaces should stick as close as possible to the application documentation.
I see these interfaces being used mostly by people who are familiar with the applications, in which case the command line construction should be fairly intuitive. Cheers, C. -- From biopython at maubp.freeserve.co.uk Wed May 13 13:10:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 May 2009 14:10:59 +0100 Subject: [Biopython-dev] Properties names in command line wrappers In-Reply-To: <320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com> References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> <20090505123656.GB15113@sobchak.mgh.harvard.edu> <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com> <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com> <320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com> Message-ID: <320fb6e00905130610g3eb8edb4q99913b8b0ae14bf9@mail.gmail.com> On Wed, May 13, 2009 at 12:15 PM, Peter wrote: > > The user does NOT have to look at the source code to find this out - > just the docstrings or properties - try help(cmd) or dir(cmd) in python. > I've just updated the automatically generated docstrings so that each property's docstring includes the actual parameter name which will be used to build the string. Peter From bugzilla-daemon at portal.open-bio.org Wed May 13 15:01:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 13 May 2009 11:01:33 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905131501.n4DF1XYv019413@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #35 from cymon.cox at gmail.com 2009-05-13 11:01 EST ------- I've added some very basic unittests for the command line interfaces, which don't require the applications to be installed. test_Application_Commandlines.py - currently it only includes Bio/Align/Applications but Bio/Emboss tests could be added.
Note that the _Mafft.py command line interface is currently broken due to the restriction of only having a single instance of a parameter on the command line. Mafft uses the following option: --seed alignment1 [--seed alignment2 --seed alignment3 ...] We could remove support for this option in Mafft. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed May 13 15:23:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 13 May 2009 11:23:34 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200905131523.n4DFNYX7021233@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #36 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-13 11:23 EST ------- (In reply to comment #35) > > Note that the _Mafft.py command line interface is currently broken due to the > restriction of only having a single instance of a parameter on the command line. > Mafft uses the following option: > > --seed alignment1 [--seed alignment2 --seed alignment3 ...] > > We could remove support for this option in Mafft. Removing the --seed argument might be a pragmatic short term solution. I'd considered this type of thing as a possible corner case - but hadn't mentioned it as I didn't have a concrete example. I would suggest setting the parameter value to a list could work: i.e. Support any of:
cline = MafftCommandline(seed=["alignment1", "alignment2", "alignment3"])
cline.set_parameter("seed", ["alignment1", "alignment2", "alignment3"])
cline.seed = ["alignment1", "alignment2", "alignment3"]
giving:
mafft --seed alignment1 --seed alignment2 --seed alignment3
We'd need to introduce a new _Option subclass for this.
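A rough sketch of how a list-valued option could expand into a repeated switch, as suggested above for --seed. The function names render_option and build_command are hypothetical helpers for illustration only and are not part of Bio.Application; a real fix would presumably live in the proposed _Option subclass:

```python
def render_option(switch, value):
    """Return the command line tokens for one option.

    A list or tuple value repeats the switch once per entry; None means
    the option was not set and contributes nothing.
    """
    if value is None:
        return []
    values = value if isinstance(value, (list, tuple)) else [value]
    tokens = []
    for item in values:
        tokens.extend([switch, str(item)])
    return tokens


def build_command(program, options):
    """Join the program name and (switch, value) pairs into one string."""
    tokens = [program]
    for switch, value in options:
        tokens.extend(render_option(switch, value))
    return " ".join(tokens)


print(build_command("mafft",
                    [("--seed", ["alignment1", "alignment2", "alignment3"])]))
```

A single (non-list) value still produces one switch, so existing single-valued options would behave as before under this scheme.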
A similar situation applies to optional argument lists, like the Unix zip command: zip zipfile file1 file2 file3 ... where there is a single output filename (here zipfile), and then one or more input files or filespecifiers (here three entries). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From winda002 at student.otago.ac.nz Thu May 14 04:53:42 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 14 May 2009 16:53:42 +1200 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files Message-ID: <4A0BA3D6.5070207@student.otago.ac.nz> I have been slowly adding some of the scripts I use most commonly to the cookbook section of the wiki (http://biopython.org/wiki/Category:Cookbook). Since I'm very much a dilettante at this programming business and the cookbook is meant as supplementary documentation for Biopython it's probably a good idea for someone that knows what they are doing to look at these things (Peter has been really helpful with this thus far, but it seems unfair to saddle one man with so much bad programming :) I've just added a recipe that uses the nexus class to concatenate multiple nexus files and provide some feedback if the taxa are not the same in each one: http://biopython.org/wiki/Concatenate_nexus Any thoughts? If you think you can make it clearer/quicker/better then you can edit it on the wiki or provide comments here or there.
Cheers, David From biopython at maubp.freeserve.co.uk Thu May 14 09:27:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 10:27:12 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <4A0BA3D6.5070207@student.otago.ac.nz> References: <4A0BA3D6.5070207@student.otago.ac.nz> Message-ID: <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> On Thu, May 14, 2009 at 5:53 AM, David Winter wrote: > > I have been slowly adding some of the scripts I use most commonly to the > cookbook section of the wiki (http://biopython.org/wiki/Category:Cookbook). > Since I'm very much a dilettante at this programming business and the > cookbook is meant as supplementary documentation for Biopython it's probably > a good idea for someone that knows what they are doing to look at these > things (Peter has been really helpful with this thus far, but it seems > unfair to saddle one man with so much bad programming :) > > I've just added a recipe that uses the nexus class to concatenate multiple > nexus files and provide some feedback if the taxa are not the same in each > one: http://biopython.org/wiki/Concatenate_nexus > > Any thoughts? If you think you can make it clearer/quicker/better then you > can edit it on the wiki or provide comments here or there. What exactly are you trying to achieve? A big Nexus file with lots of alignments (and trees) in it? When I talked to Frank about Nexus files, he said they should only ever hold one alignment matrix, hence Bio.AlignIO does not allow writing multiple alignments to a single Nexus file. If you have some real world examples of Nexus files holding more than one alignment matrix, please share them - then we can try and get Bio.AlignIO (and if need be Bio.Nexus) to cope with them directly!
Peter From cy at cymon.org Thu May 14 09:59:51 2009 From: cy at cymon.org (Cymon Cox) Date: Thu, 14 May 2009 10:59:51 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> Message-ID: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> 2009/5/14 Peter > On Thu, May 14, 2009 at 5:53 AM, David Winter > wrote: > > > > I have been slowly adding some of the scripts I use most commonly to the > > cookbook section of the wiki ( > http://biopython.org/wiki/Category:Cookbook). > > Since I'm very much a dilettante at this programming business as the > > cookbook is meant as supplementary documentation for Biopython it's > probably > > a good idea for someone that knows what they are doing to look at these > > things (Peter has been really helpful with this thus far, but is seems > > unfair to saddle one man with so much bad programming :) > > > > I've just added a recipe that uses the nexus class to concatenate > multiple > > nexus files and provide some feedback if the taxa are not the same in > each > > one: http://biopython.org/wiki/Concatenate_nexus > > > > Any thoughts? If you think you can make it clearer/quicker/better then > you > > can edit it on the wiki or provide comments here of there. > > What exactly are you trying to achieve? A big Nexus files with lots > of alignments (and trees) in it? The example David has given is very useful and a common procedure for phylogeneticists. Single gene/proteins tend to be aligned in separate alignment files and the concatenated into a so-called 'supermatrix'. 
One thing I would question is the first line: "It's a good idea, if possible, to make species-level phylogenetic inferences bases on multiple genes because a) demographic processes can lead gene-trees to diverge from species trees and b) journal editors now this." Yes, it is a good idea to make inferences based upon the largest amount of data, but if demographic processes have led to some gene(s) that have diverged from the species tree, then this is a reason not to combine them. Phylogenetic inference assumes all data evolved on the same tree - typically one would analyse gene partitions individually to look for incongruence among partitions before combining the data. > When I talked to Frank about Nexus files, he said they should only > ever hold one alignment matrix, Well, that was my understanding as well. But, it may be wrong. I just tried it - p4 will read both matrices no problem, PAUP* (the de facto standard here) will execute both matrices ok presumably leaving just the last as the data in memory. I'll look into this further... Cheers C. -- ____________________________________________________________________ Cymon J.
Cox Centro de Ciencias do Mar Faculdade de Ciencias do Mar e Ambiente (FCMA) Universidade do Algarve Campus de Gambelas 8005-139 Faro Portugal Phone: +0351 289800909 ext 7909 Fax: +0351 289800051 Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com HomePage : http://biology.duke.edu/bryology/cymon.html -8.63/-6.77 From biopython at maubp.freeserve.co.uk Thu May 14 11:02:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 12:02:03 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> Message-ID: <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com> On Thu, May 14, 2009 at 10:59 AM, Cymon Cox wrote: >> What exactly are you trying to achieve? A big Nexus files with lots >> of alignments (and trees) in it? > > The example David has given is very useful and a common procedure for > phylogeneticists. Single gene/proteins tend to be aligned in separate > alignment files and the concatenated into a so-called 'supermatrix'. Oh right - I hadn't looked at David's example carefully enough earlier to work out which concatenation he was doing (by row or by column). It does make sense on re-reading. Concatenation to give a single supermatrix (same number of taxa, longer sequences) would be most elegantly done by sorting the three alignments (so the taxa are in the same order) and then concatenating them (by column). See Bug 2552, http://bugzilla.open-bio.org/show_bug.cgi?id=2552 Note that this procedure isn't specific to NEXUS files - you could do this with any alignment format. It is just fairly straight forward with the Bio.Nexus module at the moment (at least, until we fix Bug 2552).
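As a rough illustration of the by-column concatenation described above, here is a minimal sketch using plain dictionaries rather than any Biopython class. The function name concatenate_by_column and the "taxon name -> aligned string" layout are assumptions made for this example; they are not part of Bio.Nexus or Bio.AlignIO.

```python
def concatenate_by_column(alignments):
    """Join pre-aligned genes column-wise into one supermatrix.

    Hypothetical helper: each alignment is a dict mapping a taxon name
    to its aligned sequence string. Returns the supermatrix (same dict
    layout) plus a list of 1-based (start, end) column ranges, one per
    input alignment, which could serve as character partitions.
    """
    # Sorting the taxa gives every alignment the same row order.
    taxa = sorted(alignments[0])
    supermatrix = dict((taxon, "") for taxon in taxa)
    partitions = []
    start = 0
    for aln in alignments:
        if sorted(aln) != taxa:
            raise ValueError("taxa differ between alignments")
        lengths = set(len(seq) for seq in aln.values())
        if len(lengths) != 1:
            raise ValueError("rows in an alignment must have equal length")
        length = lengths.pop()
        # Append this gene's columns to the end of each taxon's row.
        for taxon in taxa:
            supermatrix[taxon] += aln[taxon]
        partitions.append((start + 1, start + length))
        start += length
    return supermatrix, partitions


gene1 = {"A": "ACGT", "B": "AC-T"}
gene2 = {"B": "GG", "A": "GA"}
matrix, parts = concatenate_by_column([gene1, gene2])
```

Raising an error when the taxon sets differ is only one possible choice; the cookbook recipe instead reports the mismatch back to the user.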
Peter From biopython at maubp.freeserve.co.uk Thu May 14 11:11:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 12:11:30 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com> Message-ID: <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com> On Thu, May 14, 2009 at 12:02 PM, Peter wrote: > Oh right - I hadn't looked at David's example carefully enough earlier > to work out which concatenation he was doing (by row or by column). > It does make sense on re-reading. I'd rephrase this bit of the intro: It's a good idea, if possible, to make species-level phylogenetic inferences bases on multiple genes because a) demographic processes can lead gene-trees to diverge from species trees and b) journal editors now this. Most of the alignment files supported by Biopython allow you to write multiple alignments to the same file which makes this easy. However, the nexus file format (used by PAUP* and Mr Bayes) does not. In nexus files multiple alignments need to be represented as different 'character partitions' within a data matrix that contains one long sequence for each taxon. Bio.AlignIO will in general write out one or more alignments to a file. It does NOT do any concatenation by column, required to give the "supermatrix" which you want (which is why I get confused on the first reading). How about: It's a good idea, if possible, to make species-level phylogenetic inferences bases on multiple genes because (a) demographic processes can lead gene-trees to diverge from species trees and (b) journal editors know this. [add stuff from Cymon's comment here?] 
This is usually handled by creating a single "supermatrix" from separate alignments for each gene. i.e. You need a single alignment containing one row for each taxon where the rows are the concatenated pre-aligned sequences. In NEXUS files (used by PAUP* and Mr Bayes) multiple alignments can be explicitly represented as different 'character partitions' within a data matrix that contains one long sequence for each taxon. The Bio.Nexus module makes this relatively straight forward. Peter From cy at cymon.org Thu May 14 11:30:20 2009 From: cy at cymon.org (Cymon Cox) Date: Thu, 14 May 2009 12:30:20 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> Message-ID: <7265d4f0905140430j47b0a661jd58dbe5749e4a1f7@mail.gmail.com> 2009/5/14 Cymon Cox > 2009/5/14 Peter > >> When I talked to Frank about Nexus files, he said they should only >> ever hold one alignment matrix, > > > Well, that was my understanding as well. But, it may be wrong. I just tried > it - p4 will read both matrices no problem, PAUP* (the de facto standard > here) will execute both matrices ok presumably leaving just the last as the > data in memory. > > I'll look into this further... > After a quick scan of the spec, there appears to be only one oblique reference to this issue: "Although the NEXUS standard does not impose constraints on the number of blocks, particular programs will. For example, MacClade 3.07 does not allow more than one TAXA block in a file." So I read that to mean, you can have any number of similarly named blocks in a NEXUS file, ie multiple DATA, TAXA, CHARACTERS, TREES etc, and its up to an individual application to decide how to deal with them. 
This seems to be in practice what happens: PAUP* will read multiple blocks of the same name but only the last block of a particular name will remain in memory after the file has been parsed. On the other hand, P4 will read multiple DATA blocks and store the different alignments as separate objects, and read multiple TREES blocks and store all the trees. C. From biopython at maubp.freeserve.co.uk Thu May 14 18:20:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 19:20:47 +0100 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL Message-ID: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> Hi, This is cross-posted between biopython-dev and biosql-l as it regards parsing the description (DE) lines in SwissProt files and how they are stored in BioSQL. This follows from an earlier discussion on biopython-dev. Older SwissProt files just had one or two DE lines, and it made sense to treat this as a simple string mapped onto the description field in the bioentry table in BioSQL. This appears to be what happens with BioPerl 1.5.x and in Biopython (although the details regarding white space differ). However, newer SwissProt files have many DE lines with additional structure.
The example Michiel gave earlier on the biopython-dev list was: http://www.uniprot.org/uniprot/Q9XHP0.txt This has the following DE lines: DE RecName: Full=11S globulin seed storage protein 2; DE AltName: Full=11S globulin seed storage protein II; DE AltName: Full=Alpha-globulin; DE Contains: DE RecName: Full=11S globulin seed storage protein 2 acidic chain; DE AltName: Full=11S globulin seed storage protein II acidic chain; DE Contains: DE RecName: Full=11S globulin seed storage protein 2 basic chain; DE AltName: Full=11S globulin seed storage protein II basic chain; DE Flags: Precursor; I had to fight with perl to get my old copy of BioPerl working again (some weak reference thing), but I managed, and then loaded this file into my test BioSQL database with: $ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass XXX --namespace biosql_test --format swiss Q9XHP0.txt Then I looked at the resulting description in the main bioentry table: $ mysql --user=root -p biosql_test -e 'SELECT description FROM bioentry WHERE accession="Q9XHP0";' This is stored as one huge long string (without the newlines, I'm not sure if BioPerl strips those in parsing the file, or when loading it into the database): RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S globulin seed storage protein II; AltName: Full=Alpha-globulin; Contains: RecName: Full=11S globulin seed storage protein 2 acidic chain; AltName: Full=11S globulin seed storage protein II acidic chain; Contains: RecName: Full=11S globulin seed storage protein 2 basic chain; AltName: Full=11S globulin seed storage protein II basic chain; Flags: Precursor; For Biopython, I emptied the database then did: >>> from Bio import SeqIO >>> from BioSQL import BioSeqDatabase >>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd = "XXX", host = "localhost", db="biosql_test") >>> db = server["biosql-test"] #namespace >>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss")) 1 >>>
server.commit() As before, I looked in the table with mysql. Again - this stores the full description from the DE line, although with the newlines embedded. So, Biopython is consistent with my old copy of BioPerl (1.5.x) if we ignore the white space. However, how does this look in BioPerl 1.6? If this is the same, are there any plans to change this? For Biopython we have discussed recording most of the DE information under the annotations instead (keyed off RecName, AltName, Contains, Flags), but I would like to be consistent with BioPerl+BioSQL. Thanks Peter From winda002 at student.otago.ac.nz Thu May 14 22:39:34 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 15 May 2009 10:39:34 +1200 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com> <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com> Message-ID: <4A0C9DA6.9060403@student.otago.ac.nz> Peter wrote: > On Thu, May 14, 2009 at 12:02 PM, Peter wrote: > >> Oh right - I hadn't looked at David's example carefully enough earlier >> to work out which concatenation he was doing (by row or by column). >> It does make sense on re-reading. >> Well, just about ;) > > I'd rephrase this bit of the intro: > Yep, that's much better. Thanks Peter and Cymon for your feedback on this, I've updated the intro to include it and a couple of specific examples of how you'd use the character partitions. (Have you guys seen this: doi.wiley.com/10.1111/j.1755-0998.2008.02164.x , you could write a paper from one function in your nexus module!) 
cheers, david From biopython at maubp.freeserve.co.uk Fri May 15 09:05:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 15 May 2009 10:05:59 +0100 Subject: [Biopython-dev] Cookbook entry, concatenating nexus files In-Reply-To: <4A0C9DA6.9060403@student.otago.ac.nz> References: <4A0BA3D6.5070207@student.otago.ac.nz> <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com> <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com> <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com> <4A0C9DA6.9060403@student.otago.ac.nz> Message-ID: <320fb6e00905150205k31d95c84naac1fa7873461263@mail.gmail.com> On Thu, May 14, 2009 at 11:39 PM, David Winter wrote: >> >> I'd rephrase this bit of the intro: >> > > Yep, that's much better. Thanks Peter and Cymon for your feedback on this, > I've updated the intro to include it and a couple of specific examples of > how you'd use the character partitions. That does look much clearer now :) Could you include the three original alignments in the text? It would help to let the reader see what is going on (and could be used to reproduce the example). > (Have you guys seen this: doi.wiley.com/10.1111/j.1755-0998.2008.02164.x , > you could write a paper from one function in your nexus module!) From the abstract that does sound pretty trivial, but I guess that tool would be useful for non-programmers - even if you could probably rewrite it as one short python script using Biopython (or indeed a Perl script using BioPerl etc).
Peter From bugzilla-daemon at portal.open-bio.org Sat May 16 00:24:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 15 May 2009 20:24:29 -0400 Subject: [Biopython-dev] [Bug 2829] New: Biosequence.alphabet can be set to unknown after loading a nucleotide SeqRecord Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2829 Summary: Biosequence.alphabet can be set to unknown after loading a nucleotide SeqRecord Product: Biopython Version: 1.49 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk Hi I have done the following 1 loaded a small nucleotide fasta file with SeqIO, setting the alphabet successfully 2 written it to a test database with BioSQL 3 reloaded it, at which point the reloaded object has a "SingleLetterAlphabet" alphabet and biosequence.alphabet is set to unknown. Is this expected? The overall object was to add some SeqFeatures to the loaded SeqRecord, but it doesn't seem to store correctly even without any manipulations. Below demonstrates the problem. The system is Ubuntu 9 x64/ Python 2.6/ Biopython 1.49. 
#!/usr/bin/env python
from BioSQL import BioSeqDatabase
from Bio.Alphabet import generic_nucleotide
from Bio import SeqIO
from Bio import Seq

# define variables needed for testing
username="myusername"
password="mypassword"
hostname="localhost"

# we are going to try to load a nucleotide fasta file into a BioSQL database
# need a test file, with inputfile the file name;
#>test_sequence
#ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcggtcgtctccgaactt
inputfile="/home/dwyllie/test.faa"

# we want to create a new BioSQL database, called test
dbname="test"
dbdescription="test of alphabet storage"

# we also want to remove one if it exists, for the purposes of testing
server = BioSeqDatabase.open_database(driver="MySQLdb", db="bioseqdb", user=username, passwd=password, host=hostname)

# if the database doesn't exist, we get an error, so we trap for that
try:
    server.remove_database(dbname)
    server.adaptor.commit()
except KeyError:
    print "Attempt to remove ",dbname," failed; going on to create a new one"

server = BioSeqDatabase.open_database(driver="MySQLdb", db="bioseqdb", user=username, passwd=password, host=hostname)
db = server.new_database(dbname, description=dbdescription)
server.adaptor.commit()

# set up a list to hold the mycobacterial sequences
selectedrecords = [] # Setup an empty list which we'll later write

# ifh is the input file handle;
ifh = open(inputfile, "rU")

# set a counter
recordsread=0
for record in SeqIO.parse(ifh, "fasta", generic_nucleotide):
    # increment counter
    recordsread=recordsread+1
    # just so we can reload it easily, we'll assign an id to this record
    # however, the problem does not depend on this,
    # nor on the nature of the defline, as far as I can tell
    record.id="IDENTIFIER_"+str(recordsread)
    print "** Note the sequence type of the Seq ** "
    print record
    # note that to this point it does appear to work, and the alphabet is correct.
    selectedrecords.append(record)

print inputfile, "total found ", recordsread
ifh.close()

# write it to the bioSQL database
print "Writing sequences to database"
db.load(selectedrecords)
server.adaptor.commit()

# subsequent attempts to write the re-loaded object fail because no alphabet is defined
print "However, the alphabet hasn't been stored."
loadedrecord=db.lookup(gi="IDENTIFIER_1")
print "Displaying re-loaded record"
print loadedrecord

# this can be confirmed by running
sqlcmd="""
select * from bioseqdb.biosequence, bioseqdb.bioentry, bioseqdb.biodatabase
where biodatabase.biodatabase_id= bioentry.biodatabase_id
and biosequence.bioentry_id=bioentry.bioentry_id
and biodatabase.name="test"
"""
print "This can be confirmed by examining bioseqdb.biosequence.alphabet, which is set to unknown; ", sqlcmd

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat May 16 11:37:52 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 16 May 2009 07:37:52 -0400 Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic nucleotide alphabet In-Reply-To: Message-ID: <200905161137.n4GBbqKe018688@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2829 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|normal |enhancement Summary|Biosequence.alphabet can be |BioSQL does not record a |set to unknown after loading|generic nucleotide alphabet |a nucleotide SeqRecord | ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-16 07:37 EST ------- Biopython has a relatively rich range of alphabets, including IUPAC ambiguous and unambiguous alphabets, plus ways to indicate gap characters and stop symbols.
The BioSQL range is much simpler, so some information is inevitably lost. In BioSQL, all we store is a simple string, "dna", "rna", "protein" or "unknown" (although BioJava used uppercase, so that is effectively allowed too). See: http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet This means if your sequence was using "IUPAC extended protein with a * stop codon", all we can record is "protein". i.e. On retrieval from a BioSQL database, the alphabet is simply a generic protein. Likewise "ambiguous IUPAC DNA with minus as the gap character" just becomes generic DNA. Note that as far as I know, currently none of the Bio* languages attempt to record "nucleotide" (i.e. "dna" or "rna"). This is something we should discuss on the BioSQL mailing list as a possible enhancement. So in answer to your question "Is this expected?", yes, a generic nucleotide alphabet isn't "dna", "rna" or "protein" so is currently recorded in the BioSQL database as "unknown". This gets turned into the SingleLetterAlphabet on retrieval. Changing title to "BioSQL does not record a generic nucleotide alphabet" and marking this as an enhancement. Peter P.S. Are you just testing here, or do you really not know if your sequence is DNA or RNA? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
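The lossy mapping described in Comment #1 above can be sketched roughly as follows. This is only an illustration in plain Python: the function names to_biosql_alphabet and from_biosql_alphabet, and the use of free-text alphabet descriptions, are assumptions made for the example and are not the actual BioSQL loader code in Biopython.

```python
# The only values the biosequence.alphabet column is described as holding.
BIOSQL_ALPHABETS = ("dna", "rna", "protein", "unknown")

# What a stored value is described as becoming on retrieval.
RESTORED = {
    "dna": "generic DNA",
    "rna": "generic RNA",
    "protein": "generic protein",
    "unknown": "single letter",
}


def to_biosql_alphabet(description):
    """Collapse a rich alphabet description to a BioSQL string.

    Hypothetical helper: looks for the words dna, rna or protein in a
    free-text description; anything else (including a generic
    nucleotide alphabet) is recorded as "unknown".
    """
    words = description.lower().split()
    for name in ("dna", "rna", "protein"):
        if name in words:
            return name
    return "unknown"


def from_biosql_alphabet(value):
    """Map a stored string back to a generic alphabet description."""
    # Lower-casing accepts the uppercase values BioJava used.
    return RESTORED.get(value.lower(), "single letter")


stored = to_biosql_alphabet("generic nucleotide")
restored = from_biosql_alphabet(stored)
```

Under these assumptions a generic nucleotide alphabet round-trips to the single letter alphabet, which matches the behaviour reported in the bug.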
From bugzilla-daemon at portal.open-bio.org Sat May 16 11:54:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 16 May 2009 07:54:11 -0400 Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic nucleotide alphabet In-Reply-To: Message-ID: <200905161154.n4GBsBWZ019474@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2829 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-16 07:54 EST ------- See: http://lists.open-bio.org/pipermail/biosql-l/2009-May/001515.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bartek at rezolwenta.eu.org Sat May 16 17:39:18 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Sat, 16 May 2009 19:39:18 +0200 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com> <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com> <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com> <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com> Message-ID: <8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com> On Tue, May 12, 2009 at 8:57 PM, Peter wrote: > > I'm not happy with the current github repository due to the history > tag issue - but we know we can fix that now. 
Are you going to try > removing the old tags and re-doing them on github? I've finally found some time for it and fixed the tags in the main repository. I was able to run the update and it ran ok, I was also able to clone the repo from the official branch and see that they are OK in gitx. If anyone has problems with the tags, please let me know. > > Does anyone know how the git provided "ViewCVS" equivalent shows tags > in a file's history? If you are talking about gitweb, you can see it (for example: Makefile for linux 2.6.17) here: http://git.kernel.org/?p=linux/kernel/git/chrisw/linux-2.6.17.y.git;a=history;f=Makefile;h=79072d86297e78406791f0fc5764c35eb04fd07d;hb=78ace17e51d4968ed2355e8f708d233d1cc37f6d I've also installed gitweb on a copy of biopython repo on my server (not a permanent URL, not updated from trunk) http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=tree;hb=HEAD It shows the tags, but (as usually with git), the tags are only shown for the files which were affected by the particular commit marked with the tag. So this behavior is consistent with kernel.org and github. cheers Bartek From biopython at maubp.freeserve.co.uk Sat May 16 20:35:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 16 May 2009 21:35:36 +0100 Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com> <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com> <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com> <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com> <8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com> Message-ID: <320fb6e00905161335i28be05fay848dc18f86e728cf@mail.gmail.com> On 5/16/09, Bartek Wilczynski wrote: > On Tue, May 12, 2009 at 8:57 PM, Peter wrote: > > > > I'm not happy with the current github repository due to the history > > tag issue - but we know we can fix that now. Are you going to try > > removing the old tags and re-doing them on github? > > I've finally found some time for it and fixed the tags in the main repository. Great :) > I was able to run the update and it ran ok, I was also able to clone the repo > from the official branch and see that they are OK in gitx. If anyone > has problems with the tags, please let me know. I'll check with my Mac on Monday. > > Does anyone know how the git provided "ViewCVS" equivalent shows > > tags in a file's history? 
> > If you are talking about gitweb, you can see it (for example: Makefile > for linux 2.6.17) here: > > http://git.kernel.org/?p=linux/kernel/git/chrisw/linux-2.6.17.y.git;a=history;f=Makefile;h=79072d86297e78406791f0fc5764c35eb04fd07d;hb=78ace17e51d4968ed2355e8f708d233d1cc37f6d > > I've also installed gitweb on a copy of biopython repo on my server > (not a permanent URL, not updated from trunk) > http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=tree;hb=HEAD > > It shows the tags, but (as usually with git), the tags are only shown > for the files which were affected by the particular commit marked with > the tag. So this behavior is consistent with kernel.org and github. Thanks for those examples. I see what you mean, looking at Bio/Blast/NCBIXML.py in gitweb for example, no tags show up at all. On the other hand, for the NEWS file, some tags show up. Basically for what I want to use the tags for (identifying changes to a single file between two releases), gitweb doesn't work. Nor does github's history. This is a shame. I think the reason CVS (or SVN) seem to work better in this regard is like python they care about individual files, while git works in terms of changes (which may affect multiple files). I'll see how I get on with the command line or graphical git history viewers and get back to you... Cheers, Peter From hlapp at gmx.net Sat May 16 22:34:57 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 18:34:57 -0400 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> Message-ID: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> Don't you love SwissProt (or UniProt as we must call it now I suppose). They (understandably) try to squeeze ever more annotation into the existing tags, rather than adding new tags. 
So, of the following structure: DE RecName: Full=11S globulin seed storage protein 2; DE AltName: Full=11S globulin seed storage protein II; DE AltName: Full=Alpha-globulin; DE Contains: DE RecName: Full=11S globulin seed storage protein 2 acidic chain; DE AltName: Full=11S globulin seed storage protein II acidic chain; DE Contains: DE RecName: Full=11S globulin seed storage protein 2 basic chain; DE AltName: Full=11S globulin seed storage protein II basic chain; DE Flags: Precursor; really only the first line, with the 'RecName: Full=' removed, is the description line as we know it. The rest, I would say, is annotation, such as two alternative names, amino acid chains contained in the full record (shouldn't this be feature annotation, really? and indeed it is - why it needs to be repeated here is beyond me) and their names as well as alternative names, and the fact that the sequence is a precursor form. Leaving all this in one string has the advantage that we can round-trip it (and there is probably hardly any other way to accomplish that), but clearly in terms of semantics this isn't the sequence description as we know it anymore. Does anyone else think too that completely changing the semantics of sequence annotation fields is a bad idea? My inclination from a BioPerl perspective is to extract the part following 'RecName: Full=' as the description, and attach the rest as annotation. We could in fact use the TagTree class for this. I'm cross-posting to BioPerl too to gather what other BioPerl'ers think about this. -hilmar On May 14, 2009, at 2:20 PM, Peter wrote: > Hi, > > This is cross-posted between biopython-dev and biosql-l as it regards > parsing the description (DE) lines in SwissProt files and how they are > stored in BioSQL.
This follows from an earlier discussion on > biopython-dev > > Older SwissProt files just had one or two DE lines, and it made sense > to treat this as a simple string mapped onto the description field in > the bioentry table in BioSQL. This appears to what happens with > BioPerl 1.5.x and in Biopython (although the details regarding white > space differ). However, newer SwissProt files have many DE lines with > additional structure. The example Michiel gave earlier on the > biopython-dev list was: > > http://www.uniprot.org/uniprot/Q9XHP0.txt > > This has the following DE lines: > > DE RecName: Full=11S globulin seed storage protein 2; > DE AltName: Full=11S globulin seed storage protein II; > DE AltName: Full=Alpha-globulin; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 acidic chain; > DE AltName: Full=11S globulin seed storage protein II acidic > chain; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 basic chain; > DE AltName: Full=11S globulin seed storage protein II basic chain; > DE Flags: Precursor; > > I had to fight with perl to get my old copy of BioPerl working again > (some week reference thing), but I managed, and then loaded this file > into my test BioSQL database with: > > $ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass > XXX --namespace biosql_test --format swiss Q9XHP0.txt > > Then I looked at the resulting description in the main bioentry table: > > $ mysql --user=root -p biosql_test -e 'SELECT description FROM > bioentry WHERE accession="Q9XHP0";' > > This is stored as one huge long string (without the newlines, I'm not > sure if BioPerl strips those in parsing the file, or when loading it > into the database): > > RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S > globulin seed storage protein II; AltName: Full=Alpha-globulin; > Contains: RecName: Full=11S globulin seed storage protein 2 acidic > chain; AltName: Full=11S globulin seed storage protein II 
acidic
> chain; Contains: RecName: Full=11S globulin seed storage protein 2
> basic chain; AltName: Full=11S globulin seed storage protein II basic
> chain; Flags: Precursor;
>
> For Biopython, I emptied the database then did:
>
>>>> from Bio import SeqIO
>>>> from BioSQL import BioSeqDatabase
>>>> server = BioSeqDatabase.open_database(driver="MySQLdb",
>>>> user="root", passwd = "XXX", host = "localhost", db="biosql_test")
>>>> db = server["biosql-test"] #namespace
>>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
> 1
>>>> server.commit()
>
> As before, I looked in the table with mysql. Again - this stores the
> full description from the DE line, although with the newlines
> embedded. So, Biopython is consistent with my old copy of BioPerl
> (1.5.x) if we ignore the white space.
>
> However, how does this look in BioPerl 1.6? If this is the same, are
> there any plans to change this? For Biopython we have discussed
> recording most of the DE information under the annotations instead
> (keyed off RecName, AltName, Contains, Flags), but I would like to be
> consistent with BioPerl+BioSQL.
> > Thanks > > Peter > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Sat May 16 23:14:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 00:14:54 +0100 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> Message-ID: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> On 5/16/09, Hilmar Lapp wrote: > > Don't you love SwissProt (or UniProt as we must call it now I suppose). > They (understandably) try to squeeze ever more annotation into the existing > tags, rather than adding new tags. > > So, of the following structure: > > DE RecName: Full=11S globulin seed storage protein 2; > DE AltName: Full=11S globulin seed storage protein II; > DE AltName: Full=Alpha-globulin; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 acidic chain; > DE AltName: Full=11S globulin seed storage protein II acidic chain; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 basic chain; > DE AltName: Full=11S globulin seed storage protein II basic chain; > DE Flags: Precursor; > > really only the first line, with the 'RecName: Full=' removed, is the > description line as we know it. The rest, I would say, is annotation, such > as two alternative names, amino acid chains contained in the full record > (shouldn't this be feature annotation, really? 
and indeed it is - why it
> needs to be repeated here is beyond me) and their names as well as
> alternative names, and the fact that the sequence is a precursor form.
>
> Leaving all this in one string has the advantage that we can round-trip it
> (and there is probably hardly any other way to accomplish that), but clearly
> in terms of semantics this isn't the sequence description as we know it
> anymore.
>
> Does anyone else think too that completely changing the semantics of
> sequence annotation fields is a bad idea?

+1 That's pretty much what I thought on seeing this the first time.

> My inclination from a BioPerl perspective is to extract the part following
> 'RecName: Full=' as the description, and attach the rest as annotation. We
> could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> too to gather what other BioPerl'ers think about this.

Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x, just
treats the DE lines as one big long string?

Could you translate your idea about the TagTree class into something
concrete with BioSQL tables and fields for me? I'm not familiar with the
TagTree (or Perl). Over on the Biopython list we'd talked about storing
this annotation in a nested structure.
However, in order to use the BioSQL annotations mechanisms, I think a simple flat structure is required :( Peter From biopython at maubp.freeserve.co.uk Sat May 16 23:28:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 00:28:43 +0100 Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> Message-ID: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> On 5/17/09, Chris Fields wrote: > > On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote: > > My inclination from a BioPerl perspective is to extract the part following > > 'RecName: Full=' as the description, and attach the rest as annotation. We > > could in fact use the TagTree class for this. I'm cross-posting to BioPerl > > too to gather what other BioPerl'ers think about this. > > > > -hilmar > > > > This is much like the GN issues we've run into before, and we *could* set > this up using TagTree or similar. In the latter case of gene name the data > is stored in a text tree as follows: > > gene_names: > gene_name: > Name: GC1QBP > Synonyms: HABP1 > Synonyms: SF2P32 > Synonyms: C1QBP > > That could be changed to an XML string: > > > > > GC1QBP > HABP1 > SF2P32 > C1QBP > > > > Thinking about this we should attempt to coalesce around a standard instead > of forcing the other Bio* to a specific format. How would you record this in BioSQL? As an XML string for an annotation value? Brad has suggested JSON might be useful for this kind of thing (see also per-letter-annotation discussion). 
Peter From hlapp at gmx.net Sat May 16 23:37:14 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 19:37:14 -0400 Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> Message-ID: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> On May 16, 2009, at 7:28 PM, Peter wrote: >> That could be changed to an XML string: >> >> >> >> >> GC1QBP >> HABP1 >> SF2P32 >> C1QBP >> >> >> >> Thinking about this we should attempt to coalesce around a standard >> instead >> of forcing the other Bio* to a specific format. > > How would you record this in BioSQL? As an XML string for an > annotation value? Yes. A TagTree object can be serialized to XML, and the XML can be stored as the annotation value in BioSQL. As the XML can be read back in, it allows full round-tripping. > Brad has suggested JSON might be useful for this kind of thing (see > also per-letter-annotation discussion). JSON could be another serialization format, but XML is equally or better supported in all languages except JavaScript. Furthermore, you could just send the XML to the browser and have an XSLT (either directly, or indirectly through JavaScript doing the transformation) do the rendering. 
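The round-tripping argument above can be illustrated with a small Python sketch. This is not BioPerl's TagTree; it only assumes that the nested DE annotation can be held as (tag, text, children) tuples and serialized with the standard library's xml.etree.ElementTree. The element names simply reuse the UniProt field names, and the names DE_TREE, tree_to_xml and xml_to_tree are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested form of the DE block: (tag, text, children).
DE_TREE = ("DE", None, [
    ("RecName", "Full=11S globulin seed storage protein 2", []),
    ("AltName", "Full=11S globulin seed storage protein II", []),
    ("AltName", "Full=Alpha-globulin", []),
    ("Contains", None, [
        ("RecName", "Full=11S globulin seed storage protein 2 acidic chain", []),
        ("AltName", "Full=11S globulin seed storage protein II acidic chain", []),
    ]),
    ("Contains", None, [
        ("RecName", "Full=11S globulin seed storage protein 2 basic chain", []),
        ("AltName", "Full=11S globulin seed storage protein II basic chain", []),
    ]),
    ("Flags", "Precursor", []),
])


def tree_to_xml(node):
    """Turn a (tag, text, children) tuple into an Element, recursively."""
    tag, text, children = node
    elem = ET.Element(tag)
    elem.text = text
    for child in children:
        elem.append(tree_to_xml(child))
    return elem


def xml_to_tree(elem):
    """Inverse of tree_to_xml: rebuild the nested tuples from an Element."""
    return (elem.tag, elem.text, [xml_to_tree(child) for child in elem])


# The XML string is what would be stored as a single annotation value.
xml_string = ET.tostring(tree_to_xml(DE_TREE), encoding="unicode")

# Reading it back gives the same nested structure (a full round trip).
restored = xml_to_tree(ET.fromstring(xml_string))
```

The point of the sketch is only that one string column can carry the whole nested structure; how the Bio* projects would agree on element names is left open, as in the discussion.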
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sat May 16 23:42:17 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 19:42:17 -0400 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> Message-ID: <8CD4EED1-A689-447F-8F6E-8D2204DD4E86@gmx.net> On May 16, 2009, at 7:14 PM, Peter wrote: > Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x > just > treats the DE lines as only big long string? Yes. > Could you translate your idea about the TagTree class into something > concrete with BioSQL tables and fields for me? [...] Over on the > Biopython list we'd talked about storing this annotation in a nested > structured. That's more or less what TagTree is. > However, in order to use the BioSQL annotations mechanisms, I think > a simple flat structure is required :( Not necessarily. If you have a flat serialization (such as XML) the nested structure isn't needed. Of course that's not a fully normalized relational representation, but if you had one, how often would it be used, how efficient would those queries be (SQL is poor at nested or recursive data structures), and how much pain would it be to write the object-relational mappings? 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Sun May 17 12:40:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 13:40:47 +0100 Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> Message-ID: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> On 5/17/09, Hilmar Lapp wrote: > > On May 16, 2009, at 7:28 PM, Peter wrote: > > > That could be changed to an XML string: > > > > > > > > > > > > > > > GC1QBP > > > HABP1 > > > SF2P32 > > > C1QBP > > > > > > > > > > > > Thinking about this we should attempt to coalesce around a standard > > > instead of forcing the other Bio* to a specific format. Absolutely - some common standard should be agreed. Would you envision doing this for other structured fields, inventing a new mini XML format each time? That seems open ended and likely to cause a lot of work keeping all the Bio* project synchronised. Here you have mapped RecName and AltName fields in the DE lines to Name and Synonyms (shouldn't that be Synonym singular?). I also don't get why you have used a gene_name entry inside a gene_names list. Would you hold the contains information and the flags information from the DE lines in separate XML entries? I would have gone for something much closer to the original DE line markup i.e. using the field names UniProt use, RecName and AltName, rather than mapping these to Name and Synonym. > > How would you record this in BioSQL? 
As an XML string for an annotation > > value? > > Yes. A TagTree object can be serialized to XML, and the XML can be stored > as the annotation value in BioSQL. As the XML can be read back in, it allows > full round-tripping. Assuming you stored all the DE markup, then yes, a round trip back to the SwissProt file could be possible. And, depending on the details of the XML structure used, it would be possible to represent this in a python structure too. > > Brad has suggested JSON might be useful for this kind of thing (see > > also per-letter-annotation discussion). > > JSON could be another serialization format, but XML is equally or better > supported in all languages except JavaScript. Furthermore, you could just > send the XML to the browser and have an XSLT (either directly, or indirectly > through JavaScript doing the transformation) do the rendering. I have no strong preference for either XML or JSON (but would rather avoid them if they are not really needed). For other types of annotation there may be a clearer advantage for one over the other, e.g. per letter annotation like the secondary structure of a protein sequence, or the quality scores of a nucleotide contig. On 5/17/09, Hilmar Lapp wrote: > Not necessarily. If you have a flat serialization (such as XML) the nested > structure isn't needed. Of course that's not a fully normalized relational > representation, but if you had one, how often would it be used, how > efficient would those queries be (SQL is poor at nested or recursive data > structures), and how much pain would it be to write the object-relational > mappings? In this example, searching the database using one of the SwissProt AltNames (synonyms), or filtering on the Flags sounds like a reasonable request - but this would be very difficult if the data is stored inside XML strings. Of course, because the RecName and AltName entries are top level, we could just record them as normal - simple strings in the annotations table. 
This seems much nicer. Likewise the "Flags: Precursor;" line.

i.e. listing the tag/value pairs which could be used in the
bioentry_qualifier_value table:

AltName = "Full=11S globulin seed storage protein II"
AltName = "Full=Alpha-globulin"
Flags = "Precursor"

(the RecName field, "Full=11S globulin seed storage protein 2", could be
used for the bioentry.description instead)

The above are all pretty easy. We only need to consider nesting (or
something like XML or JSON) for some of the DE information, in the example
discussed the Contains lines. Even this could be done by storing each
contains entry as a single long string (holding both the name and
synonyms) directly from the DE line itself, something like this:

Contains = "RecName: Full=11S globulin seed storage protein 2 acidic chain;\nAltName: Full=11S globulin seed storage protein II acidic chain;"
Contains = "RecName: Full=11S globulin seed storage protein 2 basic chain;\nAltName: Full=11S globulin seed storage protein II basic chain;"

Peter

From hlapp at gmx.net Sun May 17 15:21:59 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 17 May 2009 11:21:59 -0400
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL
In-Reply-To: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
Message-ID:

On May 17, 2009, at 8:40 AM, Peter wrote:
> On 5/17/09, Hilmar Lapp wrote:
>>
>> On May 16, 2009, at 7:28 PM, Peter wrote:
>>>> That could be changed to an XML string:
>>>>
>>>>
>>>>
>>>>
>>>> GC1QBP
>>>> HABP1
>>>> SF2P32
>>>> C1QBP
>>>>
>>>>
>>>>
>>>> Thinking about this we should attempt to coalesce around a standard
>>>> instead of
forcing the other Bio* to a specific format. > > [...] Here you have mapped RecName and AltName fields in the DE > lines to > Name and Synonyms (shouldn't that be Synonym singular?). The example is for the GN lines in SwissProt, not the DE lines. > [...] > On 5/17/09, Hilmar Lapp wrote: >> Not necessarily. If you have a flat serialization (such as XML) the >> nested >> structure isn't needed. Of course that's not a fully normalized >> relational >> representation, but if you had one, how often would it be used, how >> efficient would those queries be (SQL is poor at nested or >> recursive data >> structures), and how much pain would it be to write the object- >> relational >> mappings? > > In this example, searching the database using one of the SwissProt > AltNames (synonyms), or filtering on the Flags sounds like a > reasonable request - but this would be very difficult if the data is > stored inside XML strings. Actually no. Modern full-text indexers (inside or outside the database) can index XML text columns right away and very well. In fact, for the last project that I built a full-text search for (on top of a BioSQL database) I did that by writing custom XML documents to a separate table for each record I wanted indexed. Oracle's full text indexer did the rest. I also built a separate identifier/name/ accession index that pulled all the gene names, symbols, accession numbers, identifiers etc into a single table for indexing. What I mean is, a fully normalized relational representation, especially if nested, is often not the most efficient data structure for efficient searching and filtering. 
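For comparison, the flat tag/value layout proposed earlier in this thread (top-level fields as plain strings, each Contains block kept as one long string) could be produced along these lines. This is only a sketch: flatten_de is a hypothetical helper, not part of Biopython, and it assumes the usual UniProt flat-file layout in which top-level DE fields follow three spaces and fields nested under "Contains:" follow five.

```python
# DE lines for Q9XHP0, with the indentation assumed as described above.
DE_LINES = """\
DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;
"""


def flatten_de(text):
    """Return (description, [(tag, value), ...]) from a block of DE lines."""
    description = None
    pairs = []
    contains = None  # lines of the Contains block currently being collected

    def close_block():
        # Store a finished Contains block as one newline-joined string.
        if contains is not None:
            pairs.append(("Contains", "\n".join(contains)))

    for line in text.splitlines():
        body = line[2:]                      # drop the "DE" line code
        nested = body.startswith("     ")    # five spaces: inside Contains
        field = body.strip()
        if field == "Contains:":
            close_block()
            contains = []
        elif nested and contains is not None:
            contains.append(field)
        else:
            close_block()
            contains = None
            tag, value = field.rstrip(";").split(": ", 1)
            if tag == "RecName" and description is None:
                description = value          # candidate bioentry.description
            else:
                pairs.append((tag, value))
    close_block()
    return description, pairs


description, pairs = flatten_de(DE_LINES)
```

With this layout an exact-match query on a tag such as AltName or Flags stays a plain lookup on the annotation value, at the cost of leaving the Contains blocks unparsed.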
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bugzilla-daemon at portal.open-bio.org Sun May 17 22:53:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 17 May 2009 18:53:13 -0400 Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic nucleotide alphabet In-Reply-To: Message-ID: <200905172253.n4HMrDIX006938@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2829 ------- Comment #3 from david.wyllie at ndm.ox.ac.uk 2009-05-17 18:53 EST ------- (In reply to comment #2) > See: > http://lists.open-bio.org/pipermail/biosql-l/2009-May/001515.html > Hi thank you very much for explaining. I'm not sure this is a bug, it's a design feature due to my not understanding the implications of generic_nucleotide. I know it's DNA, and if one uses generic_dna instead in the testcase, all is well. Alphabets are explained clearly in the documentation. Thank you again. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon May 18 10:08:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 18 May 2009 06:08:45 -0400 Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic nucleotide alphabet In-Reply-To: Message-ID: <200905181008.n4IA8j0J015956@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2829 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WONTFIX ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-18 06:08 EST ------- (In reply to comment #3) > Hi > > thank you very much for explaining. > > I'm not sure this is a bug, it's a design feature due to my > not understanding the implications of generic_nucleotide. As I argued on the BioSQL mailing list, generic nucleotide sequences are a valid case not catered to at the moment. However, they are a corner case, and have no equivalent in BioPerl (which is happy to guess at DNA or RNA). Marking this bug as WON'T FIX. > I know it's DNA, and if one uses generic_dna instead in > the testcase, all is well. Good - if you know you have DNA, then specifying a DNA alphabet would be my recommended course of action. > Alphabets are explained clearly in the documentation. > Thank you again. Let us know if you find anything that needs further clarification in the documentation. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Mon May 18 13:38:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 May 2009 14:38:03 +0100 Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> Message-ID: <320fb6e00905180638q29de63c4if0627eff416c4481@mail.gmail.com> On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp wrote: > > On May 17, 2009, at 8:40 AM, Peter wrote: >> >> [...] Here you have mapped RecName and AltName fields in the DE lines to >> Name and Synonyms (shouldn't that be Synonym singular?). > > The example is for the GN lines in SwissProt, not the DE lines. Ah, that probably explains some of my confusion. >> In this example, searching the database using one of the SwissProt >> AltNames (synonyms), or filtering on the Flags sounds like a >> reasonable request - but this would be very difficult if the data is >> stored inside XML strings. > > Actually no. Modern full-text indexers (inside or outside the database) can > index XML text columns right away and very well. In fact, for the last > project that I built a full-text search for (on top of a BioSQL database) I > did that by writing custom XML documents to a separate table for each > record I wanted indexed. Oracle's full text indexer did the rest. I also built a > separate identifier/name/accession index that pulled all the gene names, > symbols, accession numbers, identifiers etc into a single table for > indexing. OK, when I said searching "would be very difficult if the data is stored inside XML strings", maybe it wasn't so difficult for you - but that still sounds complicated! 
Sticking with the GN lines and the synonym, if this was stored as a simple tag/value as usual in BioSQL, I would write my SQL statement to search the annotation table where the term id was that associated with a GN synonym, and the annotation value was "HABP1". Simple. Using the XML approach, are you suggesting you could do a full text search on the annotation value field, looking for any rows where the field contains "HABP1", where the term id matches the GN lines' XML string? This sounds simplistic and probably rather slow - presumably why you resorted to the more complicated indexing scheme described above? > What I mean is, a fully normalized relational representation, especially if > nested, is often not the most efficient data structure for efficient > searching and filtering. OK. But do we really need to worry about complex nested structures for the SwissProt annotation (or in general)? Peter From biopython at maubp.freeserve.co.uk Tue May 19 14:23:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 May 2009 15:23:58 +0100 Subject: [Biopython-dev] [Biopython] Parsing large blast files In-Reply-To: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com> References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> <290052.25369.qm@web62407.mail.re1.yahoo.com> <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com> Message-ID: <320fb6e00905190723u2eca08e6o3f70bf37be79e4bf@mail.gmail.com> Last month on this thread we started talking about the BLAST command line wrappers: http://lists.open-bio.org/pipermail/biopython/2009-April/005134.html On Wed, Apr 29, 2009, Peter wrote: > On Wed, Apr 29, 2009, Michiel de Hoon wrote: >> >> How would users typically use Bio.Blast.Applications? 
>
> In the next release, I would aim to have Bio.Blast.Applications
> updated to cover blastall (fully), plus blastpgp and rpsblast
> (currently not covered) and for the three helper functions
> Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use
> Bio.Blast.Applications internally.

That should be done now in CVS - it turned out to be a lot more tedious
than I had expected, but I think we are OK.

I would be very grateful to have a couple of people test this out. At the
very least, just update your copy of Biopython and confirm any existing
scripts using the Bio.Blast.NCBIStandalone blastall, blastpgp or rpsblast
functions still work as expected.

Note we still need to agree on the preferred name for each parameter
(i.e. what do we use for the python properties) as discussed on this
thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005976.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006039.html

Peter

From biopython at maubp.freeserve.co.uk Tue May 19 17:00:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 May 2009 18:00:41 +0100
Subject: [Biopython-dev] Repeated options in command line interfaces
Message-ID: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com>

Hello all,

Yes - it's another thread about command line wrappers!

One of the Roche 454 off-instrument applications is runMapping, which in
the most general situation allows you to map one or more SFF files onto
one or more FASTA files, e.g.

runMapping -o ~/test -ref example1.fasta example2.fasta -read data1.sff data2.sff

Notice that "-ref" and "-read" are not repeated, so we could treat this
via the current application wrapper system as follows:

#These modules don't exist (yet):
from Bio.Sequencing.Applications import RunMappingCommandline
cline = RunMappingCommandline()
cline.ref = "example1.fasta example2.fasta"
cline.read = "data1.sff data2.sff"

This isn't very elegant, but would work.

Over on Bug 2815, Cymon and I have briefly discussed the --seed parameter
in Mafft, which is used to specify one or more alignment files, e.g.

mafft ... --seed alignment1 --seed alignment2 --seed alignment3 ...

Notice that "--seed" is repeated before each value. I was thinking it
would be nice to treat this as a single property (seed) which takes a
list of strings as its value:

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline()
cline.seed = ["alignment1", "alignment2", ...]

or, equivalently:

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline(seed=["alignment1", "alignment2", ...])

or, using the old set_parameter approach,

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline()
cline.set_parameter("seed", ["alignment1", "alignment2", ...])

and similarly for a Roche wrapper, e.g.

#These modules don't exist (yet):
from Bio.Sequencing.Applications import RunMappingCommandline
cline = RunMappingCommandline()
cline.ref = ["example1.fasta", "example2.fasta"]
cline.read = ["data1.sff", "data2.sff"]

Doing this nicely would require two _Option subclasses in
Bio.Application, one for repeated options like "seed" in Mafft, and one
for multiple valued options like "ref" and "read" in the Roche tools.

Does this sound sensible? Does anyone have any more examples?
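
To make the two cases concrete, here is a minimal standalone sketch of how the two kinds of option might render a list of values. The classes RepeatedOption and MultiValueOption are hypothetical stand-ins written for this illustration; they are not the _Option classes in Bio.Application and do not reflect its actual interface.

```python
class RepeatedOption:
    """Flag repeated before every value, e.g. mafft --seed a --seed b."""

    def __init__(self, flag):
        self.flag = flag

    def render(self, values):
        args = []
        for value in values:
            args.extend([self.flag, value])
        return args


class MultiValueOption:
    """Flag given once, followed by all values, e.g. runMapping -ref a b."""

    def __init__(self, flag):
        self.flag = flag

    def render(self, values):
        values = list(values)
        if not values:
            return []  # omit the flag entirely when there is nothing to pass
        return [self.flag] + values


seed = RepeatedOption("--seed")
ref = MultiValueOption("-ref")
read = MultiValueOption("-read")

mafft_args = ["mafft"] + seed.render(["alignment1", "alignment2"])
mapping_args = (["runMapping"]
                + ref.render(["example1.fasta", "example2.fasta"])
                + read.render(["data1.sff", "data2.sff"]))

print(" ".join(mafft_args))
print(" ".join(mapping_args))
```

Returning a list of arguments rather than one joined string is a design choice made here so that file names containing spaces would not need extra quoting if the list were passed straight to subprocess.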
Peter From bugzilla-daemon at portal.open-bio.org Wed May 20 16:31:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 20 May 2009 12:31:24 -0400 Subject: [Biopython-dev] [Bug 2833] New: Features insertion on previous bioentry_id Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2833 Summary: Features insertion on previous bioentry_id Product: Biopython Version: 1.50 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P1 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: andrea at biodec.com Biopython 1.50 (also 1.50b it's the same code) python2.4 or python2.5 postgresql 8.3 BioSQL Schema 1.0.1 Problem: imagine to have 3 seqrecord (s1,s2,s3), imagine that - s1 == s3 (but from different sources....) in other words s1 and s3 are not the same object - s2 != s1 and s2 != s3 imagine to load a Biosql db in this order: - db.load([s1]) - db.load([s2]) - db.load([s3]) At the end of the loading i will have only 2 bioentry ID BUT the s3.features will be inserted on s2 seqrecord. --------------------------------------------------------------------------------------- More in details (documented behaviour): print s1 ID: ENST00000334859 Name: ENST00000334859 Description: Leucine-rich repeat and calponin homology domain-containing protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8] Number of features: 24 /source= /taxonomy=[] /keywords=[''] /accessions=['ENST00000334859'] /data_file_division=UNK /date=01-JAN-1980 /organism=. . /gi=ENST00000334859 Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA', IUPACAmbiguousDNA()) print s2 ID: ENST00000391466 Name: ENST00000391466 Description: CDNA FLJ44976 fis, clone BRAWH3001833. [Source:Uniprot/SPTREMBL;Acc:Q6ZQT1] Number of features: 8 /source= /taxonomy=[] /keywords=[''] /accessions=['ENST00000391466'] /data_file_division=UNK /date=01-JAN-1980 /organism=. . 
/gi=ENST00000391466 Seq('ATGACAGTGATTCTCTTTACCCAACTCACCGCACCCATGGCAGTGATTCTCTTT...TAG', IUPACAmbiguousDNA()) print s3 ID: ENST00000334859 Name: ENST00000334859 Description: Leucine-rich repeat and calponin homology domain-containing protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8] Number of features: 24 /source= /taxonomy=[] /keywords=[''] /accessions=['ENST00000334859'] /data_file_division=UNK /date=01-JAN-1980 /organism=. . /gi=ENST00000334859 Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA', IUPACAmbiguousDNA()) As you can see: - s1 and S3 are identical and s2 differs from them. - s1 and s3 has 24 features - s2 has 8 features STEP 1 (biosql insertion of s1) - db.load([s1]) - looking into the db: select bioentry_id, name, accession, identifier from bioentry; bioentry_id | name | accession | identifier | -------------+-----------------+-----------------+-----------------+ 39 | ENST00000334859 | ENST00000334859 | ENST00000334859 | (1 row) select * from seqfeature; select * from seqfeature; seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank ---------------+-------------+--------------+----------------+--------------+------ 291 | 39 | 27 | 15 | | 1 292 | 39 | 27 | 15 | | 2 293 | 39 | 27 | 15 | | 3 294 | 39 | 27 | 15 | | 4 295 | 39 | 27 | 15 | | 5 296 | 39 | 14 | 15 | | 6 297 | 39 | 14 | 15 | | 7 298 | 39 | 30 | 15 | | 8 299 | 39 | 30 | 15 | | 9 300 | 39 | 30 | 15 | | 10 301 | 39 | 30 | 15 | | 11 302 | 39 | 30 | 15 | | 12 303 | 39 | 30 | 15 | | 13 304 | 39 | 30 | 15 | | 14 305 | 39 | 30 | 15 | | 15 306 | 39 | 30 | 15 | | 16 307 | 39 | 30 | 15 | | 17 308 | 39 | 25 | 15 | | 18 309 | 39 | 25 | 15 | | 19 310 | 39 | 25 | 15 | | 20 311 | 39 | 25 | 15 | | 21 312 | 39 | 25 | 15 | | 22 313 | 39 | 26 | 15 | | 23 314 | 39 | 26 | 15 | | 24 (24 rows) STEP 2 (biosql insertion of s2) - db.load([s2]) - looking into the db: select bioentry_id, name, accession, identifier from bioentry; bioentry_id | name | accession | 
identifier -------------+-----------------+-----------------+----------------- 39 | ENST00000334859 | ENST00000334859 | ENST00000334859 40 | ENST00000391466 | ENST00000391466 | ENST00000391466 (2 rows) select * from seqfeature; seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank ---------------+-------------+--------------+----------------+--------------+------ 291 | 39 | 27 | 15 | | 1 292 | 39 | 27 | 15 | | 2 293 | 39 | 27 | 15 | | 3 294 | 39 | 27 | 15 | | 4 295 | 39 | 27 | 15 | | 5 296 | 39 | 14 | 15 | | 6 297 | 39 | 14 | 15 | | 7 298 | 39 | 30 | 15 | | 8 299 | 39 | 30 | 15 | | 9 300 | 39 | 30 | 15 | | 10 301 | 39 | 30 | 15 | | 11 302 | 39 | 30 | 15 | | 12 303 | 39 | 30 | 15 | | 13 304 | 39 | 30 | 15 | | 14 305 | 39 | 30 | 15 | | 15 306 | 39 | 30 | 15 | | 16 307 | 39 | 30 | 15 | | 17 308 | 39 | 25 | 15 | | 18 309 | 39 | 25 | 15 | | 19 310 | 39 | 25 | 15 | | 20 311 | 39 | 25 | 15 | | 21 312 | 39 | 25 | 15 | | 22 313 | 39 | 26 | 15 | | 23 314 | 39 | 26 | 15 | | 24 315 | 40 | 28 | 15 | | 1 316 | 40 | 28 | 15 | | 2 317 | 40 | 28 | 15 | | 3 318 | 40 | 28 | 15 | | 4 319 | 40 | 28 | 15 | | 5 320 | 40 | 28 | 15 | | 6 321 | 40 | 28 | 15 | | 7 322 | 40 | 28 | 15 | | 8 (32 rows) STEP 3 (biosql insertion of s3) - db.load([s3]) - looking into the db: select bioentry_id, name, accession, identifier from bioentry; bioentry_id | name | accession | identifier -------------+-----------------+-----------------+----------------- 39 | ENST00000334859 | ENST00000334859 | ENST00000334859 40 | ENST00000391466 | ENST00000391466 | ENST00000391466 (2 rows) select * from seqfeature; seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank ---------------+-------------+--------------+----------------+--------------+------ 291 | 39 | 27 | 15 | | 1 292 | 39 | 27 | 15 | | 2 293 | 39 | 27 | 15 | | 3 294 | 39 | 27 | 15 | | 4 295 | 39 | 27 | 15 | | 5 296 | 39 | 14 | 15 | | 6 297 | 39 | 14 | 15 | | 7 298 | 39 | 30 | 15 | | 8 299 | 39 | 30 | 15 
| | 9 300 | 39 | 30 | 15 | | 10 301 | 39 | 30 | 15 | | 11 302 | 39 | 30 | 15 | | 12 303 | 39 | 30 | 15 | | 13 304 | 39 | 30 | 15 | | 14 305 | 39 | 30 | 15 | | 15 306 | 39 | 30 | 15 | | 16 307 | 39 | 30 | 15 | | 17 308 | 39 | 25 | 15 | | 18 309 | 39 | 25 | 15 | | 19 310 | 39 | 25 | 15 | | 20 311 | 39 | 25 | 15 | | 21 312 | 39 | 25 | 15 | | 22 313 | 39 | 26 | 15 | | 23 314 | 39 | 26 | 15 | | 24 315 | 40 | 28 | 15 | | 1 316 | 40 | 28 | 15 | | 2 317 | 40 | 28 | 15 | | 3 318 | 40 | 28 | 15 | | 4 319 | 40 | 28 | 15 | | 5 320 | 40 | 28 | 15 | | 6 321 | 40 | 28 | 15 | | 7 322 | 40 | 28 | 15 | | 8 323 | 40 | 27 | 15 | | 1 324 | 40 | 27 | 15 | | 2 325 | 40 | 27 | 15 | | 3 326 | 40 | 27 | 15 | | 4 327 | 40 | 27 | 15 | | 5 328 | 40 | 14 | 15 | | 6 329 | 40 | 14 | 15 | | 7 330 | 40 | 30 | 15 | | 8 331 | 40 | 30 | 15 | | 9 332 | 40 | 30 | 15 | | 10 333 | 40 | 30 | 15 | | 11 334 | 40 | 30 | 15 | | 12 335 | 40 | 30 | 15 | | 13 336 | 40 | 30 | 15 | | 14 337 | 40 | 30 | 15 | | 15 338 | 40 | 30 | 15 | | 16 339 | 40 | 30 | 15 | | 17 340 | 40 | 25 | 15 | | 18 341 | 40 | 25 | 15 | | 19 342 | 40 | 25 | 15 | | 20 343 | 40 | 25 | 15 | | 21 344 | 40 | 25 | 15 | | 22 345 | 40 | 26 | 15 | | 23 346 | 40 | 26 | 15 | | 24 (56 rows) As you can easily see the 24 feature of s3 seqrecord has been added to the bioentry_id 40 (that was s2). ------------------------------------------------------------------------------------ The problem is not so easy to understand. I tried to have a look into the code of Loader.py and i found something: the code works in this way: 1) it tries to load the seqrecord using: load_seqrecord(self, record) this method as first thing tries to load the bioentry table with the method: _load_bioentry_table(self, record) this method at last thing tries to get the bioentry_id of the "just inserted" record with the db method: self.adaptor.last_id('bioentry') 2) then with the bioentry_id recovered from the first method it tries to fill the other tables...and also the seqfeature... 
3) In BioSQL (the schema), if you try to insert a record into the bioentry table that has the same Identifier or Accession as an existing record, it doesn't do anything and just tells you "INSERT 0 0"
4) So, if you try to insert the s3 record, which has the same Accession and Identifier as s1, the load_seqrecord(self, record) method will end up with the bioentry_id of the s2 record (that is what self.adaptor.last_id('bioentry') returns)

Maybe other information will be transferred to s2 as well (not only the features). For example "dbxrefs" could suffer from the same problem.

I think the solution depends on what we expect from the code:
- if we expect a behaviour like "don't do anything with identical Accession/Identifier", it is better to check the last_id before and after insertion and return None if it is identical, then treat a "None" bioentry_id as a block on the other BioSQL insertions.
- if we expect a "Merge" behaviour, it is better to retrieve the bioentry_id of the object with the same Accession/Identifier, then verify that the 2 seqrecords have identical sequences, and then merge features/annotations/dbxrefs, etc.
- other behaviours... other solutions...

Andrea

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
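
[Editorial sketch] The sequence described in points 1) to 4) can be imitated without PostgreSQL. This is a minimal sketch, assuming SQLite's INSERT OR IGNORE as a stand-in for the schema silently skipping the duplicate row, and MAX(bioentry_id) as a stand-in for self.adaptor.last_id('bioentry'). The table layout and the helper names (last_bioentry_id, add_features, naive_load, guarded_load, feature_count) are hypothetical; this is not the code in Loader.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bioentry (
        bioentry_id INTEGER PRIMARY KEY,
        accession   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE seqfeature (
        seqfeature_id INTEGER PRIMARY KEY,
        bioentry_id   INTEGER NOT NULL,
        rank          INTEGER NOT NULL
    );
""")


def last_bioentry_id(conn):
    # Hypothetical stand-in for adaptor.last_id('bioentry').
    return conn.execute("SELECT MAX(bioentry_id) FROM bioentry").fetchone()[0]


def add_features(conn, bioentry_id, n_features):
    for rank in range(1, n_features + 1):
        conn.execute(
            "INSERT INTO seqfeature (bioentry_id, rank) VALUES (?, ?)",
            (bioentry_id, rank),
        )


def naive_load(conn, accession, n_features):
    # A duplicate accession is skipped without any error (like "INSERT 0 0").
    conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession) VALUES (?)", (accession,)
    )
    bioentry_id = last_bioentry_id(conn)  # stale if nothing was inserted
    add_features(conn, bioentry_id, n_features)
    return bioentry_id


def guarded_load(conn, accession, n_features):
    # First option from the message: compare last_id before and after.
    before = last_bioentry_id(conn)
    conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession) VALUES (?)", (accession,)
    )
    after = last_bioentry_id(conn)
    if after == before:
        return None  # nothing was inserted, so do not touch other tables
    add_features(conn, after, n_features)
    return after


def feature_count(conn, bioentry_id):
    return conn.execute(
        "SELECT COUNT(*) FROM seqfeature WHERE bioentry_id = ?", (bioentry_id,)
    ).fetchone()[0]


s1_id = naive_load(conn, "ENST00000334859", 24)
s2_id = naive_load(conn, "ENST00000391466", 8)
s3_id = naive_load(conn, "ENST00000334859", 24)  # same accession as s1

print(s1_id, s2_id, s3_id)
print(feature_count(conn, s2_id))
print(guarded_load(conn, "ENST00000334859", 24))
```

In this toy setup the third load reports the id of the second entry, and that entry ends up holding both its own 8 features and the 24 from the duplicate, which mirrors the 56-row seqfeature table in the report. The guarded variant returns None instead of writing features.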
From bugzilla-daemon at portal.open-bio.org Wed May 20 20:25:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 20 May 2009 16:25:39 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905202025.n4KKPdYT020904@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-20 16:25 EST ------- (In reply to comment #0) > Biopython 1.50 (also 1.50b it's the same code) > python2.4 or python2.5 > postgresql 8.3 > BioSQL Schema 1.0.1 > > Problem: > imagine to have 3 seqrecord (s1,s2,s3), ... load a Biosql db in this order: > - db.load([s1]) > - db.load([s2]) > - db.load([s3]) > > At the end of the loading i will have only 2 bioentry ID > BUT the s3.features will be inserted on s2 seqrecord. BioSQL will allow you to have multiple versions of the same record but they must have different versions (e.g. s1.id="ENST00000334859.0" and s3.id="ENST00000334859.1" should work). The problem with your data is s1.id == s3.id, so I would expect them to get the same accession and version (taken as zero). Therefore s3 should *fail* to load. I can try and reproduce this using the information given, but it would help if you could attach the original sequence files to this bug. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed May 20 21:07:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 20 May 2009 17:07:08 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905202107.n4KL78te024053@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-20 17:07 EST ------- (In reply to comment #0) > Biopython 1.50 (also 1.50b it's the same code) > python2.4 or python2.5 > postgresql 8.3 > BioSQL Schema 1.0.1 What version of psycopg are you using? i.e. The python library for talking to PostgreSQL. Have you tried running Biopython's BioSQL unit tests? You'll need to configure your settings in setup_BioSQL.py first. If that looks good could you try updating to the latest Biopython from CVS and retesting? I've added a basic check in test_BioSQL.py for duplicated entries (using a GenBank file) which works on my machine using MySQL. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 21 10:31:42 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 06:31:42 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905211031.n4LAVgvW019852@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #3 from andrea at biodec.com 2009-05-21 06:31 EST ------- Created an attachment (id=1299) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1299&action=view) Pickled Seqrecord s1 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu May 21 10:32:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 06:32:12 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905211032.n4LAWBXC019888@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #4 from andrea at biodec.com 2009-05-21 06:32 EST ------- Created an attachment (id=1300) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1300&action=view) Pickled Seqrecord s2 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu May 21 10:32:28 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 06:32:28 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905211032.n4LAWSlA019903@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #5 from andrea at biodec.com 2009-05-21 06:32 EST ------- Created an attachment (id=1301) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1301&action=view) Pickled Seqrecord s3 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu May 21 10:34:46 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 06:34:46 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905211034.n4LAYkhC020056@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #6 from andrea at biodec.com 2009-05-21 06:34 EST ------- Hi Peter, i did 4 tests: [python2.4,python2.5]*[psycopg,psycopg2] with - biopython from "this morning" cvs. - psycopg.__version__ '1.1.21' - psycopg2.__version__ '2.0.7 (dec mx dt ext pq3)' in any case i've the same results: Make sure all records are correctly loaded. ... ok Make sure can't import records twice. ... FAIL Indepth check that SeqFeatures are transmitted through the db. ... ok Load SeqRecord objects into a BioSQL database. ... ok Get a list of all items in the database. ... ok Test retrieval of items using various ids. ... ok Check can add DBSeq objects together. ... ok Check can turn a DBSeq object into a Seq or MutableSeq. ... ok Make sure Seqs from BioSQL implement the right interface. ... 
ok
Check SeqFeatures of a sequence. ... ok
Make sure SeqRecords from BioSQL implement the right interface. ... ok
Check that slices of sequences are retrieved properly. ... ok

======================================================================
FAIL: Make sure can't import records twice.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 374, in test_reload
    self.assert_("duplicate" in str(err).lower())
AssertionError

----------------------------------------------------------------------
Ran 12 tests in 23.815s

FAILED (failures=1)

i've 1 failure in "Make sure can't import records twice. ..."
it seems interesting for the problem...

Then i tried with python2.4, python2.5, psycopg, psycopg2
i attached the pickles of the 3 seqrecords so you can try by yourself...

###########################################################
from BioSQL import BioSeqDatabase
import cPickle

server = BioSeqDatabase.open_database(driver = "psycopg2",
                                      user = 'postgres',
                                      passwd = "hidden",
                                      host = "dbservertest",
                                      db = 'test_biosql' )

## LOAD SeqRecords from pickle
s1=cPickle.load(open('s1.cpk'))
s2=cPickle.load(open('s2.cpk'))
s3=cPickle.load(open('s3.cpk'))

## LOAD INTO DB
db=server.new_database('test')
server.commit()
db.load([s1])
db.load([s2])
db.load([s3])
db.adaptor.commit()
###########################################################

I had always the same problem.
So i prepare a buildout environment with the last Biopython and with a new psycopg2 library (for psycopg i had the latest).
psycopg2.__version__ '2.0.11 (dt dec ext pq3)'

The result from the test was the same
The result from the upload (based on pickled seqrecords) was the same

Thanks
Andrea

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu May 21 10:39:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:39:18 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To: 
Message-ID: <200905211039.n4LAdIit020365@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-21 06:39 EST -------
(In reply to comment #6)
> Hi Peter,
> i did 4 tests: [python2.4,python2.5]*[psycopg,psycopg2]
> with
> - biopython from "this morning" cvs.
> - psycopg.__version__ '1.1.21'
> - psycopg2.__version__ '2.0.7 (dec mx dt ext pq3)'
>
> in any case i've the same results:
>
> Make sure all records are correctly loaded. ... ok
> Make sure can't import records twice. ... FAIL
> ...
> ======================================================================
> FAIL: Make sure can't import records twice.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_BioSQL.py", line 374, in test_reload
>     self.assert_("duplicate" in str(err).lower())
> AssertionError

OK - the unit test is doing what I expected, and the duplicate insertion is failing. It's just that the error message is different to what I expected, which should be trivial to fix. This means inserting the same GenBank record twice fails (which is good).

However, the unit test doesn't reproduce your original issue. Hopefully your pickled SeqRecord objects will help there...

Thanks,

Peter

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu May 21 11:36:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 07:36:34 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To: 
Message-ID: <200905211136.n4LBaYO8024199@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-21 07:36 EST -------
(In reply to comment #7)
> However, the unit test doesn't reproduce your original issue. Hopefully
> your pickled SeqRecord objects will help there...

Based on your example script in comment 6 with the pickled SeqRecord objects, but using MySQL, I get an IntegrityError as expected:

Traceback (most recent call last):
...
IntegrityError: (1062, "Duplicate entry 'ENST00000334859-2-0' for key 2")

I get the same error with simplified records lacking any annotation or features (I just saved your three records to a FASTA file and reloaded them). So whatever is going wrong seems to be PostgreSQL specific (or at least, does not affect MySQL).

I have updated test_BioSQL.py in CVS to cover more variations (revision 1.33), and hopefully the error message check should work on PostgreSQL as well. It would be very helpful if you could test that.

Part of the new tests is a slight variation on your original example. Could you try this:

db.load([s1])
server.commit()
db.load([s2])
server.commit()
db.load([s3])
server.commit()

This might tell us if the issue is with PostgreSQL not checking the key constraints until the commit.

Peter

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
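
[Editorial sketch] The behaviour reported above for MySQL, where a uniqueness constraint rejects the duplicate load with an IntegrityError, can be reproduced in miniature with Python's standard sqlite3 module. This is only an illustration under stated assumptions: the bioentry table is reduced to accession and version, the helper try_load is hypothetical, and SQLite stands in for the real database servers discussed in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bioentry (
        bioentry_id INTEGER PRIMARY KEY,
        accession   TEXT NOT NULL,
        version     INTEGER NOT NULL,
        UNIQUE (accession, version)
    )
""")


def try_load(conn, accession, version=0):
    # Hypothetical helper: insert one record and commit straight away,
    # as in the load/commit/load/commit variation suggested above.
    try:
        conn.execute(
            "INSERT INTO bioentry (accession, version) VALUES (?, ?)",
            (accession, version),
        )
    except sqlite3.IntegrityError as err:
        conn.rollback()
        return "rejected: %s" % err
    conn.commit()
    return "loaded"


# s1, s2, then s3 with the same accession (and default version) as s1.
results = [
    try_load(conn, accession)
    for accession in ("ENST00000334859", "ENST00000391466", "ENST00000334859")
]
# The same accession with a different version is a distinct record.
versioned = try_load(conn, "ENST00000334859", version=1)

for line in results + [versioned]:
    print(line)
```

Here the constraint is checked when the INSERT runs, so the third load is rejected whether or not a commit happened in between, while the record with a different version is accepted.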
From chapmanb at 50mail.com Thu May 21 12:29:27 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 21 May 2009 08:29:27 -0400 Subject: [Biopython-dev] Repeated options in command line interfaces In-Reply-To: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com> References: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com> Message-ID: <20090521122927.GM84112@sobchak.mgh.harvard.edu> Hi Peter; > Yes - its another thread about command line wrappers! It seems like y'all are unearthing every single crazy command line option choice out there. Great to have this fleshed out. > One of the Roche 454 off instrument applications is runMapping, > which in the most general situation allows you to map one or > more SFF files onto one or more FASTA files, e.g. > > runMapping -o ~/test -ref example1.fasta example2.fasta -read > data1.sff data2.sff [...] > the --seed parameter in Mafft, which is used to specify one or more > alignment files, e.g. > > mafft ... --seed alignment1 --seed alignment2 --seed alignment3 ... > > Notice that "--seed" is repeated before each value. > > I was thinking it would be nice to treat this as a single > property (seed) which takes a list of strings as its value: > > from Bio.Align.Applications import MafftCommandline > cline = MafftCommandline() > cline.seed = ["alignment1", "alignment2", ...] [...] > #These modules don't exist (yet): > from Bio.Sequencing.Applications import RunMappingCommandline > cline = RunMappingCommandline() > cline.ref = ["example1.fasta", "example2.fasta"] > cline.read = ["data1.sff", "data2.sff"] This makes good sense to me. It hides the actual nastiness a bit and makes it clear in the code what is happening -- assigning multiple parameters to a single option. It sounds like a great way to handle it. 
Brad From bugzilla-daemon at portal.open-bio.org Thu May 21 15:04:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 11:04:40 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905211504.n4LF4ej0015238@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #9 from andrea at biodec.com 2009-05-21 11:04 EST ------- (In reply to comment #8) > (In reply to comment #7) > > However, the unit test doesn't reproduce your original issue. Hopefully > > your pickled SeqRecord objects will help there... > > Based on your example script in comment 6 with the pickled SeqRecord objects, > but using MySQL, I get an IntegrityError as expected: > > Traceback (most recent call last): > ... > IntegrityError: (1062, "Duplicate entry 'ENST00000334859-2-0' for key 2") > > I get the same error with simplified records lacking any annotation or features > (I just saved your three records to a FASTA file and reloaded them). So what > ever is going wrong seems to be PostgreSQL specific (or at least, does not > affect MySQL). According to me it's postgres specific the fact that i don't have any error at all. If biopython expects from postgres an error in this situation there are some problem in postgres (or in mine). > > I have updated test_BioSQL.py in CVS to cover more variations (revision 1.33), > and hopefully the error message check should work on PostgreSQL as well. It > would be very helpful if you could test that. This is te results of the test: it's the same on python2.4 and python2.5: Make sure can't import records with same ID (in one go). ... FAIL Make sure can't import records with same ID (in steps). ... FAIL Make sure can't import records with same ID (in steps with commit). ... FAIL Make sure can't import a single record twice (in one go). ... FAIL Make sure can't import a single record twice (in steps). ... 
FAIL Make sure can't import a single record twice (in steps with commit). ... FAIL Make sure all records are correctly loaded. ... ok Make sure can't reimport existing records. ... FAIL Indepth check that SeqFeatures are transmitted through the db. ... ok Load SeqRecord objects into a BioSQL database. ... ok Get a list of all items in the database. ... ok Test retrieval of items using various ids. ... ok Check can add DBSeq objects together. ... ok Check can turn a DBSeq object into a Seq or MutableSeq. ... ok Make sure Seqs from BioSQL implement the right interface. ... ok Check SeqFeatures of a sequence. ... ok Make sure SeqRecords from BioSQL implement the right interface. ... ok Check that slices of sequences are retrieved properly. ... ok ====================================================================== FAIL: Make sure can't import records with same ID (in one go). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 397, in test_duplicate_id_load err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! ====================================================================== FAIL: Make sure can't import records with same ID (in steps). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 410, in test_duplicate_id_load2 err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! ====================================================================== FAIL: Make sure can't import records with same ID (in steps with commit). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 424, in test_duplicate_id_load3 err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! 
====================================================================== FAIL: Make sure can't import a single record twice (in one go). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 361, in test_duplicate_load err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! ====================================================================== FAIL: Make sure can't import a single record twice (in steps). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 373, in test_duplicate_load2 err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! ====================================================================== FAIL: Make sure can't import a single record twice (in steps with commit). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 386, in test_duplicate_load3 err.__class__.__name__ + "\n" + str(err)) AssertionError: Exception Should have failed! ====================================================================== FAIL: Make sure can't reimport existing records. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 463, in test_reload err.__class__.__name__ + "\n" + str(err)) AssertionError: OperationalError currval of sequence "bioentry_pk_seq" is not yet defined in this session ---------------------------------------------------------------------- Ran 18 tests in 26.938s FAILED (failures=7) > > Part of the new tests is a slight variation on your original example. 
Could > you try this: > > db.load([s1]) > server.commit() > db.load([s2]) > server.commit() > db.load([s3]) > server.commit() > >>> ## LOAD INTO DB >>> db.load([s1]) 1 >>> server.commit() >>> db.load([s2]) 1 >>> server.commit() >>> db.load([s3]) 1 >>> server.commit() >>> i don't have any errors!!! > This might tell us if the issue is with PostgreSQL not checking the key > constraints until the commit. > it seems that. If i try to do the insertion via SQL i don't have any errors. I just have a message of the type: INSERT 0 0 due to the fact the postgres doesn't insert anything. Andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu May 21 17:05:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 13:05:12 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905211705.n4LH5Ca6028981@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-21 13:05 EST ------- Well, some progress :) (In reply to comment #9) > This is te results of the test: it's the same on python2.4 and python2.5: > Make sure can't import records with same ID (in one go). ... FAIL > Make sure can't import records with same ID (in steps). ... FAIL > Make sure can't import records with same ID (in steps with commit). ... FAIL > Make sure can't import a single record twice (in one go). ... FAIL > Make sure can't import a single record twice (in steps). ... FAIL > Make sure can't import a single record twice (in steps with commit). ... FAIL > Make sure all records are correctly loaded. ... ok > Make sure can't reimport existing records. ... 
FAIL > Indepth check that SeqFeatures are transmitted through the db. ... ok > Load SeqRecord objects into a BioSQL database. ... ok > Get a list of all items in the database. ... ok > Test retrieval of items using various ids. ... ok > Check can add DBSeq objects together. ... ok > Check can turn a DBSeq object into a Seq or MutableSeq. ... ok > Make sure Seqs from BioSQL implement the right interface. ... ok > Check SeqFeatures of a sequence. ... ok > Make sure SeqRecords from BioSQL implement the right interface. ... ok > Check that slices of sequences are retrieved properly. ... ok > > ====================================================================== > FAIL: Make sure can't import records with same ID (in one go). > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_BioSQL.py", line 397, in test_duplicate_id_load > err.__class__.__name__ + "\n" + str(err)) > AssertionError: Exception > Should have failed! > ... Also the error formatting wasn't quite what I had intended, fixed in CVS. However, most of the tests are allowing duplicates to be recorded without any error (on PostgreSQL). This is bad. > ====================================================================== > FAIL: Make sure can't reimport existing records. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_BioSQL.py", line 463, in test_reload > err.__class__.__name__ + "\n" + str(err)) > AssertionError: OperationalError > currval of sequence "bioentry_pk_seq" is not yet defined in this session Interestingly the final test gives us an OperationalError about the bioentry table's primary key (presumably from our last_id method which would call the SQL statement "select currval('bioentry_pk_seq')"). This suggests some clues about what is going wrong. 
http://www.postgresql.org/docs/8.3/static/functions-sequence.html
http://www.postgresql.org/docs/8.3/static/sql-createsequence.html

See also:
http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-pg.sql

CREATE SEQUENCE bioentry_pk_seq;
CREATE TABLE bioentry (
    bioentry_id INTEGER DEFAULT nextval ( 'bioentry_pk_seq' ) NOT NULL ,
    biodatabase_id INTEGER NOT NULL ,
    taxon_id INTEGER ,
    name VARCHAR ( 40 ) NOT NULL ,
    accession VARCHAR ( 128 ) NOT NULL ,
    identifier VARCHAR ( 40 ) ,
    division VARCHAR ( 6 ) ,
    description TEXT ,
    version INTEGER NOT NULL ,
    PRIMARY KEY ( bioentry_id ) ,
    UNIQUE ( accession , biodatabase_id , version ) ,
-- CONFIG: uncomment one (and only one) of the two lines below. The
-- first puts a uniqueness constraint on the identifier column alone;
-- the other one puts a uniqueness constraint on identifier only
-- within a namespace.
--  UNIQUE ( identifier )
    UNIQUE ( identifier , biodatabase_id )
) ;

CREATE INDEX bioentry_name ON bioentry ( name );
CREATE INDEX bioentry_db ON bioentry ( biodatabase_id );
CREATE INDEX bioentry_tax ON bioentry ( taxon_id );

I'm a little surprised all the other duplicate record tests show different behaviour. I have updated test_BioSQL.py to perform all these new duplicate tests on a clean database - which I probably should have done in the first place (CVS revision 1.35).

[All these tests are passing on MySQL. Trying the example by hand triggers an IntegrityError.]

Peter

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
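
[Editorial sketch] The OperationalError quoted in comment #10 fits PostgreSQL's documented rule that currval is only defined once nextval has been called for that sequence in the current session. The toy model below shows how that rule yields both symptoms seen in this bug: a stale id when an earlier insert succeeded in the same session, and an error when none did. The Sequence and Session classes are hypothetical and are not a database API, and the model assumes (as the traceback suggests, but the thread has not confirmed) that a skipped insert never evaluates the nextval default.

```python
class Sequence:
    """Shared counter, standing in for CREATE SEQUENCE bioentry_pk_seq."""

    def __init__(self, name):
        self.name = name
        self.value = 0


class Session:
    """One database session; remembers the values it drew itself."""

    def __init__(self):
        self._seen = {}

    def nextval(self, seq):
        seq.value += 1
        self._seen[seq.name] = seq.value
        return seq.value

    def currval(self, seq):
        if seq.name not in self._seen:
            raise LookupError(
                'currval of sequence "%s" is not yet defined in this session'
                % seq.name
            )
        return self._seen[seq.name]


seq = Sequence("bioentry_pk_seq")

# Session A: one real insert (nextval), then a skipped duplicate insert.
session_a = Session()
first_id = session_a.nextval(seq)
stale_id = session_a.currval(seq)  # still the id of the earlier record

# Session B: its only insert was skipped, so nextval was never called.
session_b = Session()
try:
    session_b.currval(seq)
    error_message = None
except LookupError as err:
    error_message = str(err)

print(first_id, stale_id)
print(error_message)
```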
From bugzilla-daemon at portal.open-bio.org Thu May 21 22:22:18 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 18:22:18 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905212222.n4LMMIls028194@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #11 from andrea at biodec.com 2009-05-21 18:22 EST ------- So the problem is related to the different behaviur adopted by postgres loaded with the biosql schema, with respect to mysql. Sorry because i thought the problem was due to BioSQL because i didn't know wich was the "expected database behaviour". Since we expect an error during insertion of a "duplicate" or "quite duplicate" record... we have only to focus on the postgres biosql schema, and why/where it differs from the mysql one. I didn't have time to have a look to the difference between the various "duplicate record tests". I will do. [i've tried postgres 8.4... and it's exactly the same] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From cy at cymon.org Thu May 21 22:52:39 2009 From: cy at cymon.org (Cymon Cox) Date: Thu, 21 May 2009 23:52:39 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <200905212222.n4LMMIls028194@portal.open-bio.org> References: <200905212222.n4LMMIls028194@portal.open-bio.org> Message-ID: <7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com> 2009/5/21 > http://bugzilla.open-bio.org/show_bug.cgi?id=2833 > > > > > > ------- Comment #11 from andrea at biodec.com 2009-05-21 18:22 EST ------- > So the problem is related to the different behaviur adopted by postgres > loaded > with the biosql schema, with respect to mysql. 
>
> Sorry because i thought the problem was due to BioSQL because i didn't know
> wich was the "expected database behaviour".
>
> Since we expect an error during insertion of a "duplicate" or "quite
> duplicate"
> record... we have only to focus on the postgres biosql schema, and
> why/where it
> differs from the mysql one.
>
> I didn't have time to have a look to the difference between the various
> "duplicate record tests". I will do.
>
> [i've tried postgres 8.4... and it's exactly the same]

Hi Andrea,

The problem appears to be related to the BioSQL schema/PostgreSQL.

As you indicated, adding a duplicate entry to bioentry returns a "INSERT 0 0" and doesn't throw an IntegrityError, which is what the code is looking for and presumably what MySQL throws.

The reason it doesn't throw an error is because of one (or both) of the RULES in the schema:
rule_bioentry_i1 and/or rule_bioentry_i2

If you delete these two rules, load the schema and try to do a duplicate entry:

mytest=# insert into bioentry(bioentry_id, biodatabase_id, name, accession, version) values (2, 1, 'blah1', 'test4', 1);
INSERT 0 1
mytest=# select * from bioentry;
 bioentry_id | biodatabase_id | taxon_id | name  | accession | identifier | division | description | version
-------------+----------------+----------+-------+-----------+------------+----------+-------------+---------
           2 |              1 |          | blah1 | test4     |            |          |             |       1
(1 row)

mytest=# insert into bioentry(bioentry_id, biodatabase_id, name, accession, version) values (2, 1, 'blah1', 'test4', 1);
ERROR:  duplicate key value violates unique constraint "bioentry_pkey"

we have an error rather than a "INSERT 0 0"

I'm going to assume that psycopg2 would pick up this error and throw an IntegrityError, but I haven't taken it any further to check.

Cheers, C.
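
[Editorial sketch] The second number in "INSERT 0 0" is the count of rows inserted, and DB-API cursors expose a comparable count as cursor.rowcount. While the rules remain in the schema, a loader could in principle check that count instead of waiting for an exception. The sketch below assumes SQLite's INSERT OR IGNORE as a stand-in for the rule; the DuplicateEntry exception and insert_bioentry helper are hypothetical names, and how psycopg2 reports the row count in the rule case has not been verified here.

```python
import sqlite3


class DuplicateEntry(Exception):
    """Hypothetical error raised when an insert affected no rows."""


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bioentry (
        bioentry_id INTEGER PRIMARY KEY,
        accession   TEXT NOT NULL,
        version     INTEGER NOT NULL,
        UNIQUE (accession, version)
    )
""")


def insert_bioentry(conn, accession, version):
    cur = conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession, version) VALUES (?, ?)",
        (accession, version),
    )
    if cur.rowcount == 0:
        # The silent "INSERT 0 0" case: turn it into an explicit failure.
        raise DuplicateEntry("%s.%d already present" % (accession, version))
    return cur.lastrowid


new_id = insert_bioentry(conn, "test4", 1)
try:
    insert_bioentry(conn, "test4", 1)
    outcome = "inserted twice"
except DuplicateEntry as err:
    outcome = "refused: %s" % err

print(new_id)
print(outcome)
```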
From hlapp at gmx.net Fri May 22 02:05:17 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 21 May 2009 22:05:17 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com> References: <200905212222.n4LMMIls028194@portal.open-bio.org> <7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com> Message-ID: <8C0BF1E3-15DF-4F89-AB57-7AE09B86BCCE@gmx.net> On May 21, 2009, at 6:52 PM, Cymon Cox wrote: > [...] > > Hi Andrea, > > The problem appears to be related to the BioSQL schema/PostGreSQL. > > As you indicated, adding a duplicate entry to bioentry returns a > "INSERT 0 > 0" and doesnt throw an IntegrityError which is what the code is > looking from > and presumably what MySQL throws. > > The reason it doesnt throw an error is because of one (or both) of > the RULES > in the schema: Indeed, I'd almost forgotten. The rules are there mostly as a remnant from earlier versions of PostgreSQL to support transactional loading the way bioperl-db (the object-relational mapping for BioPerl) is optimized. You probably don't need them anywhere else. -hilmar Bioperl-db is optimized such that entities that very likely don't exist yet in the database are attempted for insert right away. If the insert fails due to a unique key violation, the record is looked up (and then expected to be found). In Oracle and MySQL you can do this and the transaction remains healthy; i.e., you can commit the transaction later and all statements except those that failed will be committed. In PostgreSQL any failed statement dooms the entire transaction, and the only way out is a rollback. In this case, if you want the loading of one sequence record as one transaction, failing to insert a single feature record will doom the entire sequence load and you would need to start over with the sequence. 
To fix this, I wrote the rules, which in essence do do the lookups for PostgreSQL that the bioperl-db code would otherwise avoid, and on insert do nothing if the record is found, which results in zero rows affected when you would expect one (which is what bioperl-db cues off of and then triggers a lookup). The right way to do this meanwhile is to use nested transactions, which PostgreSQL supports since v8.0.x, but I haven't gotten around to implement support for that in Bioperl-db. -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bugzilla-daemon at portal.open-bio.org Fri May 22 03:56:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 May 2009 23:56:13 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905220356.n4M3uDfM021127@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #12 from cymon.cox at gmail.com 2009-05-21 23:56 EST ------- After deleting the RULES in the BioSQL schema, all the new unittests pass. (All the RULES can be deleted as they are all there to circumvent the problem in Bioperl-db described by Hilmar Lapp on the biopython-dev list: http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html See also the comment in the schema.) C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
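The nested-transaction approach Hilmar mentions earlier in this thread can be sketched with savepoints. This is a minimal illustration, not bioperl-db or Biopython code: it assumes Python's standard-library sqlite3 module, a hypothetical one-column term table, and a hypothetical helper named insert_or_lookup. SQLite does not doom a transaction after a failed statement the way PostgreSQL does, so the sketch shows the pattern only, not the failure it works around.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / SAVEPOINT / COMMIT statements below are issued exactly as written.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)"
)


def insert_or_lookup(conn, name):
    """Try the insert first; on a unique-key violation, undo only that
    statement and look up the existing row instead."""
    conn.execute("SAVEPOINT try_insert")
    try:
        cur = conn.execute("INSERT INTO term (name) VALUES (?)", (name,))
        conn.execute("RELEASE SAVEPOINT try_insert")
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # Roll back to the savepoint only; the outer transaction survives.
        conn.execute("ROLLBACK TO SAVEPOINT try_insert")
        conn.execute("RELEASE SAVEPOINT try_insert")
        row = conn.execute(
            "SELECT term_id FROM term WHERE name = ?", (name,)
        ).fetchone()
        return row[0]


conn.execute("BEGIN")
first_id = insert_or_lookup(conn, "gene")
second_id = insert_or_lookup(conn, "CDS")
repeat_id = insert_or_lookup(conn, "gene")  # duplicate: falls back to lookup
conn.execute("COMMIT")

term_count = conn.execute("SELECT COUNT(*) FROM term").fetchone()[0]
```

With this pattern the duplicate still raises an error at the database level, so a loader that expects an IntegrityError for duplicates keeps working, while the optimistic insert-then-lookup strategy remains usable inside one outer transaction.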
From bugzilla-daemon at portal.open-bio.org Fri May 22 08:41:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 04:41:39 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905220841.n4M8fd3w015716@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #13 from andrea at biodec.com 2009-05-22 04:41 EST ------- (In reply to comment #12) > After deleting the RULES in the BioSQL schema, all the new unittests pass. > > (All the RULES can be deleted as they are all there to circumvent the problem > in Bioperl-db described by Hilmar Lapp on the biopython-dev list: > > http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html > > See also the comment in the schema.) > > C. I've deleted the two rules, rule_bioentry_i1 rule_bioentry_i2 and then i run the tests: Make sure can't import records with same ID (in one go). ... ok Make sure can't import records with same ID (in steps). ... ok Make sure can't import records with same ID (in steps with commit). ... ok Make sure can't import a single record twice (in one go). ... ok Make sure can't import a single record twice (in steps). ... ok Make sure can't import a single record twice (in steps with commit). ... ok Make sure all records are correctly loaded. ... ok Make sure can't reimport existing records. ... ok Indepth check that SeqFeatures are transmitted through the db. ... ok Load SeqRecord objects into a BioSQL database. ... ok Get a list of all items in the database. ... ok Test retrieval of items using various ids. ... ok Check can add DBSeq objects together. ... ok Check can turn a DBSeq object into a Seq or MutableSeq. ... ok Make sure Seqs from BioSQL implement the right interface. ... ok Check SeqFeatures of a sequence. ... ok Make sure SeqRecords from BioSQL implement the right interface. ... 
ok Check that slices of sequences are retrieved properly. ... ok ---------------------------------------------------------------------- Ran 18 tests in 58.371s OK with python2.4, python2.5, psycopg, psycopg2. Everything seems to be ok. I don't know which other possible effects could be triggered by this deletion. But i think it should be inserted as soon as possible into the BioSQL Schema/PostGreSQL (updating also the Test BioSQL schema/PostGreSQL). After removing the rules i've run my own tests: ..... >>> ## LOAD INTO DB >>> db.load([s1]) 1 >>> db.load([s2]) 1 >>> db.load([s3]) Traceback (most recent call last): File "<stdin>", line 1, in ? File "../BioSQL/BioSeqDatabase.py", line 442, in load File "../BioSQL/Loader.py", line 50, in load_seqrecord File "../BioSQL/Loader.py", line 550, in _load_bioentry_table File "../BioSQL/BioSeqDatabase.py", line 301, in execute IntegrityError: duplicate key value violates unique constraint "bioentry_accession_key" And i've got the error, that is what it is expected as a normal behaviour. So now i've only to trap the exception or pre-check duplications. Many Thanks Andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 22 12:06:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 08:06:36 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905221206.n4MC6aWo000368@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-22 08:06 EST ------- (In reply to comment #13) > (In reply to comment #12) > > After deleting the RULES in the BioSQL schema, all the new unittests pass.
> > > > (All the RULES can be deleted as they are all there to circumvent the > > problem in Bioperl-db described by Hilmar Lapp on the biopython-dev list: > > > > http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html > > > > See also the comment in the schema.) > > > > C. Well spotted Cymon - I'd missed that. > I've deleted the two rules, > rule_bioentry_i1 > rule_bioentry_i2 > > ... > with pythhon2.4, python2.5, psycopg, psycopg2. > Everything seems to be ok. > ... > After removing the rules i've run my own tests: > ..... > >>> ## LOAD INTO DB > >>> db.load([s1]) > 1 > >>> db.load([s2]) > 1 > >>> db.load([s3]) > Traceback (most recent call last): > File "", line 1, in ? > File "../BioSQL/BioSeqDatabase.py", line 442, in load > File "../BioSQL/Loader.py", line 50, in load_seqrecord > File "../BioSQL/Loader.py", line 550, in _load_bioentry_table > File "../BioSQL/BioSeqDatabase.py", line 301, in execute > IntegrityError: duplicate key value violates unique constraint > "bioentry_accession_key" > > And i've got the error, that is what it is expected as a normal behaviour. > So now i've only to trap the exception or pre-check duplications. Great. It will be down to BioSQL to change the schema (in conjunction with BioPerl), but Hilmar seems to be looking into this: http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html I suppose in the short term we could change our local copy of the schema used in the Biopython unit tests... Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Fri May 22 12:27:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 13:27:06 +0100 Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema Message-ID: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> Hi all, This is a continuation of a thread / bug report from Biopython (Bug 2833) where attempting to import duplicate entries into BioSQL did not raise an error on PostgreSQL (but does on MySQL). Cymon traced this to the RULES present in the schema to help bioperl-db. On Fri, May 22, 2009 at 3:05 AM, Hilmar Lapp wrote: > > On May 21, 2009, at 6:52 PM, Cymon Cox wrote: > >> [...] >> >> Hi Andrea, >> >> The problem appears to be related to the BioSQL schema/PostGreSQL. >> >> As you indicated, adding a duplicate entry to bioentry returns a "INSERT 0 >> 0" and doesnt throw an IntegrityError which is what the code is looking >> from and presumably what MySQL throws. >> >> The reason it doesnt throw an error is because of one (or both) of the >> RULES in the schema: > > Indeed, I'd almost forgotten. The rules are there mostly as a remnant from > earlier versions of PostgreSQL to support transactional loading the way > bioperl-db (the object-relational mapping for BioPerl) is optimized. You > probably don't need them anywhere else. > > -hilmar > > > Bioperl-db is optimized such that entities that very likely don't exist yet > in the database are attempted for insert right away. If the insert fails due > to a unique key violation, the record is looked up (and then expected to be > found). In Oracle and MySQL you can do this and the transaction remains > healthy; i.e., you can commit the transaction later and all statements > except those that failed will be committed. In PostgreSQL any failed > statement dooms the entire transaction, and the only way out is a rollback.
> In this case, if you want the loading of one sequence record as one > transaction, failing to insert a single feature record will doom the entire > sequence load and you would need to start over with the sequence. To fix > this, I wrote the rules, which in essence do do the lookups for PostgreSQL > that the bioperl-db code would otherwise avoid, and on insert do nothing if > the record is found, which results in zero rows affected when you would > expect one (which is what bioperl-db cues off of and then triggers a > lookup). > The right way to do this meanwhile is to use nested transactions, which > PostgreSQL supports since v8.0.x, but I haven't gotten around to implement > support for that in Bioperl-db. > Hilmar, It seems for Biopython to work properly with BioSQL on PostgreSQL these bioentry rules should be removed from the schema (as the comments in the schema do suggest). Obviously doing this would break any installation also using the current version of bioperl-db. Do the RULES affect BioJava or BioRuby using BioSQL on PostgreSQL? Are you happy to remove these RULES in BioSQL v1.0.x (after making the outlined transactional changes in bioperl-db)? Thanks, Peter From hlapp at gmx.net Fri May 22 15:03:11 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 22 May 2009 11:03:11 -0400 Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> Message-ID: On May 22, 2009, at 8:27 AM, Peter wrote: > Are you happy to remove these RULES in BioSQL v1.0.x (after > making the outlined transactional changes in bioperl-db)? In principle yes. It would also mean dropping support for PostgreSQL v7.x, but I would hope that that's a non-issue. But if anyone here is still using and relying on PostgreSQL v7.x (or earlier?) do let us know, please. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Fri May 22 15:57:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 16:57:38 +0100 Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> Message-ID: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp wrote: > > On May 22, 2009, at 8:27 AM, Peter wrote: > >> Are you happy to remove these RULES in BioSQL v1.0.x (after >> making the outlined transactional changes in bioperl-db)? > > In principle yes. It would also mean dropping support for PostgreSQL v7.x, > but I would hope that that's a non-issue. > > But if anyone here is still using and relying on PostgreSQL v7.x (or > earlier?) do let us know, please. Great. In the meantime could you add a big warning about this issue to the INSTALL notes for PostgreSQL (i.e. recommend removing the RULES section if not using bioperl-db)? http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL Peter From biopython at maubp.freeserve.co.uk Fri May 22 16:06:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 17:06:21 +0100 Subject: [Biopython-dev] Peter at a conference next week Message-ID: <320fb6e00905220906l2446afbfk9804599db74a4d66@mail.gmail.com> Hi all, Just to let you know I will be at a conference next week, so don't expect (Biopython) email replies as promptly as usual.
I may even leave my laptop at home ;) Peter From hlapp at gmx.net Fri May 22 18:20:58 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 22 May 2009 14:20:58 -0400 Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> Message-ID: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar On May 22, 2009, at 11:57 AM, Peter wrote: > On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp wrote: >> >> On May 22, 2009, at 8:27 AM, Peter wrote: >> >>> Are you happy to remove these RULES in BioSQL v1.0.x (after >>> making the outlined transactional changes in bioperl-db)? >> >> In principle yes. It would also mean dropping support for >> PostgreSQL v7.x, >> but I would hope that that's a non-issue. >> >> But if anyone here is still using and relying on PostgreSQL v7.x (or >> earlier?) do let us know, please. > > Great. > > In the meantime could you add a big warning about this issue to the > INSTALL notes for PostgreSQL (i.e. recommend removing the RULES > section if not using bioper-db)? 
> http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL > > Peter -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bugzilla-daemon at portal.open-bio.org Fri May 22 18:37:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 14:37:21 -0400 Subject: [Biopython-dev] [Bug 2837] New: Reading Roche 454 SFF sequence read files in Bio.SeqIO Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2837 Summary: Reading Roche 454 SFF sequence read files in Bio.SeqIO Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Roche 454 sequencing returns the read data in SFF files, a documented binary format, capturing the sequence letters and qualities together with trimming information. It would be nice to support reading (and in the longer term also writing) these files directly with Bio.SeqIO. See this thread for background: http://lists.open-bio.org/pipermail/biopython/2009-April/005083.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri May 22 18:39:26 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 14:39:26 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200905221839.n4MIdQU5008555@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-22 14:39 EST ------- Created an attachment (id=1303) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1303&action=view) Bio/SeqIO/RocheSffIO.py This is a rough SeqIO parser constructing SeqRecord objects using a parser contributed by Jose Blanca. Additional work would be required for paired end reads - and even more work to be able to write out these files. Potentially Jose's parser could be exposed as a public module under Bio.Sequencing, but here it is just two private classes. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri May 22 18:40:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 19:40:45 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> Message-ID: <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> On Fri, Apr 17, 2009 at 12:08 PM, Peter wrote: > On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca wrote: >> Hi Peter: >> Here you have some code to read the sff files. > > Thanks - I'm not sure when I'll get to look at this, maybe next week.
> >> For the time being it creates a dict for the sequences. I'm not sure about >> how to integrate the generated data in BioPython. The sequence and >> qualities should go to a SeqRecord, but there is also the information >> about the clipping. > > For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to > be able to read and write SFF files, and to do that we'll have to record all > the essential annotation (i.e. clipping) somehow. I've had a look at your code this evening, and written a rough SeqIO module using it, available here on enhancement Bug 2837, http://bugzilla.open-bio.org/show_bug.cgi?id=2837 > Can you write SFF files? > >> For my work I use a kind of SeqRecord with a mask property and the >> mask is a Location that shows which part of the sequence is ok. I don't >> know if that's a valid model for BioPython. > > A mask could be done as a list of booleans, and we can treat it as > another per-letter-annotation in the SeqRecord. I'm not sure if this > is helpful or not. > > The Roche tools let you choose to extract trimmed reads as FASTA > and QUAL, or untrimmed. Perhaps for reading SFF files with > Bio.SeqIO we should get the user to choose between these > options (e.g. format names "roche-sff" and "roche-sff-notrim")? This would work... > Roche's FASTA files use upper case for the trimmed region, and > lower case for the start/end which would get trimmed off. This is > simple and we could do this for Biopython too - meaning you'd get > the same data if you read the SFF file directly, or used Roche's > FASTA+QUAL files with SeqIO. Note that when reading an SFF > file directly, we should probably record the real trim data as well. In my current code, I decided to use the same quality trimming representation that Roche use if converting the SFF file into FASTA format (the leading and trailing trim regions are in lower case). We may want to record the trim positions in the SeqRecord's annotation as well.
>> There's also a couple of more tricks with the clipping. >> In theory there's clip_qual and clip_adapter, but in the files >> we've seen clip_adapter is always zero and clip_quality is used >> instead for both quality and adapter. I think we could generate >> one clipping combining both. Let me know what do you think. >> Also take into account that in some cases the generated clipping >> from the 454 software are just wrong. > > I'll need to learn more about the details before coming to any > conclusions about how to deal with this information in Biopython. Right now I have not looked at the left/right adaptor clipping information, as you found, in the example file I have looked at these fields are zero. Note I will be away for the next week, so am unlikely to respond to any emails on this. Peter From bugzilla-daemon at portal.open-bio.org Fri May 22 19:23:44 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 15:23:44 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200905221923.n4MJNiAe013574@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 spenthil at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |spenthil at gmail.com -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
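The trimming representation discussed in the sff reader message above (kept region in upper case, trimmed ends in lower case, with quality and adapter clips merged into one) can be sketched in a few lines. This is a hedged illustration, not the code attached to Bug 2837: the function names combined_clip and mask_by_case are hypothetical, the read is made up, and the sketch assumes the SFF convention that clip positions are 1-based and that a value of zero means "not set".

```python
def combined_clip(n_bases, clip_qual_left, clip_qual_right,
                  clip_adapter_left, clip_adapter_right):
    """Merge quality and adapter clips into one 0-based, half-open slice.

    Assumes 1-based clip positions where zero means "no clip given".
    """
    first = max(1, clip_qual_left, clip_adapter_left)
    last = min(clip_qual_right or n_bases, clip_adapter_right or n_bases)
    return first - 1, last


def mask_by_case(seq, start, end):
    """Upper case for the kept region, lower case for the trimmed ends."""
    return seq[:start].lower() + seq[start:end].upper() + seq[end:].lower()


# Made-up 20 base read: 4 base key, 12 good bases, 4 poor bases.
read = "TCAGGATTACAGGCTTNNNN"
start, end = combined_clip(len(read), 5, 16, 0, 0)
masked = mask_by_case(read, start, end)
```

Keeping the full-length sequence and encoding the clip in the letter case means the per-base qualities stay aligned with the sequence, and slicing with (start, end) recovers the trimmed read when needed.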
From bugzilla-daemon at portal.open-bio.org Fri May 22 21:16:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 17:16:07 -0400 Subject: [Biopython-dev] [Bug 2838] New: If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2838 Summary: If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database Product: Biopython Version: 1.49 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk I've been trying to annotate some microbial sequences; some are from genbank. So the proposed series of events was: 1) get sequences from genbank 2) store in BioSQL database called One 3) recover them from BioSql 4) annotate the recovered SeqRecords [this works, but isn't necessary for this problem to be reproduced - here, I'm making no changes at all to the SeqRecord] 5) store the annotated SeqRecords in a different BioSQL database called Two. The problem is that Step 5 fails when the original record was recovered from Genbank. The traceback (below) indicates a problem with the BioSQL loader in _load_bioentry_date Here is the screen output, including traceback. The program (attached) first loads a record from Genbank, writes it to One, recovers it from One; at this point it has changed, in particular in the way date fields are represented. the entrez load has a /date feature which is not a list /date=26-MAY-2005 while the reloaded version has two date fields /dates=['26-MAY-2005'] /date=['26-MAY-2005'] Whether this is relevant I'm not sure. The subsequent write of the recovered version to Two fails. As a control, I've checked that the original version can be written to Two successfully. 
I'm a novice with Python and Biopython so please accept my apologies if there is something obvious and very stupid responsible for this. --------------------------------------------------------------------------- dwyllie at dwyllie:~/programs/Project/src$ python dbtestcase.py OK, going to recover record 28804743 from genbank.... Record loaded looks like this: ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Load from Entrez completed, records= 1 Here is the loaded record: ======================================================================== ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. 
Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Now loading these records into a BioSQL database One. /var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet Creating a new database One ======================================================================== Load from database One completed, records= 1 ======================================================================== Here is the record recovered from database One: ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. 
Number of features: 5 /dates=['26-MAY-2005'] /ncbi_taxid=3225 /date=['26-MAY-2005'] /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon', 'Ceratodon purpureus'] /source=['chloroplast Ceratodon purpureus'] /references=[, , , ] /gi=28804743 /data_file_division=PLN /keywords=[''] /organism=Ceratodon purpureus /sequence_version=['1'] /accessions=['AB098727'] DBSeq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', DNAAlphabet()) ======================================================================== Creating a new database Two Traceback (most recent call last): File "dbtestcase.py", line 206, in from dbtestcase import AuthDetails File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 225, in DemonstrateProblem(problemgi,ad) File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 199, in DemonstrateProblem db2.load(listtoload) File "/var/lib/python-support/python2.6/BioSQL/BioSeqDatabase.py", line 430, in load db_loader.load_seqrecord(cur_record) File "/var/lib/python-support/python2.6/BioSQL/Loader.py", line 50, in load_seqrecord self._load_bioentry_date(record, bioentry_id) File "/var/lib/python-support/python2.6/BioSQL/Loader.py", line 577, in _load_bioentry_date self.adaptor.execute(sql, (bioentry_id, date_id, date)) File "/var/lib/python-support/python2.6/BioSQL/BioSeqDatabase.py", line 289, in execute self.cursor.execute(sql, args or ()) File "/var/lib/python-support/python2.6/MySQLdb/cursors.py", line 166, in execute self.errorhandler(self, exc, value) File "/var/lib/python-support/python2.6/MySQLdb/connections.py", line 35, in defaulterrorhandler raise errorclass, errorvalue _mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 1)' at line 1") dwyllie at dwyllie:~/programs/Project/src$ -- Configure bugmail: 
http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 22 21:19:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 17:19:03 -0400 Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database In-Reply-To: Message-ID: <200905222119.n4MLJ3d3026350@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2838 ------- Comment #1 from david.wyllie at ndm.ox.ac.uk 2009-05-22 17:19 EST ------- Created an attachment (id=1304) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1304&action=view) A python script which reproduces the error. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri May 22 22:46:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 18:46:04 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200905222246.n4MMk4QO000548@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2839 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Fri May 22 22:46:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 23:46:54 +0100 Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> Message-ID: <320fb6e00905221546i26edc7a2u2a02fb0d01c374ea@mail.gmail.com> On 5/22/09, Hilmar Lapp wrote: > Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar I've filed Bug 2839, hopefully this is what you had in mind: http://bugzilla.open-bio.org/show_bug.cgi?id=2839 Peter From chapmanb at 50mail.com Fri May 22 22:54:32 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 22 May 2009 18:54:32 -0400 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> Message-ID: <20090522225432.GU84112@sobchak.mgh.harvard.edu> Peter and Jose; I haven't used SFF files myself as we don't have a 454 machine, but do know of a couple of implementations of SFF TO Fastq/Fasta. Flower is a Haskell implementation: http://blog.malde.org/index.php/flower/ And PyroBayes is a 454 base caller: http://bioinformatics.bc.edu/marthlab/PyroBayes Depending on what you all end up doing, these might be useful as comparison points, or for wrapping with Application command lines. Brad > On Fri, Apr 17, 2009 at 12:08 PM, Peter wrote: > > On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca wrote: > >> Hi Peter: > >> Here you have some code to read the sff files. 
> > > > Thanks - I'm not sure when I'll get to look at this, maybe next week. > > > >> For the time being it creates a dict for the sequences. I'm not sure about > >> how to integrate the generated data in BioPython. The sequence and > >> qualities should go to a SeqRecord, but there is also the information > >> about the clipping. > > > > For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to > > be able to read and write SFF files, and to do that we'll have to record all > > the essential annotation (i.e. clipping) somehow. > > I've had a look at your code this evening, and written a rough SeqIO > module using it, available here on enhancement Bug 2837, > http://bugzilla.open-bio.org/show_bug.cgi?id=2837 > > > Can you write SFF files? > > > >> For my work I use a kind of SeqRecord with a mask property and the > >> mask is a Location that shows which part of the sequence is ok. I don't > >> know if that's a valid model for BioPython. > > > > A mask could be done as a list of booleans, and we can treat it as > > another per-letter-annotation in the SeqRecord. I'm not sure if this > > is helpful or not. > > > > The Roche tools let you choose to extract trimmed reads as FASTA > > and QUAL, or untrimmed. Perhaps for reading SFF files with > > Bio.SeqIO we should get the user to choose between these > > options (e.g. format names "roche-sff" and "roche-sff-notrim")? > > This would work... > > > Roche's FASTA files use upper case for the trimmed region, and > > lower case for the start/end which would get trimmed off. This is > > simple and we could do this for Biopython too - meaning you'd get > > the same data if you read the SFF file directly, or used Roche's > > FASTA+QUAL files with SeqIO. Note that when reading an SFF > > file directly, we should probably record the real trim data as well.
> > In my current code, I decided to use the same quality trimming > representation that Roche use if converting the SFF file into FASTA > format (the leading and trailing trim regions are in lower case). We > may want to record the trim positions in the SeqRecord's annotation > as well. > > >> There's also a couple of more tricks with the clipping. > >> In theory there's clip_qual and clip_adapter, but in the files > >> we've seen clip_adapter is always zero and clip_quality is used > >> instead for both quality and adapter. I think we could generate > >> one clipping combining both. Let me know what do you think. > >> Also take into account that in some cases the generated clipping > >> from the 454 software are just wrong. > > > > I'll need to learn more about the details before coming to any > > conclusions about how to deal with this information in Biopython. > > Right now I have not looked at the left/right adaptor clipping information, > as you found, in the example file I have looked at these fields are zero. > > Note I will be away for the next week, so am unlikely to respond to > any emails on this. > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From bugzilla-daemon at portal.open-bio.org Fri May 22 22:58:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 18:58:24 -0400 Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database In-Reply-To: Message-ID: <200905222258.n4MMwOXA001311@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2838 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-22 18:58 EST ------- (In reply to comment #0) > I've been trying to annotate some microbial sequences; some are from genbank. 
> So the proposed series of events was: > 1) get sequences from genbank > 2) store in BioSQL database called One > 3) recover them from BioSql > 4) annotate the recovered SeqRecords [this works, but isn't > necessary for this problem to be reproduced - here, I'm > making no changes at all to the SeqRecord] > 5) store the annotated SeqRecords in a different BioSQL database called Two. > > The problem is that Step 5 fails when the original record was recovered from > Genbank. > > The traceback (below) indicates a problem with the BioSQL loader in > _load_bioentry_date > ... > I'm a novice with Python and Biopython so please accept my apologies if > there is something obvious and very stupid responsible for this. What you are trying to do sounds very reasonable (although I have never actually needed to or tried to do this myself). You were right about the date thing, the loader code only expected a string, not a list. Fixed in CVS revision 1.40 of BioSQL/Loader.py, and I have also added a unit test for this use case in Tests/test_BioSQL.py revision 1.36. Note there is a known minor discrepancy with dates (see Bug 2681) when comparing the original SeqRecord to the DBSeqRecord after loading/retrieving from BioSQL. If you could confirm this solves your problem, I think we can close this bug. Thank you! Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
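The date handling described in comment #2 above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in BioSQL/Loader.py: `normalise_date` is a hypothetical helper name, and a plain dict stands in for the SeqRecord's annotations.

```python
from time import gmtime, strftime


def normalise_date(annotations):
    """Return a single GenBank-style date string such as 14-SEP-2000.

    `annotations` is a plain dict standing in for SeqRecord.annotations
    (an assumption made to keep this sketch self-contained).
    """
    # Fall back to today's date (UTC) when the record carries no date.
    default = strftime("%d-%b-%Y", gmtime()).upper()
    date = annotations.get("date", default)
    # A record that has been through BioSQL may hold the date as a list
    # of strings rather than a string; keep only the first entry.
    if isinstance(date, list):
        date = date[0]
    return date


# The same value comes back whether the date was a string or a list.
print(normalise_date({"date": "26-MAY-2005"}))
print(normalise_date({"date": ["26-MAY-2005"]}))
```

Note that an empty list would still raise an IndexError in this sketch; whether that case can occur in practice is not something the thread settles.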
From biopython at maubp.freeserve.co.uk Fri May 22 23:09:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 23 May 2009 00:09:56 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <20090522225432.GU84112@sobchak.mgh.harvard.edu> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> On 5/22/09, Brad Chapman wrote: > Peter and Jose; > I haven't used SFF files myself as we don't have a 454 machine, We don't have one in house either, and have instead out-sourced to a couple of sequencing centres in the UK with 454 machines. > but do know of a couple of implementations of SFF TO > Fastq/Fasta. > Flower is a Haskell implementation: > > http://blog.malde.org/index.php/flower/ > > And PyroBayes is a 454 base caller: > > http://bioinformatics.bc.edu/marthlab/PyroBayes > > Depending on what you all end up doing, these might be useful as > comparison points, or for wrapping with Application command lines. I would say Roche's own tools are the best reference, but these only output FASTA and QUAL, not FASTQ files (at the moment at least). So yes, being able to compare a Biopython SFF to FASTQ conversion with that by Flower (or anything else) would be handy.
Peter From spenthil at gmail.com Fri May 22 23:52:30 2009 From: spenthil at gmail.com (Senthil Palanisami) Date: Fri, 22 May 2009 16:52:30 -0700 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> Message-ID: <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> I have been working with SFF files for the past month, and can say it's definitely frustrating working with custom binary formats. Take a look at sff_extract which is written in python. It converts sff files into fasta and xml or caf files: http://bioinf.comav.upv.es/sff_extract/index.html You can find detailed specs of the format @ http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global -- Senthil Palanisami http://spenthil.com On Fri, May 22, 2009 at 4:09 PM, Peter wrote: > On 5/22/09, Brad Chapman wrote: > > Peter and Jose; > > I haven't used SFF files myself as we don't have a 454 machine, > > We don't have one in house either, and have instead out-sourced to a > couple of sequencing centres in the UK with 454 machines. > > > but do know of a couple of implementations of SFF TO > > Fastq/Fasta. > > Flower is a Haskell implementation: > > > > http://blog.malde.org/index.php/flower/ > > > > And PyroBayes is a 454 base caller: > > > > http://bioinformatics.bc.edu/marthlab/PyroBayes > > > > Depending on what you all end up doing, these might be useful as > > comparison points, or for wrapping with Application command lines. 
> > I would say Roche's own tools are the best reference, but these only > output FASTA and QUAL, not FASTQ files (at the moment at least). So > yes, being able to compare a Biopython SFF to FASTQ conversion with > that by Flower (or anything else) would be handy. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Sat May 23 00:10:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 23 May 2009 01:10:57 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> Message-ID: <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> On 5/23/09, Senthil Palanisami wrote: > I have been working with SFF files for the past month, and can say it's > definitely frustrating working with custom binary formats. At least in this case it is publicly documented. Have you needed to write out (or edit) an SFF file yet? Have you used any paired end reads in SFF format? > Take a look at sff_extract which is written in python. It converts sff files > into fasta and xml or caf files: > http://bioinf.comav.upv.es/sff_extract/index.html That is what this code is based on - Jose Blanca is one of the authors of sff_extract. 
> You can find detailed specs of the format @ > http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global I think you must have missed this thread last month ;) http://lists.open-bio.org/pipermail/biopython/2009-April/005084.html Peter From bugzilla-daemon at portal.open-bio.org Sat May 23 01:16:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 May 2009 21:16:54 -0400 Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database In-Reply-To: Message-ID: <200905230116.n4N1GsRl010917@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2838 ------- Comment #3 from david.wyllie at ndm.ox.ac.uk 2009-05-22 21:16 EST ------- Thank you! Unfortunately I'm not sure it's fixed, or maybe there is another problem: I have uninstalled the BioPython package using Synaptic package manager (previously I was using 1.49), downloaded from cvs checkout. Thanks for your message http://osdir.com/ml/python.bio.general/2008-07/msg00035.html I can confirm that the default ubuntu 9.0 install lacks the python-dev package, with the necessary Python.h headers. After python-dev is installed, build is OK, Tests pass running test test_Ace ... ok test_AlignIO ... ok test_BioSQL ... /var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet /home/dwyllie/biopython/build/lib.linux-x86_64-2.6/BioSQL/BioSeqDatabase.py:144: Warning: 'TYPE=storage_engine' is deprecated; use 'ENGINE=storage_engine' instead self.adaptor.cursor.execute(sql_line) ok test_BioSQL_SeqIO ... ok test_CAPS ... ok test_Clustalw ... ok .. and install is OK too. This is all new to me but it seems to work OK. 
I have checked the source code and I think your modification is correctly in place

I think I have your patch in place:

    def _load_bioentry_date(self, record, bioentry_id):
        """Add the effective date of the entry into the database.

        record - a SeqRecord object with an annotated date
        bioentry_id - corresponding database identifier
        """
        # dates are GenBank style, like:
        # 14-SEP-2000
        date = record.annotations.get("date",
                                      strftime("%d-%b-%Y", gmtime()).upper())
        if isinstance(date, list) : date = date[0]
        annotation_tags_id = self._get_ontology_id("Annotation Tags")
        date_id = self._get_term_id("date_changed", annotation_tags_id)
        sql = r"INSERT INTO bioentry_qualifier_value" \
              r" (bioentry_id, term_id, value, rank)" \
              r" VALUES (%s, %s, %s, 1)"
        self.adaptor.execute(sql, (bioentry_id, date_id, date))

Now when I re-run dbtestcase.py (attached previously) I get a different error message.

dwyllie at dwyllie:~/programs/CheckleyProject/src$ python dbtestcase.py
OK, going to recover record 28804743 from genbank....
Record loaded looks like this:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Load from Entrez completed, records= 1 Here is the loaded record: ======================================================================== ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Now loading these records into a BioSQL database One. 
/var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet Creating a new database One ======================================================================== Load from database One completed, records= 1 ======================================================================== Here is the record recovered from database One: Traceback (most recent call last): File "dbtestcase.py", line 165, in from dbtestcase import AuthDetails File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 182, in DemonstrateProblem(problemgi,ad) File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 138, in DemonstrateProblem print recordrecovered File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 489, in __str__ if self.letter_annotations : File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 165, in fget=lambda self : self._per_letter_annotations, AttributeError: 'DBSeqRecord' object has no attribute '_per_letter_annotations' dwyllie at dwyllie:~/programs/CheckleyProject/src$ Have I failed to install something? Unfortunately, I wasn't running off CVS before your change. Best wishes d -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
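The AttributeError in the traceback above is the kind of failure that can occur when a property reads a private attribute that a subclass never creates. The following is a minimal sketch of that pattern only. It assumes, without confirmation from this thread, that the subclass does not run its parent's initializer; `Record` and `LazyRecord` are hypothetical stand-ins, not Biopython's SeqRecord or DBSeqRecord.

```python
class Record(object):
    """Hypothetical stand-in for a record class with a private attribute."""

    def __init__(self):
        self._per_letter_annotations = {}

    # Read-only property backed by the private attribute set in __init__.
    letter_annotations = property(
        fget=lambda self: self._per_letter_annotations)


class LazyRecord(Record):
    """Hypothetical subclass that skips Record.__init__."""

    def __init__(self, fix=False):
        # Record.__init__ is deliberately not called here, so the private
        # attribute is missing unless this class creates it itself.
        if fix:
            self._per_letter_annotations = {}


# Without the attribute, reading the property raises AttributeError.
try:
    LazyRecord().letter_annotations
except AttributeError as err:
    print("AttributeError: %s" % err)

# Giving the subclass an empty dictionary makes the property usable.
print(LazyRecord(fix=True).letter_annotations)
```

In this sketch the remedy is simply to give the subclass its own empty dictionary; other designs (calling the parent initializer, for instance) would also work.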
From spenthil at gmail.com Sat May 23 01:48:24 2009 From: spenthil at gmail.com (Senthil Palanisami) Date: Fri, 22 May 2009 18:48:24 -0700 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> Message-ID: <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> Sorry, I only recently joined this list - should have gone through the archives first. I have done some minimal SFF tweaking, but only by first converting them to CA format. No paired end reads yet, but I do know my PI wants me to start looking at some in the next month or two. -- Senthil Palanisami http://spenthil.com On Fri, May 22, 2009 at 5:10 PM, Peter wrote: > On 5/23/09, Senthil Palanisami wrote: > > I have been working with SFF files for the past month, and can say it's > > definitely frustrating working with custom binary formats. > > At least in this case it is publicly documented. Have you needed to > write out (or edit) an SFF file yet? Have you used any paired end > reads in SFF format? > > > Take a look at sff_extract which is written in python. It converts sff > files > > into fasta and xml or caf files: > > http://bioinf.comav.upv.es/sff_extract/index.html > > That is what this code is based on - Jose Blanca is one of the authors > of sff_extract. 
> > > You can find detailed specs of the format @ > > > http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global > > I think you must have missed this thread last month ;) > http://lists.open-bio.org/pipermail/biopython/2009-April/005084.html > > Peter > From biopython at maubp.freeserve.co.uk Sat May 23 11:28:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 23 May 2009 12:28:36 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> Message-ID: <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> On Sat, May 23, 2009 at 2:48 AM, Senthil Palanisami wrote: > Sorry, I only recently joined this list - should have gone through the > archives first. Don't worry - and if I sounded grumpy, sorry - I was up late last night. > I have done some minimal SFF tweaking, but only by first converting them > to CA format. What do you mean by CA format? I don't recall seeing that abbreviation before. > No paired end reads yet, but I do know my PI wants me to start looking > at some in the next month or two. I haven't had any paired end 454 reads to work with personally, but I'm sure there are some examples available online somewhere. 
Peter From bugzilla-daemon at portal.open-bio.org Sat May 23 11:49:18 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 23 May 2009 07:49:18 -0400 Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank information is read from BioSQL, it cannot be written to another BioSQL database In-Reply-To: Message-ID: <200905231149.n4NBnIEQ023192@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2838 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-23 07:49 EST ------- (In reply to comment #3) > Thank you! > > Unfortunately I'm not sure it's fixed, or maybe there is another problem: > ... > Now when I re-run dbtestcase.py (attached previously) I get a different error > message. > ... > Traceback (most recent call last): > ... > File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 489, in > __str__ > if self.letter_annotations : > File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 165, in > > fget=lambda self : self._per_letter_annotations, > AttributeError: 'DBSeqRecord' object has no attribute '_per_letter_annotations' > dwyllie at dwyllie:~/programs/CheckleyProject/src$ > > > Have I failed to install something? No - everything looks OK, and the deprecation warnings are known about and not in Biopython anyway. > Unfortunately, I wasn't running off CVS before your change. The original problem is fixed. However, you've found a new bug in the __str__ method for the DBSeqRecord related to the fact there is no per-letter-annotation (this would have been introduced in Biopython 1.50 when I added the letter_annotations dictionary to the SeqRecord class). 
I'm a little surprised that our unit tests didn't catch this - but it's fixed now: Tests/test_BioSQL.py CVS revision 1.37 BioSQL/BioSeq.py CVS revision 1.36 Note BioSQL doesn't yet support recording anything more complicated than strings, although we've started talking about using XML or JSON for this. As a result, Biopython does not attempt to record any per-letter-annotation in the BioSQL database. With the fix the DBSeqRecord now has an empty per-letter-annotation dictionary. Before it didn't, hence the AttributeError. Hopefully you won't find any more issues, but if you do, please file another bug - I'm marking this one as fixed. Thanks for your report and time David, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From spenthil at gmail.com Sat May 23 16:11:22 2009 From: spenthil at gmail.com (Senthil Palanisami) Date: Sat, 23 May 2009 09:11:22 -0700 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <200904171246.46568.jblanca@btc.upv.es> <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> Message-ID: <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> You didn't sound particularly grumpy, I am just aware of the annoyances related to people too lazy to do a quick search through a mailing list before spamming.
I pulled 'CA' straight out of a wgs assembler program: http://apps.sourceforge.net/mediawiki/wgs-assembler/index.php?title=Formatting_Inputs#sffToCA I think 'frg' is the real file format name. -- Senthil Palanisami http://spenthil.com On Sat, May 23, 2009 at 4:28 AM, Peter wrote: > On Sat, May 23, 2009 at 2:48 AM, Senthil Palanisami > wrote: > > Sorry, I only recently joined this list - should have gone through the > > archives first. > > Don't worry - and if I sounded grumpy, sorry - I was up late last night. > > > I have done some minimal SFF tweaking, but only by first converting them > > to CA format. > > What do you mean by CA format? I don't recall seeing that abbreviation > before. > > > No paired end reads yet, but I do know my PI wants me to start looking > > at some in the next month or two. > > I haven't had any paired end 454 reads to work with personally, but I'm > sure there are some examples available online somewhere. > > Peter > From mjldehoon at yahoo.com Sun May 24 04:10:28 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 23 May 2009 21:10:28 -0700 (PDT) Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL Message-ID: <867081.50034.qm@web62404.mail.re1.yahoo.com> I suggest that for the short term, we store the DE lines as one string in the same way as Bioperl 1.5 and 1.6, until we decide on a more advanced way to treat these lines. Currently Bio.SeqIO and Bio.SwissProt use different ways to handle the DE lines, and neither of them agrees with Bioperl. --Michiel. --- On Mon, 5/18/09, Peter wrote: > From: Peter > Subject: Re: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL > To: "Hilmar Lapp" > Cc: "Chris Fields" , "BioPerl List" , "biosql-l" , biopython-dev at biopython.org > Date: Monday, May 18, 2009, 9:38 AM > On Sun, May 17, 2009 at 4:21 PM, > Hilmar Lapp > wrote: > > > > On May 17, 2009, at 8:40 AM, Peter wrote: > >> > >> [...] 
Here you have mapped RecName and AltName > fields in the DE lines to > >> Name and Synonyms (shouldn't that be Synonym > singular?). > > > > The example is for the GN lines in SwissProt, not the > DE lines. > > Ah, that probably explains some of my confusion. > > >> In this example, searching the database using one > of the SwissProt > >> AltNames (synonyms), or filtering on the Flags > sounds like a > >> reasonable request - but this would be very > difficult if the data is > >> stored inside XML strings. > > > > Actually no. Modern full-text indexers (inside or > outside the database) can > > index XML text columns right away and very well. In > fact, for the last > > project that I built a full-text search for (on top of > a BioSQL database) I > > did that by writing custom XML documents to a separate > table for each > > record I wanted indexed. Oracle's full text indexer > did the rest. I also built a > > separate identifier/name/accession index that pulled > all the gene names, > > symbols, accession numbers, identifiers etc into a > single table for > > indexing. > > OK, when I said searching "would be very difficult if the > data is > stored inside XML strings", maybe it wasn't so difficult > for you - but > that still sounds complicated! > > Sticking with the GN lines and the synonym, if this was > stored as a > simple tag/value as usual in BioSQL, I would write my SQL > statement to > search the annotation table where the term id was that > associated with > a GN synonym, and the annotation value was "HABP1". > Simple. > > Using the XML approach, are you suggesting you could do a > full text > search on the annotation value field, looking for any rows > where the > field contains "HABP1", > where the term id matches > the GN lines' XML string? This sounds simplistic and > probably rather > slow - presumably why you resorted to the more complicated > indexing > scheme described above?
>
> > What I mean is, a fully normalized relational representation,
> > especially if nested, is often not the most efficient data structure
> > for efficient searching and filtering.
>
> OK. But do we really need to worry about complex nested structures
> for the SwissProt annotation (or in general)?
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From biopython at maubp.freeserve.co.uk Sun May 24 10:42:14 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 24 May 2009 11:42:14 +0100
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL
In-Reply-To: <867081.50034.qm@web62404.mail.re1.yahoo.com>
References: <867081.50034.qm@web62404.mail.re1.yahoo.com>
Message-ID: <320fb6e00905240342t7d59f783t8203cce581256f88@mail.gmail.com>

On Sun, May 24, 2009 at 5:10 AM, Michiel de Hoon wrote:
>
> I suggest that for the short term, we store the DE lines as one
> string in the same way as Bioperl 1.5 and 1.6, until we decide
> on a more advanced way to treat these lines.

Agreed.

> Currently Bio.SeqIO and Bio.SwissProt use different ways to
> handle the DE lines, and neither of them agrees with Bioperl.

Well, Bio.SeqIO agrees with BioPerl modulo the white space - but we
might as well agree with the current BioPerl behaviour until something
is settled for storing more complex objects than strings in BioSQL.

As I mentioned earlier, I'll be away for this week, so feel free to
press ahead with this.
Peter From bugzilla-daemon at portal.open-bio.org Mon May 25 18:21:26 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 May 2009 14:21:26 -0400 Subject: [Biopython-dev] [Bug 2840] New: When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord fails in _load_reference Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2840 Summary: When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord fails in _load_reference Product: Biopython Version: Not Applicable Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: david.wyllie at ndm.ox.ac.uk Hi I have been trying to load SeqRecords from BioSQL, annotate them, and then write them to a different BioSQL database. Reloading the record to the second database fails. This isn't to do with annotation - none is performed. This issue is different from #2838, which has been addressed (thank you). The sequence of events is 1) eFetch a SeqRecord from Genbank (succeeds) 2) write to BioSQL (succeeds) 3) recover from BioSQL (succeeds) 4) write to BioSQL (fails, although no modifications have been made). The current problem seems related to references: Loader.load_seqrecord._load_reference. Error says: _load_reference start = 1 + int(str(reference.location[0].start)) ValueError: invalid literal for int() with base 10: 'None' Testing has been done on Ubuntu 9 x64 with Python 2.6 (debian package), python-dev (debian package), load from CVS as of 24.5.09, and a testcase program, dbtestcase.py, attached to the now fixed bug #2838. To run dbtestcase.py, the mysql details will have to be altered on line beginning ad=AuthDetails(... but otherwise it should I think run. Traceback and program output from dbtestcase.py follow. 
dwyllie at dwyllie:~/programs/CheckleyProject/src$ python dbtestcase.py OK, going to recover record 28804743 from genbank.... Record loaded looks like this: ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Load from Entrez completed, records= 1 Here is the loaded record: ======================================================================== ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. Number of features: 5 /sequence_version=1 /source=chloroplast Ceratodon purpureus /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon'] /keywords=[''] /references=[, , , ] /accessions=['AB098727'] /data_file_division=PLN /date=26-MAY-2005 /organism=Ceratodon purpureus /gi=28804743 Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', IUPACAmbiguousDNA()) ======================================================================== Now loading these records into a BioSQL database One. 
/var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet Creating a new database One ======================================================================== Load from database One completed, records= 1 ======================================================================== Here is the record recovered from database One: ID: AB098727.1 Name: AB098727 Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal protein S11, cytochromoe b/f complex subunit IV, partial cds. Number of features: 5 /dates=['26-MAY-2005'] /ncbi_taxid=3225 /date=['26-MAY-2005'] /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Bryopsida', 'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon', 'Ceratodon purpureus'] /source=['chloroplast Ceratodon purpureus'] /references=[, , , ] /gi=28804743 /data_file_division=PLN /keywords=[''] /organism=Ceratodon purpureus /sequence_version=['1'] /accessions=['AB098727'] DBSeq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA', DNAAlphabet()) ======================================================================== Creating a new database Two Traceback (most recent call last): File "dbtestcase.py", line 165, in from dbtestcase import AuthDetails File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 182, in DemonstrateProblem(problemgi,ad) File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 158, in DemonstrateProblem db2.load(listtoload) File "/usr/local/lib/python2.6/dist-packages/BioSQL/BioSeqDatabase.py", line 442, in load db_loader.load_seqrecord(cur_record) File "/usr/local/lib/python2.6/dist-packages/BioSQL/Loader.py", line 57, in load_seqrecord self._load_reference(reference, rank, bioentry_id) File "/usr/local/lib/python2.6/dist-packages/BioSQL/Loader.py", line 733, in _load_reference start = 1 + int(str(reference.location[0].start)) ValueError: invalid literal for int() with base 10: 'None' 
dwyllie at dwyllie:~/programs/CheckleyProject/src$

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon May 25 18:23:52 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 14:23:52 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference
In-Reply-To: 
Message-ID: <200905251823.n4PINq60005295@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840

david.wyllie at ndm.ox.ac.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|When a record has been      |When a record has been
                   |loaded from BioSQL, trying  |loaded from BioSQL, trying
                   |to save it to another       |to save it to another
                   |database fails with loader  |database fails with loader
                   |db_loader.load_seqrecord    |db_loader.load_seqrecord in
                   |fails in _load_reference    |_load_reference

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon May 25 22:23:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 May 2009 18:23:20 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200905252223.n4PMNKL7023601@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #1 from david.wyllie at ndm.ox.ac.uk 2009-05-25 18:23 EST ------- I have modified the dbtestcase.py script to show the contents of the reference of the record downloaded from genbank, and from the record recovered from BioSQL. Here is a print out of the last two references before saving to BioSQL: authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M. title: Molecular evidence of an rpoA gene in the basal moss chloroplast genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses journal: Hikobia 14, 171-175 (2004) medline id: pubmed id: comment: location: [0:789] authors: Sugita,M. title: Direct Submission journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan (E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080), Fax:81-52-789-3080) medline id: pubmed id: comment: --- note: no location in the first one; only a location in the last reference (why? - should references have a location? I suppose they might, if they referred to a part of a chromosome?) Now, after saving to BioSQL and recovering, all the records have a location, but in some cases, it is [None:None]; here are the same two records. location: [None:None] authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M. 
title: Molecular evidence of an rpoA gene in the basal moss chloroplast
genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses
journal: Hikobia 14, 171-175 (2004)
medline id:
pubmed id:
comment:

location: [0:789]
authors: Sugita,M.
title: Direct Submission
journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for
Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan
(E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080),
Fax:81-52-789-3080)
medline id:
pubmed id:
comment:


After this, the db.load method calls _load_reference.

I think the problem is because the last line doesn't cope with none values.
If one edits
_load_reference to put the last reference inside a test for the null condition

        if (start is not None and end is not None):
            sql = "INSERT INTO bioentry_reference (bioentry_id, reference_id," \
                  " start_pos, end_pos, rank)" \
                  " VALUES (%s, %s, %s, %s, %s)"
            self.adaptor.execute(sql, (bioentry_id, reference_id,
                                       start, end, rank + 1))

Then the problem is solved, but I'm not sure how this fits in the bigger scheme
of things.

d

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
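
The failure described in comment #1 can be reproduced and guarded against in isolation. The following is only a minimal sketch, not the BioSQL/Loader.py code: the Location class and the reference_span helper are hypothetical stand-ins, and the only detail taken from the report is the failing expression, 1 + int(str(reference.location[0].start)).

```python
class Location(object):
    """Hypothetical stand-in for a location object with start/end."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


def reference_span(location):
    """Return 1-based (start, end) for a reference, or (None, None).

    `location` is a list holding zero or one Location objects, as in
    the printouts above. A missing location and a [None:None] location
    are both treated as "no span".
    """
    if not location:
        return None, None
    first = location[0]
    if first.start is None or first.end is None:
        return None, None
    # Same arithmetic as the failing line, but only reached once the
    # values are known not to be None.
    return 1 + int(str(first.start)), int(str(first.end))


# The unguarded expression fails exactly as in the traceback, because
# str(None) is the text 'None', which int() cannot parse.
try:
    1 + int(str(Location(None, None).start))
    raised = False
except ValueError:
    raised = True
```

With a helper like this, the caller can skip the bioentry_reference INSERT (or pass NULLs) whenever the result is (None, None).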
From bugzilla-daemon at portal.open-bio.org Mon May 25 22:26:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 May 2009 18:26:21 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200905252226.n4PMQK9o023893@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #2 from david.wyllie at ndm.ox.ac.uk 2009-05-25 18:26 EST ------- Created an attachment (id=1305) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1305&action=view) A program which tests for the problem. Alter the ad=AuthDetails line to include MySQl passwords for your system; using root and no password in the script as is. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue May 26 00:14:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 May 2009 20:14:40 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200905260014.n4Q0EeBh030704@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #3 from cymon.cox at gmail.com 2009-05-25 20:14 EST ------- (In reply to comment #1) > I have modified the dbtestcase.py script to show the contents of the reference > of the record downloaded from genbank, and from the record recovered from > BioSQL. > > Here is a print out of the last two references before saving to BioSQL: > > authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M. 
> title: Molecular evidence of an rpoA gene in the basal moss chloroplast > genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses > journal: Hikobia 14, 171-175 (2004) > medline id: > pubmed id: > comment: > > location: [0:789] > authors: Sugita,M. > title: Direct Submission > journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for > Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan > (E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080), > Fax:81-52-789-3080) > medline id: > pubmed id: > comment: > > --- note: no location in the first one; only a location in the last reference > (why? - should references have a location? I suppose they might, if they > referred to a part of a chromosome?) > > Now, after saving to BioSQL and recovering, all the records have a location, > but in some cases, it is [None:None]; here are the same two records. > > location: [None:None] > authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M. > title: Molecular evidence of an rpoA gene in the basal moss chloroplast > genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses > journal: Hikobia 14, 171-175 (2004) > medline id: > pubmed id: > comment: > > location: [0:789] > authors: Sugita,M. > title: Direct Submission > journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for > Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan > (E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080), > Fax:81-52-789-3080) > medline id: > pubmed id: > comment: > > > After this, the db.load method calls _load_reference. > > I think the problem is because the last line doesn't cope with none values. 
> If one edits
> _load_reference to put the last reference inside a test for the null condition
>
>         if (start is not None and end is not None):
>             sql = "INSERT INTO bioentry_reference (bioentry_id, reference_id," \
>                   " start_pos, end_pos, rank)" \
>                   " VALUES (%s, %s, %s, %s, %s)"
>             self.adaptor.execute(sql, (bioentry_id, reference_id,
>                                        start, end, rank + 1))
>
> Then the problem is solved, but I'm not sure how this fits in the bigger scheme
> of things.
>
> d
>

The BioSQL loader uses None for "start" and "end" if a reference doesn't have a
location. When the reference is retrieved the location remains set to
["None","None"]

Try this alteration to BioSeq.py, it should solve your problem:

cymon at gyra:~/git/github-master/BioSQL$ git diff BioSeq.py
diff --git a/BioSQL/BioSeq.py b/BioSQL/BioSeq.py
index cc47cf4..8d1e02a 100644
--- a/BioSQL/BioSeq.py
+++ b/BioSQL/BioSeq.py
@@ -351,8 +351,11 @@ def _retrieve_reference(adaptor, primary_id):
     references = []
     for start, end, location, title, authors, dbname, accession in refs:
         reference = SeqFeature.Reference()
-        if start: start -= 1
-        reference.location = [SeqFeature.FeatureLocation(start, end)]
+        if start:
+            start -= 1
+            reference.location = [SeqFeature.FeatureLocation(start, end)]
+        else:
+            reference.location = []
         #Don't replace the default "" with None.
         if authors : reference.authors = authors
         if title : reference.title = title

Here's a patch for the unittest to compare locations of injected and retrieved
records:

diff --git a/Tests/test_BioSQL_SeqIO.py b/Tests/test_BioSQL_SeqIO.py
index 2d8caf8..9479e02 100644
--- a/Tests/test_BioSQL_SeqIO.py
+++ b/Tests/test_BioSQL_SeqIO.py
@@ -360,6 +360,19 @@ def compare_records(old, new) :
             assert len(old.annotations[key]) == len(new.annotations[key])
             for old_r, new_r in zip(old.annotations[key], new.annotations[key]) :
                 compare_references(old_r, new_r)
+            for old_ref, new_ref in zip(old.annotations[key],
+                                        new.annotations[key]):
+                if old_ref.location == []:
+                    assert new_ref.location == [], "old_reference.location %s !=" \
+                        "new_reference location %s" % (old_ref.location,
+                                                       new_ref.location)
+                else:
+                    assert old_ref.location[0].start == new_ref.location[0].start, \
+                        "old ref.location[0].start %s != new ref.location[0].start %s" % \
+                        (old_ref.location[0].start, new_ref.location[0].start)
+                    assert old_ref.location[0].end == new_ref.location[0].end, \
+                        "old ref.location[0].end %s != new ref.location[0].end %s" % \
+                        (old_ref.location[0].end, new_ref.location[0].end)
         elif key == "comment":
             if isinstance(old.annotations[key], list):
                 old_comment = [comm.replace("\n", " ") for comm in \

Cheers, Cymon

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
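
The round trip that the patch in comment #3 restores can be illustrated without a database. This is a rough sketch under stated assumptions: to_row and from_row are hypothetical names, a location is modelled as a plain (start, end) tuple rather than a FeatureLocation, and the 0-based/1-based offset is inferred from the "start -= 1" in the diff above.

```python
def to_row(location):
    """Loader side: 0-based location list -> 1-based (start_pos, end_pos).

    An empty list means the reference has no location, so both
    columns are stored as None (NULL in the database).
    """
    if not location:
        return None, None
    start, end = location[0]
    return start + 1, end


def from_row(start_pos, end_pos):
    """Retrieval side: 1-based columns -> 0-based location list.

    Returns [] when either column is unset, so the result can be fed
    straight back into to_row() without producing [None:None].
    """
    if start_pos is None or end_pos is None:
        return []
    return [(start_pos - 1, end_pos)]


# A reference without a location and one with a location both survive
# a save/load/save cycle unchanged.
no_location = from_row(*to_row([]))
with_location = from_row(*to_row([(0, 789)]))
```

The sketch tests for None explicitly, whereas the patch uses the truthiness of start; the two agree as long as stored start positions are 1-based and therefore never zero.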
From bugzilla-daemon at portal.open-bio.org Tue May 26 14:17:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 26 May 2009 10:17:48 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference
In-Reply-To: 
Message-ID: <200905261417.n4QEHmf9007821@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840

------- Comment #4 from cymon.cox at gmail.com 2009-05-26 10:17 EST -------
(In reply to comment #3)
> (In reply to comment #1)

The functions in old Tests/BioSQL_Seq.py have moved to seq_tests_common.py. So
I've updated seq_tests_common:

diff --git a/Tests/seq_tests_common.py b/Tests/seq_tests_common.py
index d3b7fb4..392a96c 100644
--- a/Tests/seq_tests_common.py
+++ b/Tests/seq_tests_common.py
@@ -40,10 +40,17 @@ def compare_references(old_r, new_r) :
     #allow us to store a consortium.
     assert new_r.consrtm == ""
 
-    #TODO - reference location?
-    #The parser seems to give a location object (i.e. which
-    #nucleotides from the file is the reference for), while the
-    #we seem to use the database to hold the journal details (!)
+    # Reference location
+    if old_r.location == []:
+        assert new_r.location == [], "old_r.location %s != " \
+            "new_r.location %s" % (old_r.location, new_r.location)
+    else:
+        assert old_r.location[0].start == new_r.location[0].start, \
+            "old_r.location[0].start %s != new_r.location[0].start %s" % \
+            (old_r.location[0].start, new_r.location[0].start)
+        assert old_r.location[0].end == new_r.location[0].end, \
+            "old_r.location[0].end %s != new_r.location[0].end %s" % \
+            (old_r.location[0].end, new_r.location[0].end)
 
     return True

Pushed to http://github.com/cymon/biopython-github-master/tree/bug2840

C.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Tue May 26 17:32:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 26 May 2009 13:32:34 -0400
Subject: [Biopython-dev] [Bug 2841] New: SeqFeature constructor ignores qualifiers and sub_features arguments
Message-ID: 

http://bugzilla.open-bio.org/show_bug.cgi?id=2841

           Summary: SeqFeature constructor ignores qualifiers and
                    sub_features arguments
           Product: Biopython
           Version: 1.50
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: n.j.loman at bham.ac.uk


The constructor to Bio.SeqFeature.SeqFeature ignores qualifiers and
sub_features, although the prototype to the constructor allows these keyword
arguments to be specified.

I see in the code there is a reason for it to be ignored:

        # XXX right now sub_features and qualifiers cannot be set
        # from the initializer because this causes all kinds
        # of recursive import problems. I can't understand why this is
        # at all :-<
        self.qualifiers = {}
        self.sub_features = []

However, would it not be better to get rid of the keyword arguments from the
constructor prototype to stop people getting confused? I keep stumbling over
this problem myself and forgetting about it.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
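
The pitfall reported in bug 2841, and the usual way around it, can be shown with a toy class. This is a minimal sketch and does not use Biopython: FeatureSketch and make_feature are hypothetical names, and FeatureSketch only imitates the behaviour described in the report (keyword arguments accepted by the constructor but then ignored).

```python
class FeatureSketch(object):
    """Hypothetical stand-in that mimics the reported behaviour."""

    def __init__(self, type="", qualifiers=None, sub_features=None):
        self.type = type
        # As in the snippet quoted in the report: the arguments are
        # accepted but the attributes are always reset to empty values.
        self.qualifiers = {}
        self.sub_features = []


def make_feature(type, qualifiers=None, sub_features=None):
    """Workaround: assign the attributes after construction."""
    feature = FeatureSketch(type=type)
    # Copy the inputs so that features never share mutable containers.
    feature.qualifiers = dict(qualifiers or {})
    feature.sub_features = list(sub_features or [])
    return feature


ignored = FeatureSketch(type="CDS", qualifiers={"gene": ["rps11"]})
applied = make_feature("CDS", qualifiers={"gene": ["rps11"]})
```

Here ignored.qualifiers stays empty while applied.qualifiers holds the gene entry, which is the difference a caller would otherwise have to remember each time.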
From bugzilla-daemon at portal.open-bio.org Wed May 27 07:57:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 27 May 2009 03:57:05 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference
In-Reply-To: 
Message-ID: <200905270757.n4R7v5iv004300@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840

------- Comment #5 from david.wyllie at ndm.ox.ac.uk 2009-05-27 03:57 EST -------
Thank you very much! I haven't tested the unit tests but the patch in #3
resolves the problem.

With best wishes

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From mjldehoon at yahoo.com Sat May 30 09:37:35 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 30 May 2009 02:37:35 -0700 (PDT)
Subject: [Biopython-dev] More SwissProt inconsistencies
Message-ID: <880385.97797.qm@web62401.mail.re1.yahoo.com>

Looking some more at how Bio.SeqIO and Bio.SwissProt store the information in a SwissProt file, I found the following two inconsistencies:

1) A multi-line author list such as the following:

RA   Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,
RA   Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,
RA   Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,
RA   Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,
RA   Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,
RA   Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,
RA   Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,
RA   Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,
RA   Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,
RA   Barrell B.G., Hall N.;

is stored without newlines by
Bio.SeqIO:

>>> seq_record.annotations['references'][0].authors
"Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,Barrell B.G., Hall N.;"

but with newlines by Bio.SwissProt:

>>> swiss_record.references[0].authors
"Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,\nKerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,\nCoulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,\nGardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,\nLarke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,\nNene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,\nRawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,\nSquares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,\nLangsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,\nBarrell B.G., Hall N.;"

To me, the Bio.SeqIO approach seems more reasonable. I think we should add a space though at places where there is a newline in the file.

The same happens for multiline RL such as

RL   (In) Baker M.J., Crush J.R., Humphreys L.R. (eds.);
RL   Proceedings of the XVII international grassland congress,
RL   pp.2:1033-1034, Dunmore Press, Palmerston North (1993).

and for multiline RT lines such as

RT   "Genome of the host-cell transforming parasite Theileria annulata
RT   compared with T. parva.";

This is stored by Bio.SeqIO as

'"Genome of the host-cell transforming parasite Theileria annulatacompared with T.
parva.";'

and by Bio.SwissProt as

'"Genome of the host-cell transforming parasite Theileria annulata\ncompared with T. parva.";'

whereas I think that both should be stored as

'"Genome of the host-cell transforming parasite Theileria annulata compared with T. parva.";'

2) Comments in a reference such as the following:

RC   STRAIN=cv. VF36; TISSUE=Anther;

are stored as a single string by Bio.SeqIO:

>>> seq_record.annotations['references'][i].comment
'STRAIN=cv. VF36; TISSUE=Anther;'

but as a list of (key, value) pairs by Bio.SwissProt:

[('STRAIN', 'cv. VF36'), ('TISSUE', 'Anther')]

While I think both are reasonable, Bio.SeqIO drops the space between two (key, value) pairs if they are on two separate lines:

RC   STRAIN=C57BL/6J;
RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;

is stored as

>>> seq_record.annotations['references'][i].comment
'STRAIN=C57BL/6J;TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;'

I think we should add a space here, or just store these as (key, value) pairs as Bio.SwissProt is doing.

Any objections or comments?

--Michiel
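
The two behaviours proposed above (joining wrapped lines with a single space, and splitting RC text into (key, value) pairs) could look roughly like this. It is a sketch only, not the Bio.SeqIO or Bio.SwissProt parser code: join_continuation and parse_rc are hypothetical helpers, and they assume that the two-letter line code (RA, RT, RL, RC) has already been stripped from each line.

```python
def join_continuation(lines):
    """Join the text of wrapped lines with a single space."""
    return " ".join(line.strip() for line in lines if line.strip())


def parse_rc(lines):
    """Split RC text into a list of (key, value) pairs.

    Assumes tokens are separated by ';' and look like KEY=value; a
    value that itself contains ';' would need a smarter split.
    """
    pairs = []
    for token in join_continuation(lines).split(";"):
        token = token.strip()
        if not token:
            continue
        key, _sep, value = token.partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs


title = join_continuation([
    '"Genome of the host-cell transforming parasite Theileria annulata',
    'compared with T. parva.";',
])
comment = parse_rc([
    "STRAIN=C57BL/6J;",
    "TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;",
])
```

Joining with a space keeps "annulata compared" readable, and the pair form avoids the question of whether a space belongs between "STRAIN=C57BL/6J;" and "TISSUE=...".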