From chapmanb at 50mail.com  Fri May  1 08:11:25 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 1 May 2009 08:11:25 -0400
Subject: [Biopython-dev] MUMmer
In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
Message-ID: <20090501121125.GD50777@sobchak.mgh.harvard.edu>

Marcin;

> I guess I should start with a nice 'hi' to everybody, now that I am
> sending my first message to this group. So: Hi, Everybody! 

Welcome. We are happy to have you.

> Now, that we have the formality out of the way, I will get to the point.
> Recently, I have written some Python code for parsing and processing the
> output of MUMmer tool (http://mummer.sourceforge.net/). More
> specifically, the code I have manages invocations and handles outputs of
> the nucmer pipeline (alignment of multiple closely related nucleotide
> sequences) and of mummer itself (short exact matches). Obviously, the
> results are ultimately rendered as pairs of biopython's Seq objects. 

This is great -- we don't have support for MUMmer alignments so this
is very welcome.

> I use this stuff only myself, in work on bacterial genomes, but I would
> be more than willing to contribute it to the project. It may be rough
> around the edges at the moment, but I think I could easily give it the
> necessary polish if there is interest in having it included. 

As Bartek mentioned, the first step is to organize the code you have
and start it as a branch on GitHub. Being able to see the code will
help us make specific suggestions. Generally, based on what you've
written it sounds like this will fit into the alignment interfaces.
Peter and Cymon have been working on organizing this. Support for
command lines and running programs lives in:

http://github.com/biopython/biopython/tree/master/Bio/Align/Applications

Parsing output and returning alignment objects is organized in the
AlignIO module:

http://github.com/biopython/biopython/tree/master/Bio/AlignIO
http://www.biopython.org/wiki/AlignIO

Tests are an important part of the submission process and many
examples are found here:

http://github.com/biopython/biopython/tree/master/Tests

test_Clustalw.py is an example of a print and compare style test,
and test_Mafft_tool.py is a unittest style test. We are more
concerned with good testing coverage than how exactly the tests get
written.

We can definitely help with more specific feedback but hopefully
this gives you a general idea to get started.

Looking forward to seeing the code,
Brad

From chapmanb at 50mail.com  Fri May  1 08:28:06 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 1 May 2009 08:28:06 -0400
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com>
References: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com>
Message-ID: <20090501122806.GE50777@sobchak.mgh.harvard.edu>

Eric;
Thanks for summarizing the issues. I know Peter is taking a few well
deserved days off but I suspect he will have some thoughts when he
returns. We'd love to hear the experience of others who have used
different python XML parsers.

My lean is towards ElementTree for reasons of code clarity. SAX
parsers require a lot of boilerplate style code. They also can be
tricky with nested elements; I always find myself using a lot of "if
in_tag; else if in_tag" style code. ElementTree eliminates a lot of
these issues which should result in easier to maintain code.
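
To make the comparison concrete, here is a minimal sketch that pulls the
same values out of a small document with both approaches. It is written
for current Python 3 rather than the 2.x releases under discussion, and
the XML snippet and the NameHandler class are made up for illustration;
neither comes from Biopython.

```python
import xml.sax
from xml.etree import ElementTree

# Made-up sample document, only loosely shaped like a tree file.
XML = ("<phylogeny><clade><name>A</name></clade>"
       "<clade><name>B</name></clade></phylogeny>")


class NameHandler(xml.sax.ContentHandler):
    """SAX version: we must track by hand whether we are inside <name>."""

    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []
        self._chars = []

    def startElement(self, name, attrs):
        if name == "name":
            self.in_name = True
            self._chars = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self.in_name:
            self._chars.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._chars))
            self.in_name = False


handler = NameHandler()
xml.sax.parseString(XML.encode("utf-8"), handler)
sax_names = handler.names

# ElementTree version: the nesting is handled by the tree itself.
etree_names = [el.text for el in ElementTree.fromstring(XML).iter("name")]

print(sax_names, etree_names)
```

The state flag in the handler is the "if in_tag" pattern mentioned above;
it grows with every additional nested element you care about.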

Brad

> I'm writing a parser for the PhyloXML format for Google Summer of Code this
> year, and as the name would imply, it requires parsing some large XML files.
> The existing modules in Biopython for parsing XML formats seem to use
> xml.sax in the standard library. In Python 2.5, a faster and more Pythonic
> parser was added to the standard lib: ElementTree (xml.etree), in
> pure-Python and C-enhanced flavors. How do you feel about each of these
> libraries as the basis for a new Biopython module?
> 
> Here are some interesting benchmarks:
> http://effbot.org/zone/celementtree.htm#benchmarks
> 
> The ElementTree library is also available as a standalone package,
> compatible back to Python 2.1, and the lxml package also offers an
> independent implementation. So maintaining compatibility with Python 2.4
> would require the availability of one of these third-party packages, and my
> code would try each of these imports in order:
> 
> from xml.etree import cElementTree as ElementTree
> from xml.etree import ElementTree
> # Separate lxml package
> from lxml.etree import ElementTree
> # Standalone elementtree package
> import cElementTree as ElementTree
> from elementtree import ElementTree
> 
> Then one day, when Python 2.4 is no longer supported, only the first two
> lines would be needed. (The second line is for sites that disable C
> extensions, like Google App Engine, or alternate Python implementations like
> Jython.)
> 
> Another option is xml.parsers.expat, but just Googling around, it appears
> that the Python zeitgeist is strongly in favor of xml.etree for new code.
> 
> Thoughts?
> 
> Thanks,
> Eric
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
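
The import order listed in the quoted message could be wrapped in a loop
along the following lines. This is only a sketch for current Python 3:
load_element_tree and CANDIDATES are illustrative names, and lxml.etree
is loaded as a module here rather than importing the ElementTree class as
in the quoted lines.

```python
import importlib

# Module names in the order of preference given in the message above.
# Which of them exist depends on the Python version and on what is
# installed; xml.etree.ElementTree is the one a stock Python 3 provides.
CANDIDATES = (
    "xml.etree.cElementTree",
    "xml.etree.ElementTree",
    "lxml.etree",
    "cElementTree",
    "elementtree.ElementTree",
)


def load_element_tree(candidates=CANDIDATES):
    """Return the first importable module from candidates."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("No ElementTree implementation found")


ElementTree = load_element_tree()
root = ElementTree.fromstring("<clade><name>A</name></clade>")
print(ElementTree.__name__, root.find("name").text)
```

Keeping the candidates in one tuple means that dropping support for an
old Python version is a one-line change.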

From marcin.swiatek at mail.mcgill.ca  Fri May  1 14:17:14 2009
From: marcin.swiatek at mail.mcgill.ca (Marcin Swiatek)
Date: Fri, 1 May 2009 14:17:14 -0400
Subject: [Biopython-dev] MUMmer
In-Reply-To: <8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com>
References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
	<8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com>
Message-ID: <176A06E658ED0745965C072C5F2C116A037F084C@EXCHANGE2VS2.campus.mcgill.ca>

Bartek, Brad,

Thank you for the suggestions. I will set myself up as proposed and see
what I can do to align my code with local customs and traditions. If
questions arise, I will post again. 

As for the use of alignment object, I have actually chosen to represent
'candidate' matches by my own simplistic class. Nucmer, the way I use
it, generates lots of spurious matches, which I always need to somehow
filter. Thus, it seemed perfectly reasonable at the time to create the
proper representation of alignment later on, in a separate function
call. Following your suggestion I will probably change it to return an
alignment object, rather than a pair of sequences. But details are best
discussed once the code is available, so I think we will return to this
matter later. 

Regards,

Marcin


-----Original Message-----
From: barwil at gmail.com [mailto:barwil at gmail.com] On Behalf Of Bartek
Wilczynski
Sent: Thursday, April 30, 2009 12:51 PM
To: Marcin Swiatek
Cc: biopython-dev at biopython.org
Subject: Re: [Biopython-dev] MUMmer

Hi Marcin,

On Thu, Apr 30, 2009 at 5:23 PM, Marcin Swiatek
<marcin.swiatek at mail.mcgill.ca> wrote:
> Hello,
>
>
>
> I use this stuff only myself, in work on bacterial genomes, but I would
> be more than willing to contribute it to the project. It may be rough
> around the edges at the moment, but I think I could easily give it the
> necessary polish if there is interest in having it included.
>
Contributions are always welcome.

>
>
> Should that be the case, could one of the project leads point me in the
> right direction, please? How should I go about the submission?
>
>
I don't think I qualify as a lead, but nonetheless I think I can help
here.

I think that the best way to submit your code currently is to create a
branch (fork) of
biopython on github and submit your changes there and then notify
people on biopython-dev
that there is new code to review. You can also submit an enhancement
bug to bugzilla.

There are a couple of wiki pages which might be of interest  to you:
- http://biopython.org/wiki/Contributing
- http://biopython.org/wiki/GitUsage

If you have any questions or problems during the process, ask on the
list.

As for the code, I'm not sure, but maybe instead of returning a pair
of sequences, an alignment object might be a better choice?

You might also want to check out some recent code on application wrappers:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html

cheers
  Bartek


From bugzilla-daemon at portal.open-bio.org  Fri May  1 14:16:57 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 1 May 2009 14:16:57 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905011816.n41IGvXO012709@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


------- Comment #8 from eric.talevich at gmail.com  2009-05-01 14:16 EST -------
(In reply to comment #7)
> (In reply to comment #2)
> > Python 2.6 includes a context manager that makes all these problems
> > *completely* go away, by catching all of the warnings raised within a
> > context and optionally storing them as a list of warning objects that
> > can be inspected.
> 
> That sounds much better :)
> 
> > Would you be interested in having a unit test that does a more thorough
> > check of the warnings system, but only runs on Py2.6? I'm guessing no,
> > but hey, worth a shot.
> 
> Yes - other than using the old print-and-compare test, this seems worth doing
> in order to actually test the warnings we expect are being issued.  It could be
> a whole new file, test_PDB_warnings.py which required Python 2.6+, but as its
> just one or two tests, maybe just use conditional method(s) within the
> test_PDB_unit.py file.
> 
> Peter
> 

I have something that works on both Py2.5 and Py2.6 now:
http://github.com/etal/biopython/tree/pdbtidy

I added a new file called _PDB_extra.py which test_PDB_unit.py imports if an
attribute called 'catch_warnings' is available in the current warnings module.
If so, the method test_warnings is added to the class, otherwise nothing
happens. So Py2.6 runs 9 tests in test_PDB_unit.py, while Py2.5 only runs 8.

This seemed easier than creating a whole separate unittest suite for one tricky
test, but I defer to you on the organization and naming. I think I'll need to
do a similar separation of tests for PhyloXML, so I'd like to have a consistent
pattern to follow here.

Also, apparently tests are run in alphabetical order, and Exposure was jumping
ahead of PDBExceptionTest. I renamed PDBExceptionTest to ExceptionTest to
restore the natural order of things and stop setting off the warnings
prematurely. Maybe test suites with multiple TestCase classes should be
arranged alphabetically in the code to avoid confusion in the future.
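
For reference, the pattern described above (attach the extra test only
when warnings.catch_warnings exists) might look roughly like this. It is
a sketch for current Python 3, where the attribute is always present;
parse_with_warning and the test bodies are hypothetical stand-ins, not
the contents of _PDB_extra.py.

```python
import unittest
import warnings


def parse_with_warning():
    """Hypothetical stand-in for a parser call that issues a warning."""
    warnings.warn("discontinuous chain", UserWarning)
    return "structure"


class ExceptionTest(unittest.TestCase):
    def test_result(self):
        # Silence the warning here; this test only checks the return value.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            self.assertEqual(parse_with_warning(), "structure")


def _test_warnings(self):
    # record=True collects the warnings in a list instead of printing them.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        parse_with_warning()
    self.assertEqual(len(caught), 1)
    self.assertTrue(issubclass(caught[0].category, UserWarning))


# Attach the extra test only where the context manager is available.
if hasattr(warnings, "catch_warnings"):
    ExceptionTest.test_warnings = _test_warnings

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExceptionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

On an interpreter without catch_warnings the class would simply have one
test fewer, which matches the 9 versus 8 count mentioned above.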


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Mon May  4 06:57:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 06:57:33 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041057.n44AvXil006684@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1288 is|0                           |1
           obsolete|                            |


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 06:57 EST -------
Created an attachment (id=1289)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1289&action=view)
Patch to add keyword arguments and properties to command line wrappers

Brad likes the idea, and as the Bio.Application module owner that's good :)
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005963.html

This patch makes a very slight difference to reduce the changes needed to old
code (i.e. in the __init__ method use self.parameters = [...] as before) with
the bonus that the base class and subclasses have the same __init__ signature
(argument list).

This patch also now covers Bio.Align.Applications, Bio.Motif.Applications and
Bio.AlignAce.Applications as well as Bio.Emboss.Applications (i.e. all affected
files).

Peter



From p.j.a.cock at googlemail.com  Mon May  4 08:02:59 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 May 2009 13:02:59 +0100
Subject: [Biopython-dev] MUMmer
In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
Message-ID: <320fb6e00905040502y4785a0f9t4475ab0868a791c@mail.gmail.com>

On Thu, Apr 30, 2009 at 4:23 PM, Marcin Swiatek
<marcin.swiatek at mail.mcgill.ca> wrote:
> Hello,
>
> I guess I should start with a nice 'hi' to everybody, now that I am
> sending my first message to this group. So: Hi, Everybody!

Hi!

> Now, that we have the formality out of the way, I will get to the point.
> Recently, I have written some Python code for parsing and processing the
> output of MUMmer tool (http://mummer.sourceforge.net/). More
> specifically, the code I have manages invocations and handles outputs of
> the nucmer pipeline (alignment of multiple closely related nucleotide
> sequences) and of mummer itself (short exact matches). Obviously, the
> results are ultimately rendered as pairs of biopython's Seq objects.
>
> I use this stuff only myself, in work on bacterial genomes, but I would
> be more than willing to contribute it to the project. It may be rough
> around the edges at the moment, but I think I could easily give it the
> necessary polish if there is interest in having it included.

Great!  I assume you're OK with our licence, and there are no problems
from your employer/University with a contribution like this?

> Should that be the case, could one of the project leads point me in the
> right direction, please? How should I go about the submission?

In terms of showing us the code, how do you feel about trying out
github (see Bartek's email)?  Alternatively file an enhancement bug
on our bugzilla and upload your current python file (or a zip file if this
is split up into several modules).

From your description above it sounds like you have two main lumps
of code: a pairwise alignment parser, and some command line tool
wrappers.

Brad and Bartek have already mentioned returning Alignment objects,
that would let us integrate MUMmer as an input format for Bio.AlignIO,
http://biopython.org/wiki/AlignIO
It may be helpful to have a look at how we parse FASTA output into
pairwise alignments, and also the EMBOSS "pairs" files from needle
and water.

Although (as Brad mentioned), this is currently undergoing a little flux,
for the command line wrappers I'd like this to use our Bio.Application
framework to represent the command line object, giving a string the
user can then invoke as they prefer.  Having the MUMmer wrapper
under Bio.Align.Applications seems sensible at this point.
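
As an illustration of "a string the user can then invoke", a command line
held as a string can be run with the standard subprocess module. This is
a minimal sketch for current Python 3 and not the Bio.Application API:
the running Python interpreter stands in for a real tool such as nucmer
so that the example runs anywhere.

```python
import shlex
import subprocess
import sys

# Stand-in command string. A wrapper object would produce something like
# this when converted to a string; here the interpreter replaces the tool.
command = "%s -c %s" % (shlex.quote(sys.executable),
                        shlex.quote("print(6 * 7)"))

# Split the string into an argument list so that no shell is involved.
completed = subprocess.run(
    shlex.split(command), capture_output=True, text=True, check=True
)
output = completed.stdout.strip()
print(output)
```

Because the wrapper only hands back a string, the caller stays free to
use os.system, subprocess, or a cluster submission script instead.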

If you have been lurking on the dev mailing list for a while, these
topics may be familiar already.  If not, have a look over the last
month or so in the archives here:
http://lists.open-bio.org/pipermail/biopython-dev/

Thanks,

Peter

From p.j.a.cock at googlemail.com  Mon May  4 08:15:04 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 May 2009 13:15:04 +0100
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <20090501122806.GE50777@sobchak.mgh.harvard.edu>
References: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com>
	<20090501122806.GE50777@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>

On Fri, May 1, 2009 at 1:28 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Eric;
> Thanks for summarizing the issues. I know Peter is taking a few well
> deserved days off but I suspect he will have some thoughts when he
> returns. We'd love to hear the experience of others who have used
> different python XML parsers.

I would be interested to hear Michiel's views on this, as he knows
more about the specifics of the existing XML parsers in Biopython
(e.g. Bio.Entrez).

> My lean is towards ElementTree for reasons of code clarity. SAX
> parsers require a lot of boilerplate style code. They also can be
> tricky with nested elements; I always find myself using a lot of "if
> in_tag; else if in_tag" style code. ElementTree eliminates a lot of
> these issues which should result in easier to maintain code.

We have been trying to avoid external library dependencies where
possible (moving away from Martel for parsing has really helped here).
Given ElementTree and cElementTree are included with Python 2.5+,
this is only an issue for Biopython running on Python 2.4.  Both
ElementTree and cElementTree are available as separate downloads
(with Windows installers).  I think under their licence we could even
bundle it with Biopython if need be.

So, while it is a shame ElementTree isn't part of Python 2.4, if it is
the best technical solution, that shouldn't stop us from using it.  Note
we should ONLY use those core features which are included with
Python 2.5+ itself.

Peter

P.S. I wonder if our BLAST XML parser would get a big speed boost
if we switched it to ElementTree instead of xml.sax?

From bugzilla-daemon at portal.open-bio.org  Mon May  4 09:47:25 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 09:47:25 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041347.n44DlPQD018238@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1289 is|0                           |1
           obsolete|                            |


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 09:47 EST -------
Created an attachment (id=1290)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1290&action=view)
Patch to add keyword arguments, properties and __repr__ to command line
wrappers

Extended to include __repr__ support (using the new keyword arguments support).

Note that the Muscle wrapper will need an alternative python valid identifier
for the -in argument, e.g. "input", because we can't use just "in" as a
property or keyword argument.



From bugzilla-daemon at portal.open-bio.org  Mon May  4 10:07:57 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 10:07:57 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041407.n44E7vI9020041@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1290 is|0                           |1
           obsolete|                            |


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 10:07 EST -------
Created an attachment (id=1291)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1291&action=view)
Patch to add keyword arguments, properties and __repr__ to command line
wrappers

As in previous patch but with support for clearing parameters by "deleting" the
property, and some basic doctests in Bio.Application.

Still need to co-ordinate with Cymon to give the Muscle wrapper a valid python
identifier as an alias for the -in argument, e.g. "input", because we can't use
just "in" as a property or keyword argument.



From biopython at maubp.freeserve.co.uk  Mon May  4 10:48:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 4 May 2009 15:48:53 +0100
Subject: [Biopython-dev] Properties in Bio.Application interface?
In-Reply-To: <20090430120532.GA50777@sobchak.mgh.harvard.edu>
References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com>
	<320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com>
	<320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com>
	<20090430120532.GA50777@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905040748w7a0b940aub82220b9c78e7dc3@mail.gmail.com>

On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> I love what you are doing here. The keywords and properties make
> it much more Pythonic; the old way reeks of Java-style get/sets. My
> vote is to put them both in.

Cool - I was hoping people would agree it is more pythonic.

I have some follow up thoughts, or points for discussion ...

Peter

From biopython at maubp.freeserve.co.uk  Mon May  4 10:53:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 4 May 2009 15:53:37 +0100
Subject: [Biopython-dev]  Properties names in command line wrappers
Message-ID: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>

On Mon, May 4, 2009 at 3:48 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> I love what you are doing here. The keywords and properties make
>> it much more Pythonic; the old way reeks of Java-style get/sets. My
>> vote is to put them both in.
>
> Cool - I was hoping people would agree it is more pythonic.
>
> I have some follow up thoughts, or points for discussion ...
>

I updated the patch on Bug 2822 to cover all the Bio.Application
command line wrapper subclasses, and included __repr__ support.
However, that has raised a real example of a parameter where the
current "human readable" name is not a valid python identifier ("in",
for "-in" in Muscle).  I think the pragmatic solution is to add a
sensible alternative which we can use for the property and keyword
argument name (e.g. "input" in this case) while in general keeping
these names as close as possible to the actual parameter name as used
at the command line.
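
The clash can be checked mechanically with the standard keyword module.
The sketch below (current Python 3) maps an option to a usable property
name; the ALIASES table and the property_name helper are hypothetical
names for illustration and are not part of the patch.

```python
import keyword

# Hypothetical alias table: command line option name -> Python-safe name.
ALIASES = {"in": "input"}


def property_name(option, aliases=ALIASES):
    """Return a valid Python identifier for a command line option."""
    name = option.lstrip("-")
    name = aliases.get(name, name)
    if keyword.iskeyword(name) or not name.isidentifier():
        raise ValueError("%r needs an alias; %r cannot be a property name"
                         % (option, name))
    return name


print(keyword.iskeyword("in"))    # "in" is reserved, hence the alias
print(property_name("-in"))
print(property_name("-gapopen"))
```

Only the options that collide with a keyword need an entry in the table,
so every other name stays identical to the real command line argument.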

On the other hand, some might argue for giving all the options
meaningful names.  The (hardly used) existing blastall wrapper in
Bio/Blast/Applications.py gives the "-a" argument a human readable
name of "nprocessors", and "-A" gets "window_size". With the old
set_parameter call either alias could be used.  However, with a python
property we need to pick one as a preferred name - and I'm not 100%
sure being helpful and using "nprocessors" (e.g. cline.nprocessors=4)
is actually better than using the actual argument name (e.g. cline.a =
4).

My instinct is that these are low level wrappers, which don't try to
second guess the user.  To take full advantage of any command line
tool you will need to read the tool's documentation to know what the
arguments are - and having Biopython making up its own aliases just
makes things more complicated.  Therefore I think the property names
in the command line wrapper objects should be as close as possible to
the actual command line arguments.  In this case, for blastall use "a"
for number of processors and "A" for window size.

However, I see the existing "helper functions" in
Bio/Blast/NCBIStandalone.py as a higher level wrapper, which tries to
insulate the user from the precise details of the command line string,
and here using an argument name "nprocessors" makes more sense
(although again, it differs from the actual command line making cross
referencing to the NCBI documentation more difficult).

What are your thoughts Brad?

Peter

From biopython at maubp.freeserve.co.uk  Mon May  4 11:03:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 4 May 2009 16:03:17 +0100
Subject: [Biopython-dev]  Switches in the Bio.Application interface
Message-ID: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com>

On Mon, May 4, 2009 at 3:48 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> I love what you are doing here. The keywords and properties make
>> it much more Pythonic; the old way reeks of Java-style get/sets. My
>> vote is to put them both in.
>
> Cool - I was hoping people would agree it is more pythonic.
>
> I have some follow up thoughts, or points for discussion ...
>
> Peter
>

It seems sensible to me to allow "deleting" a property to clear it.
There is an example in the proposed Bio/Application/__init__.py
docstring of how this would work:

>>> from Bio.Emboss.Applications import WaterCommandline
>>> cline = WaterCommandline(gapopen=10, gapextend=0.5)
>>> cline
WaterCommandline(cmd='water', gapopen=10, gapextend=0.5)

You can also manipulate the parameters via their properties, e.g.

>>> cline.gapopen
10
>>> cline.gapopen = 20
>>> cline
WaterCommandline(cmd='water', gapopen=20, gapextend=0.5)

You can clear a parameter you have already added by 'deleting' the
corresponding property:

>>> del cline.gapopen
>>> cline.gapopen
>>> cline
WaterCommandline(cmd='water', gapextend=0.5)
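
One way such get/set/delete behaviour can be built is with property
objects generated per parameter. The following is a simplified sketch for
current Python 3, not the code in the patch; SketchCommandline and
_make_property are made-up names.

```python
class SketchCommandline:
    """Hypothetical stand-in for a wrapper; not the Bio.Application code."""

    parameter_names = ("gapopen", "gapextend")

    def __init__(self, cmd="water", **kwargs):
        self.cmd = cmd
        self._values = {}
        for name, value in kwargs.items():
            if name not in self.parameter_names:
                raise ValueError("Unknown parameter %r" % name)
            setattr(self, name, value)

    def __repr__(self):
        args = ["cmd=%r" % self.cmd]
        args.extend("%s=%r" % item for item in self._values.items())
        return "%s(%s)" % (type(self).__name__, ", ".join(args))


def _make_property(name):
    """Build a property that stores its value in self._values[name]."""
    def getter(self):
        return self._values.get(name)   # None when the parameter is unset

    def setter(self, value):
        self._values[name] = value

    def deleter(self):
        self._values.pop(name, None)    # "del" clears the parameter

    return property(getter, setter, deleter)


for _name in SketchCommandline.parameter_names:
    setattr(SketchCommandline, _name, _make_property(_name))

cline = SketchCommandline(gapopen=10, gapextend=0.5)
cline.gapopen = 20
del cline.gapopen
print(repr(cline))
```

Generating the properties from a list of names keeps each wrapper
subclass down to a declaration of its parameters.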

That does seem to work and covers most situations; however, there is a
special case of command line "switches" (options which don't take a
value, like -kimura in ClustalW, or -l in ls).  There are a lot of
these cases in Cymon's new alignment wrappers.  These worked OK when
used with set_parameter("kimura"), the value is omitted and defaults
to None.  Using the current patch, to set this via the keyword
argument or property, it must explicitly be set to None, which is
ugly:

>>> from Bio.Align.Applications import ClustalwCommandline
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=None, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura

For these "switch" arguments, perhaps the value should be interpreted
as a boolean (should the switch be added or not?).  This would be a
change to the current API, but I don't think any of the existing
wrappers actually have this kind of parameter, so there shouldn't be a
backwards compatibility issue here.  Instead I want to do this:

>>> from Bio.Align.Applications import ClustalwCommandline
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=True, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=False, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
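
The proposed boolean handling could be sketched as below (current
Python 3). build_command and the CLUSTALW table are hypothetical helpers
for illustration, not the wrapper implementation; the table just records,
in output order, which parameters are switches.

```python
def build_command(cmd, declared, **values):
    """Render a command string from (name, is_switch) declarations."""
    unknown = set(values) - {name for name, _ in declared}
    if unknown:
        raise ValueError("Unknown parameters: %s" % ", ".join(sorted(unknown)))
    parts = [cmd]
    for name, is_switch in declared:
        value = values.get(name)
        if is_switch:
            # A switch is emitted only when its value is true.
            if value:
                parts.append("-%s" % name)
        elif value is not None:
            # Ordinary options are emitted whenever a value was given,
            # including falsy values such as 0.
            parts.append("-%s=%s" % (name, value))
    return " ".join(parts)


CLUSTALW = [("infile", False), ("gapopen", False),
            ("gapext", False), ("kimura", True)]

for use_kimura in (True, False):
    print(build_command("clustalw", CLUSTALW, gapopen=2, gapext=0.5,
                        kimura=use_kimura, infile="demo.fasta"))
```

Treating only declared switches as booleans leaves ordinary options
untouched, so a value of 0 for gapopen is still passed through.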

An example use case is to allow parameter searches e.g.

from Bio.Align.Applications import ClustalwCommandline
for gap_open in [0, 1, 2, 10] :
    for gap_extend in [0, 0.25, 0.5] :
        for use_kimura in [True, False] :
            #Won't work yet!:
            cline = ClustalwCommandline(gapopen=gap_open,
                                        gapext=gap_extend,
                                        kimura=use_kimura,
                                        infile="demo.fasta")
            print cline

Or, modifying and reusing a single command line wrapper object:

from Bio.Align.Applications import ClustalwCommandline
#Set standard options:
cline = ClustalwCommandline(infile="demo.fasta")
#Do parameter sweep:
for gap_open in [0, 1, 2, 10] :
    cline.gapopen = gap_open
    for gap_extend in [0, 0.25, 0.5] :
        cline.gapext = gap_extend
        for use_kimura in [True, False] :
            cline.kimura = use_kimura #Won't work yet!
            print cline


Peter

From bugzilla-daemon at portal.open-bio.org  Mon May  4 11:29:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 11:29:33 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041529.n44FTXr9025530@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


------- Comment #7 from cymon.cox at gmail.com  2009-05-04 11:29 EST -------
(In reply to comment #6)
> Created an attachment (id=1291)
>  --> (http://bugzilla.open-bio.org/attachment.cgi?id=1291&action=view)
> Patch to add keyword arguments, properties and __repr__ to command line
> wrappers
> 
> As in previous patch but with support for clearing parameters by "deleting" the
> property, and some basic doctests in Bio.Application.
> 
> Still need to co-ordinate with Cymon to give the Muscle wrapper a valid python
> identifier as an alias for the -in argument, e.g. "input", because we can't use
> just "in" as a property or keyword argument.

"input" for -in and maybe also "input1" "input2" as alternatives for -in1 -in2,
might be the way to go, and document it.

C. 



From mjldehoon at yahoo.com  Mon May  4 11:25:17 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 4 May 2009 08:25:17 -0700 (PDT)
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>
Message-ID: <3493.66471.qm@web62406.mail.re1.yahoo.com>


--- On Mon, 5/4/09, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > My lean is towards ElementTree for reasons of code
> clarity. SAX
> > parsers require a lot of boilerplate style code. They
> also can be
> > tricky with nested elements; I always find myself
> using a lot of "if
> > in_tag; else if in_tag" style code. ElementTree
> eliminates a lot of
> > these issues which should result in easier to maintain
> code.

This is partially true. SAX parsers can be complicated, but with some dedication reasonably clear code is also possible. The SAX parser in Bio.Entrez is not all that bad, and it can handle all kinds of different XML pages as long as a DTD is available. The prime motivation for ElementTree is that it's mutable; I don't know if that is really needed in this case. Another thing to consider is what to do with the result returned by ElementTree. Whereas it will contain all the information in the XML file, it may not represent it in a user-friendly way. You may want to take the output from ElementTree and store it in a more biopython-like object. Also keep in mind memory usage: ElementTree will keep the complete XML file in memory, whereas the SAX parser gives you more flexibility here (see below).

That said, I don't have any fundamental objections against using ElementTree.

> 
> We have been trying to avoid external library dependencies
> where
> possible (moving away from Martel for parsing has really
> helped here).
> Given ElementTree and cElementTree are included with Python
> 2.5+,
> this is only an issue for Biopython running on Python 2.4. 

I think it's OK to require Python 2.5 or later for Biopython.

> P.S. I wonder if our BLAST XML parser would get a big speed
> boost if we switched it to ElementTree instead of xml.sax?

I doubt it, since the SAX parser is pretty straightforward -- the hard part is to go through the DTD and find out how to interpret each element in the XML (this is not time-consuming though). The key point though is memory usage. With the SAX parser, you can parse the XML file in chunks, and use an iterator to return individual Blast records -- you don't need to keep the full XML file in memory. The Blast parser NCBIXML.parse does exactly that. With ElementTree, as far as I understand you read in the full XML file and keep it in memory.

--Michiel.


From cy at cymon.org  Mon May  4 11:34:52 2009
From: cy at cymon.org (Cymon Cox)
Date: Mon, 4 May 2009 16:34:52 +0100
Subject: [Biopython-dev] Switches in the Bio.Application interface
In-Reply-To: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com>
References: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com>
Message-ID: <7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com>

2009/5/4 Peter <biopython at maubp.freeserve.co.uk>

> On Mon, May 4, 2009 at 3:48 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
>
> That does seem to work and covers most situations, however there is a
> special case of command line "switches" (arguments which don't take an
> argument, like -kimura in ClustalW, or -l in ls).  There are a lot of
> these cases in Cymon's new alignment wrappers.  These worked OK when
> used with set_parameter("kimura"), the value is omitted and defaults
> to None.  Using the current patch, to set this via the keyword
> argument or property, it must explicitly be set to None, which is
> ugly:
>
> >>> from Bio.Align.Applications import ClustalwCommandline
> >>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta")
> clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
> >>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=None,
> infile="demo.fasta")
> clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura


Ugly, and very confusing.


> For these "switch" arguments, perhaps the value should be interpreted
> as a boolean (should the switch be added or not?).


This is what i did in my Muscle helper functions - so makes sense to me...
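
A minimal sketch of the boolean-switch idea, for illustration only. The names Switch and build_command are hypothetical and this is not the code in Bio.Application; it only shows a switch that adds its name to the command line when its value is true and adds nothing otherwise.

```python
class Switch(object):
    """Hypothetical stand-in for a command line switch such as -kimura."""

    def __init__(self, name):
        self.name = name
        self.is_set = False  # interpreted as a boolean: add the switch or not

    def __str__(self):
        # A switch contributes its name when set, and nothing otherwise.
        return "-%s" % self.name if self.is_set else ""


def build_command(program, options, switches):
    """Join -key=value options and any enabled switches into one string."""
    parts = [program]
    for key, value in options:
        parts.append("-%s=%s" % (key, value))
    for switch in switches:
        text = str(switch)
        if text:
            parts.append(text)
    return " ".join(parts)


kimura = Switch("kimura")
kimura.is_set = True
command = build_command(
    "clustalw", [("infile", "demo.fasta"), ("gapopen", 2)], [kimura]
)
print(command)
```

With this approach the caller writes True or False rather than passing None to mean "present".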

C.
-- 
____________________________________________________________________

Cymon J. Cox

Centro de Ciencias do Mar
Faculdade de Ciencias do Mar e Ambiente (FCMA)
Universidade do Algarve
Campus de Gambelas
8005-139 Faro
Portugal

Phone: +0351 289800909 ext 7909
Fax: +0351 289800051
Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com
HomePage : http://biology.duke.edu/bryology/cymon.html
-8.63/-6.77

From p.j.a.cock at googlemail.com  Mon May  4 11:45:12 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 May 2009 16:45:12 +0100
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <3493.66471.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>
	<3493.66471.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com>

Brad wrote:
>>> My lean[ing] is towards ElementTree for reasons of code
>>> clarity. SAX parsers require a lot of boilerplate style code.
>>> They also can be tricky with nested elements; I always
>>> find myself using a lot of "if in_tag; else if in_tag" style
>>> code. ElementTree eliminates a lot of these issues
>>> which should result in easier to maintain code.

Michiel wrote:
> This is partially true. SAX parsers can be complicated, but
> with some dedication reasonably clear code is also possible.
> The SAX parser in Bio.Entrez is not all that bad, and it can
> handle all kinds of different XML pages as long as a DTD
> is available. The prime motivation for ElementTree is that
> it's mutable; I don't know if that is really needed in this case.

Eric will have to answer that regarding PhyloXML, but if the
aim is to turn it into one of our existing tree objects, then
having the XML structure mutable is irrelevant.

> Another thing to consider is what to do with the result
> returned by ElementTree. Whereas it will contain all the
> information in the XML file, it may not represent it in a
> user-friendly way. You may want to take the output from
> ElementTree and store it in a more biopython-like object.
> Also keep in mind memory usage: ElementTree will keep
> the complete XML file in memory, whereas the SAX
> parser gives you more flexibility here (see below).

Something for Eric to consider.

Michiel wrote:
> That said, I don't have any fundamental objections
> against using ElementTree.

Peter wrote:
>> We have been trying to avoid external library dependencies
>> where possible (moving away from Martel for parsing has
>> really helped here). Given ElementTree and cElementTree
>> are included with Python 2.5+, this is only an issue for
>> Biopython running on Python 2.4.
>
> I think it's OK to require Python 2.5 or later for Biopython.

At this stage I disagree, Python 2.4 would still be widely
used on production servers running stable distributions.
Also we'd have to give a couple of releases notice about
dropping Python 2.4 support.  In any case, if we want to
use ElementTree with Python 2.4 this is possible.

Peter wrote:
>> P.S. I wonder if our BLAST XML parser would get a big speed
>> boost if we switched it to ElementTree instead of xml.sax?
>
> I doubt it, since the SAX parser is pretty straightforward --
> the hard part is to go through the DTD and find out how to
> interpret each element in the XML (this is not
> time-consuming though). The key point though is memory
> usage. With the SAX parser, you can parse the XML file in
> chunks, and use an iterator to return individual Blast records
> -- you don't need to keep the full XML file in memory. The
> Blast parser NCBIXML.parse does exactly that. With
> ElementTree, as far as I understand you read in the full
> XML file and keep it in memory.

Keeping a full BLAST XML file in memory would be a bad idea,
and would spoil the memory savings of the iterator approach
to parsing it.  So ElementTree isn't suitable for everything ;)

Peter

From biopython at maubp.freeserve.co.uk  Mon May  4 11:47:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 4 May 2009 16:47:58 +0100
Subject: [Biopython-dev] Switches in the Bio.Application interface
In-Reply-To: <7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com>
References: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com>
	<7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com>
Message-ID: <320fb6e00905040847s32bc9e4fr3f7fb045b2d3429b@mail.gmail.com>

On Mon, May 4, 2009 at 4:34 PM, Cymon Cox <cy at cymon.org> wrote:
>
>> For these "switch" arguments, perhaps the value should be interpreted
>> as a boolean (should the switch be added or not?).
>
> This is what i did in my Muscle helper functions - so makes sense to me...
>

Good :)

Peter

From bugzilla-daemon at portal.open-bio.org  Mon May  4 12:29:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 12:29:10 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041629.n44GTAeq030521@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1291 is|0                           |1
           obsolete|                            |


------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 12:29 EST -------
(From update of attachment 1291)
Checked into CVS:

Checking in Tests/test_Prank_tool.py;
/home/repository/biopython/biopython/Tests/test_Prank_tool.py,v  <-- 
test_Prank_tool.py
new revision: 1.5; previous revision: 1.4
done
Checking in Tests/test_Muscle_tool.py;
/home/repository/biopython/biopython/Tests/test_Muscle_tool.py,v  <-- 
test_Muscle_tool.py
new revision: 1.7; previous revision: 1.6
done
Checking in Tests/test_Emboss.py;
/home/repository/biopython/biopython/Tests/test_Emboss.py,v  <-- 
test_Emboss.py
new revision: 1.20; previous revision: 1.19
done
Checking in Tests/test_Clustalw_tool.py;
/home/repository/biopython/biopython/Tests/test_Clustalw_tool.py,v  <-- 
test_Clustalw_tool.py
new revision: 1.13; previous revision: 1.12
done
Checking in Bio/Application/__init__.py;
/home/repository/biopython/biopython/Bio/Application/__init__.py,v  <-- 
__init__.py
new revision: 1.15; previous revision: 1.14
done
Checking in Bio/Emboss/Applications.py;
/home/repository/biopython/biopython/Bio/Emboss/Applications.py,v  <-- 
Applications.py
new revision: 1.23; previous revision: 1.22
done
Checking in Bio/AlignAce/Applications.py;
/home/repository/biopython/biopython/Bio/AlignAce/Applications.py,v  <-- 
Applications.py
new revision: 1.5; previous revision: 1.4
done
Checking in Bio/Motif/Applications/_AlignAce.py;
/home/repository/biopython/biopython/Bio/Motif/Applications/_AlignAce.py,v  <--
 _AlignAce.py
new revision: 1.3; previous revision: 1.2
done
Checking in Bio/Align/Applications/_Clustalw.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Clustalw.py,v  <--
 _Clustalw.py
new revision: 1.5; previous revision: 1.4
done
Checking in Bio/Align/Applications/_Mafft.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Mafft.py,v  <-- 
_Mafft.py
new revision: 1.4; previous revision: 1.3
done
Checking in Bio/Align/Applications/_Muscle.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Muscle.py,v  <-- 
_Muscle.py
new revision: 1.6; previous revision: 1.5
done
Checking in Bio/Align/Applications/_Prank.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Prank.py,v  <-- 
_Prank.py
new revision: 1.4; previous revision: 1.3
done

(In reply to comment #7)
> (In reply to comment #6)
> > Still need to co-ordinate with Cymon to give the Muscle wrapper a valid
> > python identifier as an alias for the -in argument, e.g. "input", because
> > we can't use just "in" as a property or keyword argument.
> 
> "input" for -in and maybe also "input1" "input2" as alternatives for -in1
> -in2, might be the way to go, and document it.

I've used "input" as the preferred alias for "-in".

Leaving this bug open to cover dealing with "switch" arguments like -kimura in
clustalw, where it makes sense to treat the value as a boolean (see dev mailing
list).

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Mon May  4 13:48:28 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 13:48:28 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041748.n44HmSaN003712@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #23 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 13:48 EST -------
In Prank, should realbranches take no arguments?  i.e. use the new _Switch
class?



From bugzilla-daemon at portal.open-bio.org  Mon May  4 13:49:20 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 13:49:20 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041749.n44HnK8j003766@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 13:49 EST -------
(In reply to comment #8)
> Leaving this bug open to cover dealing with "switch" arguments like -kimura in
> clustalw, where it makes sense to treat the value as a boolean (see dev mailing
> list).

Done in CVS, I think.  Next, more tests and documentation...

Checking in Bio/Application/__init__.py;
/home/repository/biopython/biopython/Bio/Application/__init__.py,v  <-- 
__init__.py
new revision: 1.16; previous revision: 1.15
done
Checking in Bio/Align/Applications/_Clustalw.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Clustalw.py,v  <--
 _Clustalw.py
new revision: 1.6; previous revision: 1.5
done
Checking in Bio/Align/Applications/_Mafft.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Mafft.py,v  <-- 
_Mafft.py
new revision: 1.5; previous revision: 1.4
done
Checking in Bio/Align/Applications/_Muscle.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Muscle.py,v  <-- 
_Muscle.py
new revision: 1.7; previous revision: 1.6
done
Checking in Bio/Align/Applications/_Prank.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Prank.py,v  <-- 
_Prank.py
new revision: 1.5; previous revision: 1.4
done
Checking in Tests/test_Clustalw_tool.py;
/home/repository/biopython/biopython/Tests/test_Clustalw_tool.py,v  <-- 
test_Clustalw_tool.py
new revision: 1.14; previous revision: 1.13
done
Checking in Tests/test_Muscle_tool.py;
/home/repository/biopython/biopython/Tests/test_Muscle_tool.py,v  <-- 
test_Muscle_tool.py
new revision: 1.8; previous revision: 1.7
done
Checking in Tests/test_Prank_tool.py;
/home/repository/biopython/biopython/Tests/test_Prank_tool.py,v  <-- 
test_Prank_tool.py
new revision: 1.6; previous revision: 1.5
done



From bugzilla-daemon at portal.open-bio.org  Tue May  5 08:04:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 5 May 2009 08:04:09 -0400
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To: <bug-2294-42@http.bugzilla.open-bio.org/>
Message-ID: <200905051204.n45C4987022142@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294


------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-05 08:04 EST -------
Created an attachment (id=1292)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1292&action=view)
Patch to Bio/SeqIO/InsdcIO.py to write GenBank features

This patch adds basic support for writing features in GenBank files.

There is still plenty to do:
* Full testing, both manual and with extended unit test coverage
* Wrapping long feature locations
* Writing references
* Extending to cover writing EMBL files

Note that this requires the latest Bio.GenBank code from CVS, as during this
work I found and fixed two small issues with the location parsing.



From chapmanb at 50mail.com  Tue May  5 08:36:57 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 5 May 2009 08:36:57 -0400
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
Message-ID: <20090505123656.GB15113@sobchak.mgh.harvard.edu>

Hi Peter;
Nice to have you back. Hope you had a relaxing few days away.

> I updated the patch on Bug 2822 to cover all the Bio.Application
> command line wrapper subclasses, and included __repr__ support.
> However, that has raised a real example of a parameter where the
> current "human readable" name is not a valid python identifier ("in",
> for "-in" in Muscle).  I think the pragmatic solution is to add a
> sensible alternative which we can use for the property and keyword
> argument name (e.g. "input" in this case) while in general keeping
> these names as close as possible to the actual parameter name as used
> at the command line.

Agreed. This is the best solution for these few conflicting cases.

> On the other hand, some might argue for giving all the options
> meaningful names.  The (hardly used) existing blastall wrapper in
> Bio/Blast/Applications.py gives the "-a" argument a human readable
> name of "nprocessors", and "-A" gets "window_size". With the old
> set_parameter call either alias could be used.  However, with a python
> property we need to pick one as a preferred name - and I'm not 100%
> sure being helpful and using "nprocessors" (e.g. cline.nprocessors=4)
> is actually better than using the actual argument name (e.g. cline.a =
> 4).

Could we support both the original argument and optional human
readable arguments? I know the code in Application is a bit
hard coded for the first argument as the real name and the last
argument as the readable name; the cleanest solution would be to
generalize this to have multiple names where it makes sense.

More practically, it always makes sense to have the low level
standard arguments from the program itself. Even if it is
non-intuitive like BLASTs switches, people who already understand
the program can just use their existing knowledge without any
specific knowledge of Biopython. Where someone wants to
support more useful names, they can add those in.

You have been digging around in this so probably have a good idea
how hard this is to implement practically. If it's a pain, I'd argue
to just have the original arguments now, and the useful names can go
on a todo list.

Brad

From chapmanb at 50mail.com  Tue May  5 08:50:59 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 5 May 2009 08:50:59 -0400
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com>
References: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>
	<3493.66471.qm@web62406.mail.re1.yahoo.com>
	<320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com>
Message-ID: <20090505125058.GC15113@sobchak.mgh.harvard.edu>

Peter, Michiel and Eric;

> > Another thing to consider is what to do with the result
> > returned by ElementTree. Whereas it will contain all the
> > information in the XML file, it may not represent it in a
> > user-friendly way. You may want to take the output from
> > ElementTree and store it in a more biopython-like object.

Agreed. Most of the fun creative parts of the project, as opposed to
the parsing nuts and bolts, will be in developing the object
representations.

> > Also keep in mind memory usage: ElementTree will keep
> > the complete XML file in memory, whereas the SAX
> > parser gives you more flexibility here (see below).

ElementTree can do incremental parsing, so you can also deal with
large files using it:

http://effbot.org/zone/element-iterparse.htm
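
As a rough illustration of the incremental approach, here is a small sketch using iterparse from the standard library's xml.etree.ElementTree (Python 2.5+). The XML layout and the iter_records helper are made up for the example and are not taken from any Biopython parser.

```python
import io
import xml.etree.ElementTree as ET

# Made-up input; a real file would be opened from disk instead.
XML = b"""<records>
  <record id="r1"><length>10</length></record>
  <record id="r2"><length>25</length></record>
</records>"""


def iter_records(handle):
    """Yield (id, length) for each <record> as soon as it has been read."""
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "record":
            yield elem.get("id"), int(elem.findtext("length"))
            # Discard the parsed children of this record once reported.
            elem.clear()


records = list(iter_records(io.BytesIO(XML)))
print(records)
```

The caller sees one record at a time, much like an iterator built on a SAX parser. Note that the emptied elements remain attached to the root, so very large files may need extra housekeeping.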

Brad

From biopython at maubp.freeserve.co.uk  Tue May  5 09:58:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 May 2009 14:58:04 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <20090505123656.GB15113@sobchak.mgh.harvard.edu>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
	<20090505123656.GB15113@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905050658h2cabf55dhfbb467042135843a@mail.gmail.com>

On Tue, May 5, 2009 at 1:36 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Could we support both the original argument and optional human
> readable arguments? I know the code in Application is a bit
> hard coded for the first argument as the real name and the last
> argument as the readable name; the cleanest solution would be to
> generalize this to have multiple names where it makes sense.

You mean for these BLAST examples, create two properties "a" and
"nprocessors", both controlling the "-a" parameter, and also two
properties "A" and "window_size" both controlling "-A"?  From a code
point of view, this would be moderately straightforward - but I'm not
convinced about this.

> More practically, it always makes sense to have the low level
> standard arguments from the program itself. Even if it is
> non-intuitive like BLASTs switches, people who already understand
> the program can just use their existing knowledge without any
> specific knowledge of Biopython.

Yes :)

Personally I initially found it very frustrating when using the
Bio.Blast.NCBIStandalone.blastall wrapper because the NCBI switches
had all been given friendly names, and it wasn't clear without looking
at the source code what mapped to what.  As a minor change, I think
the Bio.Blast.NCBIStandalone.blastall docstring should actually
include the real NCBI switch used by each Biopython keyword.

> Where someone wants to support more useful names, they can
> add those in.

So that we cater to those familiar with the NCBI command line
arguments, but also give a more human alternative?  On the downside,
it means there are two ways to set these parameters.  Also, if we go
down this route for consistency for all command line wrappers we may
want to invent more human readable aliases (if the tool arguments are
too cryptic).  We are also opening up a potential problem if the tool
later adds a new argument whose name clashes with one of our
inventions.  Also would we care about the lack of consistency between
tools (e.g. infile versus input?), and should we try and be consistent
in our new names?

I favour using only a single property for each parameter, with the
name as similar as possible to the actual command line switch (i.e.
property name "a" for "-a", not "nprocessors").  Note each property
would have a docstring which will say what it is for ("Number of
processors to use.").

In the case of the existing blastall wrapper in
Bio.Blast.Applications, I would change names=["-a", "nprocessors"]
to ["-a", "nprocessors", "a"], meaning "a" (last entry) would be the
property name used, "-a" (first entry) would be used for the actual
command line string.  I would keep the "nprocessors" alias for
backwards compatibility only - all three aliases would be available to
the (legacy) method set_parameter.
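
To make the naming convention concrete, here is a small sketch. The Option class and this set_parameter function are hypothetical stand-ins written for this example, not the real Bio.Application classes; they only assume what is described above, that the first name is the command line string, the last is the property name, and any alias is accepted by the legacy setter.

```python
class Option(object):
    """Hypothetical option record holding a list of alias names."""

    def __init__(self, names, description):
        self.names = names
        self.description = description
        self.value = None

    @property
    def switch(self):
        # First entry: the string used on the actual command line.
        return self.names[0]

    @property
    def property_name(self):
        # Last entry: the name exposed as a property or keyword argument.
        return self.names[-1]


def set_parameter(options, name, value):
    """Legacy-style setter: accept any alias listed for an option."""
    for option in options:
        if name in option.names:
            option.value = value
            return
    raise ValueError("Option name %s was not found." % name)


nproc = Option(["-a", "nprocessors", "a"], "Number of processors to use.")
set_parameter([nproc], "nprocessors", 4)
print(nproc.switch, nproc.property_name, nproc.value)
```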

> You have been digging around in this so probably have a good idea
> how hard this is to implement practically. If it's a pain, I'd argue
> to just have the original arguments now, and the useful names can go
> on a todo list.

It is certainly possible, although probably a bit tedious due to
changing the "boilerplate" code.

Peter

From bugzilla-daemon at portal.open-bio.org  Tue May  5 10:37:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 5 May 2009 10:37:56 -0400
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To: <bug-2294-42@http.bugzilla.open-bio.org/>
Message-ID: <200905051437.n45EbuNA006427@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1292 is|0                           |1
           obsolete|                            |


------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-05 10:37 EST -------
(From update of attachment 1292)
Checked into CVS now.



From p.j.a.cock at googlemail.com  Tue May  5 11:26:20 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 May 2009 16:26:20 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was
	Bio.GFF)
In-Reply-To: <320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
	<320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>
Message-ID: <320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com>

On Tue, Apr 21, 2009 at 2:55 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> I have also been thinking about how I would (re)design the SeqFeature
>> and FeatureLocation objects.  In particular I would want to put the
>> strand as part of the same object as the location, and also any
>> join-locations.  I would still want to cope with fuzzy locations, but
>> make the non-fuzzy approximations more prominent in comparison.  Also,
>> I really don't like the way joins are currently stored as more
>> SeqFeatures in the sub_features list (plus this kind of blocks
>> alternative usage for child/parent nesting that might be nice for GFF
>> files).
>>
>> The prime use case to keep in mind is taking a feature location (even
>> a join), and using this to extract that region of nucleotides from the
>> parent sequence (i.e. a Seq object or a SeqRecord object, as now both
>> can be sliced).

I've written code to do this in test_SeqIO_features.py, which cross
checks the nucleotides pulled out from a GenBank file based on the
SeqFeature, against what the NCBI provide in FASTA format.  This seems
to work OK, but has not been tested extensively (e.g. running it on
drosophila or arabidopsis would be good).

It could make sense to expose this functionality directly in
Biopython, maybe as a method of the SeqRecord taking a SeqFeature (or
the index of a feature in that record), returning a Seq object (or
perhaps a SeqRecord using the feature's annotation).

e.g.
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"),"genbank")
>>> record.extract_feature_seq(6)
Seq('GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAG...TAA',
IUPACAmbiguousDNA())
>>> feature = record.features[6]
>>> record.extract_feature_seq(feature)
Seq('GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAG...TAA',
IUPACAmbiguousDNA())

Alternatively, rather than introducing a new method (e.g.
"extract_feature_seq" as in the above example) we could overload the
__getitem__ method of the SeqRecord, i.e. overloading the slice
mechanism so a SeqFeature can alternatively be given, e.g.
record[feature].  Note that passing the index of a feature wouldn't
work as record[6] currently means the seventh letter, rather than the
seventh feature.

Note that just passing a SeqFeature's FeatureLocation is not enough,
as this lacks the strand information, and also any sub-features and
associated location operator (i.e. join).
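
To show why the strand and the sub-parts both matter, here is a plain Python sketch that works on strings rather than Seq or SeqFeature objects. The extract function and its (start, end, strand) tuples are assumptions made for this example; it is not the code in test_SeqIO_features.py, and it assumes the parts are listed in the order they should be read.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extract(parent, parts):
    """Concatenate the regions in parts, given as (start, end, strand) tuples.

    Coordinates are Python-style (zero-based, end exclusive); strand is
    +1 or -1. A single-part list is a simple location, several parts act
    like a join.
    """
    pieces = []
    for start, end, strand in parts:
        piece = parent[start:end]
        if strand == -1:
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


parent = "ATGCCGTA"
simple = extract(parent, [(0, 3, +1)])
joined = extract(parent, [(0, 3, +1), (5, 8, +1)])
reverse = extract(parent, [(0, 3, -1)])
print(simple, joined, reverse)
```

The start and end alone give the same slice for the forward and reverse cases, which is why the location by itself is not enough.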

> I forgot to mention the second major use case I'm concerned about,
> which is recovering the GenBank/EMBL style location string.  I have
> looked at this in the past, by adding methods to the FeatureLocation
> and all the Position objects, but it is complicated by the fact the
> Position objects don't know if they are at the start or end (and for
> the start locations we need to add one to convert from Python
> counting).  This is the main block on having Bio.SeqIO support writing
> GenBank (or EMBL) files with their features included.

See Bug 2294 for writing GenBank files:
http://bugzilla.open-bio.org/show_bug.cgi?id=2294
I've just checked in some code to record the features when writing
GenBank files with Bio.SeqIO.  I solved the feature location issue by
introducing a private function which knows about all the currently
used AbstractPosition objects - the code is actually pretty short.
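
For the simplest case the counting conversion looks like the sketch below. The location_string function is hypothetical and is not the private function mentioned above; it assumes exact positions given as plain integers and ignores fuzzy positions and joins.

```python
def location_string(start, end, strand=1):
    """Format a simple exact location in GenBank/EMBL style.

    start and end are Python-style (zero-based, end exclusive), so only
    the start needs one added to give the one-based inclusive form.
    """
    text = "%i..%i" % (start + 1, end)
    if strand == -1:
        text = "complement(%s)" % text
    return text


forward = location_string(0, 9)
reverse = location_string(99, 200, strand=-1)
print(forward, reverse)
```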

Peter


From p.j.a.cock at googlemail.com  Tue May  5 12:41:31 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 May 2009 17:41:31 +0100
Subject: [Biopython-dev] Dropping Python 2.3 support in Biopython
Message-ID: <320fb6e00905050941m37725eb5ibba02ca99236212e@mail.gmail.com>

Hello all,

This is a final warning that the next release of Biopython will not
support Python 2.3.

As far as we are aware, no-one has come forward with a need for
continued support for Python 2.3, so we will soon begin removing the
special case code needed to keep Biopython working on Python 2.3.
This will give us a simpler code base, fewer platforms to test on, and
we can also take advantage of various language features only available
in Python 2.4+ (e.g. generator expressions and decorators).

Any last minute requests to postpone this should be made to the main
Biopython mailing list by Friday 8 May.

Thank you,

Peter

From sbassi at clubdelarazon.org  Tue May  5 18:49:11 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 5 May 2009 19:49:11 -0300
Subject: [Biopython-dev] Missing directories with easy_install?
Message-ID: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com>

When I install Biopython 1.5 (and previous versions too) using
easy_install, it seems that docs, test and scripts directories are not
installed (see here for a screenshot, panel at left is easy_install
product while right panel is when I manually uncompress biopython
tarball: http://www.genesdigitales.com/bioinfo/biopy.jpg).
Is this expected or an oversight?

From biopython at maubp.freeserve.co.uk  Tue May  5 18:56:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 May 2009 23:56:00 +0100
Subject: [Biopython-dev] Missing directories with easy_install?
In-Reply-To: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com>
References: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com>
Message-ID: <320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com>

On Tue, May 5, 2009 at 11:49 PM, Sebastian Bassi
<sbassi at clubdelarazon.org> wrote:
> When I install Biopython 1.5 (and previous versions too) using
> easy_install, it seems that docs, test and scripts directories are not
> installed (see here for a screenshot, panel at left is easy_install
> product while right panel is when I manually uncompress biopython
> tarball: http://www.genesdigitales.com/bioinfo/biopy.jpg).
> Is this expected or an oversight?

You'd have to ask Brad for an expert opinion, but I think this is
probably to be expected.  If you install from source, the only folders
copied to site-packages are Bio, BioSQL, and Martel.

See also this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005924.html

Peter

P.S. I assume you meant Biopython 1.50 and not 1.5 ;)

From sbassi at clubdelarazon.org  Tue May  5 19:05:46 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 5 May 2009 20:05:46 -0300
Subject: [Biopython-dev] Missing directories with easy_install?
In-Reply-To: <320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com>
References: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com>
	<320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com>
Message-ID: <9e2f512b0905051605k663035d7td84372847675c7d4@mail.gmail.com>

On Tue, May 5, 2009 at 7:56 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> You'd have to ask Brad for an expert opinion, but I think this is
> probably to be expected.  If you install from source, the only folders
> copied to site-packages are Bio, BioSQL, and Martel.
> See also this thread:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005924.html

OK, so that is.

> P.S. I assume you meant Biopython 1.50 and not 1.5 ;)

yes!.

From biopython at maubp.freeserve.co.uk  Tue May  5 19:33:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 May 2009 00:33:16 +0100
Subject: [Biopython-dev] SeqRecord per-letter-annotation : avoid lists?
Message-ID: <320fb6e00905051633i70604746i332b3bfaf3476876@mail.gmail.com>

Hi all,

I was thinking about the SeqRecord object's letter_annotations,
and that perhaps we should only allow strings and tuples (which are
immutable), but not lists.  Because lists are mutable, the user can
(accidentally) alter the list such that its length doesn't match that
of the associated sequence (which would be bad). Currently we do use
lists in the SeqRecord's letter_annotations, e.g. for qualities. I
don't recall having any particular reason for using a list rather than
a tuple.
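
To illustrate the concern, here is a minimal sketch using plain Python
objects rather than a real SeqRecord (all variable names are made up for
the example). A list of qualities can drift out of step with the sequence,
whereas a tuple cannot be altered in place:

```python
# Plain-Python illustration; these names are hypothetical, not Biopython API.
sequence = "ACGT"

# A list can be changed in place, so its length can silently stop matching.
qualities_list = [30, 31, 32, 33]
qualities_list.append(40)
list_still_matches = len(qualities_list) == len(sequence)

# A tuple has no in-place mutation methods, so the same mistake fails loudly.
qualities_tuple = (30, 31, 32, 33)
try:
    qualities_tuple.append(40)
    tuple_was_modified = True
except AttributeError:
    tuple_was_modified = False
tuple_still_matches = len(qualities_tuple) == len(sequence)
```

Note that a tuple would not stop a user from replacing the whole entry with
something of the wrong length; that would still need a check on assignment.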

Any thoughts on this?

Peter

From p.j.a.cock at googlemail.com  Wed May  6 06:32:01 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 6 May 2009 11:32:01 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was
	Bio.GFF)
In-Reply-To: <320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
	<320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>
	<320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com>
Message-ID: <320fb6e00905060332t2b9d9595pca68b83db8cef28f@mail.gmail.com>

On Tue, May 5, 2009 at 4:26 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Apr 21, 2009 at 2:55 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> The prime use case to keep in mind is taking a feature location (even
>>> a join), and using this to extract that region of nucleotides from the
>>> parent sequence (i.e. a Seq object or a SeqRecord object, as now both
>>> can be sliced).
>
> I've written code to do this in test_SeqIO_features.py, which cross
> checks the nucleotides pulled out from a GenBank file based on the
> SeqFeature, against what the NCBI provide in FASTA format.  This seems
> to work OK, but has not been tested extensively (e.g. running it on
> drosophila or arabidopsis would be good).

Yep - found a corner case my code can't yet cope with, from the
Arabidopsis thaliana chloroplasts (NC_000932).  This has some
pathological mixed strand locations, like
join(complement(69611..69724),139856..140650) which is for a
trans-spliced ribosomal protein.
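
As a rough sketch of what a mixed strand join involves, the following
plain-Python example pulls out each part in the order listed, reverse
complementing only the parts marked as complement. It is not the code in
test_SeqIO_features.py; the function names and the (start, end, strand)
tuple layout are assumptions made for illustration, using GenBank style
1-based inclusive coordinates:

```python
# Hypothetical helpers, not Biopython API.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_join(parent, parts):
    """Concatenate sub-regions of parent in the order given.

    parts is a list of (start, end, strand) tuples with 1-based inclusive
    coordinates; strand is +1 or -1 (-1 meaning complement(...)).
    """
    pieces = []
    for start, end, strand in parts:
        piece = parent[start - 1:end]
        if strand == -1:
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


# Toy analogue of join(complement(1..4),7..10) on a ten base parent.
parent = "AACCGGTTAC"
result = extract_join(parent, [(1, 4, -1), (7, 10, +1)])
print(result)  # GGTTTTAC
```

Handling each part separately is the point here: reverse complementing the
whole joined region would only be correct when every part is on the same
strand.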

> It could make sense to expose this functionality directly in
> Biopython, ...

Given this code is non-trivial to implement, this seems worth doing.

Peter


From bugzilla-daemon at portal.open-bio.org  Wed May  6 18:50:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 6 May 2009 18:50:08 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905062250.n46Mo8EM023616@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


------- Comment #9 from eric.talevich at gmail.com  2009-05-06 18:50 EST -------
Created an attachment (id=1293)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1293&action=view)
Additional warnings test for Py2.6+

This is the file that test_PDB_unit.py can import to plug in an additional test
for specific warnings.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Wed May  6 18:54:06 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 6 May 2009 18:54:06 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905062254.n46Ms6YP023831@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


------- Comment #10 from eric.talevich at gmail.com  2009-05-06 18:54 EST -------
Created an attachment (id=1294)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1294&action=view)
test_PDB_unit.py, with conditional import

This is a modified test_PDB_unit.py that checks whether the necessary context
manager is available (it will be for Py2.6+), and if so, imports the additional
unit test from _PDB_extra.py into the current class.

(Sorry it's a whole file, I was having trouble diffing between git branches.)
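
The conditional approach described above might look roughly like the
following sketch. It is condensed into a single file for brevity (the
attachment keeps the extra test in a separate module), and the class and
function names are hypothetical:

```python
import unittest
import warnings


class StructureTests(unittest.TestCase):
    """Hypothetical test case standing in for the real PDB tests."""

    def test_always_available(self):
        self.assertEqual(1 + 1, 2)


def _check_warning_is_raised(self):
    # Relies on warnings.catch_warnings, which first appeared in Python 2.6.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        warnings.warn("example parser warning", UserWarning)
    self.assertEqual(len(caught), 1)


# Only plug the extra test into the class when the context manager exists.
if hasattr(warnings, "catch_warnings"):
    StructureTests.test_warning_is_raised = _check_warning_is_raised

suite = unittest.defaultTestLoader.loadTestsFromTestCase(StructureTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

On an interpreter without catch_warnings the class simply has one test
fewer, so the rest of the suite still runs.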



From bugzilla-daemon at portal.open-bio.org  Thu May  7 04:51:35 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 04:51:35 -0400
Subject: [Biopython-dev] [Bug 2824] New: Bio.Entrez.epost is using an HTTP
	GET not an HTTP POST
Message-ID: <bug-2824-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2824

           Summary: Bio.Entrez.epost is using an HTTP GET not an HTTP POST
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


Following from a query on our mailing list suggesting Bio.Entrez.epost is
failing with long ID lists, I looked a little more closely at the code and it
is actually using an HTTP GET instead of an HTTP POST (which would avoid the
long URL problem).

See:
http://lists.open-bio.org/pipermail/biopython/2009-May/005149.html

We can still use urllib to do this with its data argument...
http://docs.python.org/library/urllib.html
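
As an illustration of the difference, here is a small sketch of building a
GET and a POST request for the same parameters. It uses the Python 3
urllib.request and urllib.parse modules rather than the Python 2 urllib
discussed in this report, the endpoint URL is an assumption, and nothing is
actually sent over the network:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint for illustration; no request is actually sent here.
EPOST_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"

params = {"db": "pubmed", "id": ",".join(str(i) for i in range(1, 10000))}
encoded = urlencode(params)

# GET: the parameters become part of the URL, which can grow too long.
get_request = Request(EPOST_URL + "?" + encoded)

# POST: the same parameters travel in the request body instead.
post_request = Request(EPOST_URL, data=encoded.encode("ascii"))

get_url_length = len(get_request.full_url)
post_url_length = len(post_request.full_url)
```

Supplying the data argument is what switches the request to a POST, so the
URL itself stays short however many IDs are included.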



From bugzilla-daemon at portal.open-bio.org  Thu May  7 05:18:58 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 05:18:58 -0400
Subject: [Biopython-dev] [Bug 2824] Bio.Entrez.epost is using an HTTP GET
	not an HTTP POST
In-Reply-To: <bug-2824-42@http.bugzilla.open-bio.org/>
Message-ID: <200905070918.n479IwHQ031195@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2824


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-07 05:18 EST -------
Created an attachment (id=1295)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1295&action=view)
Patch for Bio/Entrez/__init__.py

This patch does two things,
(1) Makes Bio.Entrez.epost do an HTTP POST
(2) Catches the too long URL error 414 messages and raises an IOError

Without the patch:

>>> print Entrez.epost("pubmed", id=",".join(str(i) for i in range(1,10000))).read()
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>414 Request-URI Too Large</title>
</head><body>
<h1>Request-URI Too Large</h1>
<p>The requested URL's length exceeds the capacity
limit for this server.<br />
</p>
</body></html>

>>> print Entrez.efetch("pubmed", id=",".join(str(i) for i in range(1,10000)), retmode="text", rettype="uilist").read()
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>414 Request-URI Too Large</title>
</head><body>
<h1>Request-URI Too Large</h1>
<p>The requested URL's length exceeds the capacity
limit for this server.<br />
</p>
</body></html>

Note both the above trigger the Error 414 message, but it does not get caught.

With the patch:

>>> print Entrez.epost("pubmed", id=",".join(str(i) for i in range(1,10000))).read()
<?xml version="1.0"?>
<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN"
"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
<ePostResult>
        <QueryKey>1</QueryKey>
        <WebEnv>NCID_01_264798363_130.14.18.47_9001_1241687667</WebEnv>
</ePostResult>

>>> print Entrez.efetch("pubmed", id=",".join(str(i) for i in range(1,10000)), retmode="text", rettype="uilist").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Entrez/__init__.py", line 126, in efetch
    return _open(cgi, variables)
  File "Bio/Entrez/__init__.py", line 370, in _open
    raise IOError("Requested URL too long (try using EPost?)")
IOError: Requested URL too long (try using EPost?)

Now epost works with long arguments, and using the other tools with too long a
URL will trigger an IOError.
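
The second part of the patch can be pictured with a small sketch like the
one below. This is not the patch itself: check_status is a hypothetical
helper, and it assumes the caller already has the HTTP status code to hand.

```python
def check_status(status_code):
    """Hypothetical helper: turn HTTP 414 into an IOError, pass others on."""
    if status_code == 414:
        raise IOError("Requested URL too long (try using EPost?)")
    return status_code


try:
    check_status(414)
    message = None
except IOError as err:
    message = str(err)
```

Raising an exception means the caller finds out immediately, rather than
later when trying to parse an HTML error page as if it were results.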



From bugzilla-daemon at portal.open-bio.org  Thu May  7 06:20:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 06:20:10 -0400
Subject: [Biopython-dev] [Bug 2824] Bio.Entrez.epost is using an HTTP GET
	not an HTTP POST
In-Reply-To: <bug-2824-42@http.bugzilla.open-bio.org/>
Message-ID: <200905071020.n47AKAGD002826@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2824


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-07 06:20 EST -------
Patch checked in (OK'd with Michiel), marking as fixed.



From bugzilla-daemon at portal.open-bio.org  Thu May  7 09:56:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 09:56:09 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905071356.n47Du9iQ018532@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #24 from cymon.cox at gmail.com  2009-05-07 09:56 EST -------
(In reply to comment #23)
> In Prank, should realbranches take no arguments?  i.e. use the new _Switch
> class?

Yes, verified and done; pushed to applic-int branch.
C.



From bugzilla-daemon at portal.open-bio.org  Thu May  7 10:07:23 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 10:07:23 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905071407.n47E7Nn7019531@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #25 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-07 10:07 EST -------
(In reply to comment #24)
> (In reply to comment #23)
> > In Prank, should realbranches take no arguments?  i.e. use the new _Switch
> > class?
> 
> Yes, verified and done; pushed to applic-int branch.
> C.

Thanks for checking - that's done in CVS now.

I think the final bit of new code is _Dialign.py which still needs to be
updated for the new style __init__ method.  

Then there are your unit tests...



From bugzilla-daemon at portal.open-bio.org  Thu May  7 10:39:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 10:39:40 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905071439.n47Edeaj022126@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #26 from cymon.cox at gmail.com  2009-05-07 10:39 EST -------
(In reply to comment #25)
> (In reply to comment #24)
> > (In reply to comment #23)
> > > In Prank, should realbranches take no arguments?  i.e. use the new _Switch
> > > class?
> > 
> > Yes, verified and done; pushed to applic-int branch.
> > C.
> 
> Thanks for checking - that's done in CVS now.
> 
> I think the final bit of new code is _Dialign.py which still needs to be
> updated for the new style __init__ method.

Done - pushed to applic-int (Note windows path stuff absent from _Dialign)

> Then there are your unit tests...

As they are at present, unittests for Muscle, Mafft, Dialign and Prank all
pass. They could of course be made arbitrarily more complex... they should
probably have at least one test that uses the properties style parameter
setting rather than just set_parameter()
C.



From bugzilla-daemon at portal.open-bio.org  Thu May  7 11:22:35 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 11:22:35 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905071522.n47FMZ16025500@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #27 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-07 11:22 EST -------
(In reply to comment #26)
> > I think the final bit of new code is _Dialign.py which still needs to be
> > updated for the new style __init__ method.
> 
> Done - pushed to applic-int (Note windows path stuff absent from _Dialign)
> 

OK, that is in CVS now.

> > Then there are your unit tests...
> 
> As they are at present, unittests for Muscle, Mafft, Dialign and Prank all
> pass. They could of course be made arbitrarily more complex... they should
> probably have at least one test that uses the properties style parameter
> setting rather than just set_parameter()
> C.

I've added test_Dialign_tool.py to CVS, and then switched a few to using
keyword arguments and properties.  As far as I can see from here, the tool
isn't expected to work on Windows (although it might still be possible with
cygwin):
http://bibiserv.techfak.uni-bielefeld.de/download/tools/DIALIGN_221.html

Is that everything?  You'd mentioned a more general test which just builds the
strings, but doesn't actually need to run any of the tools themselves.
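
A test of that kind, which only builds the command line string, could be
sketched as follows. ToyCommandline is a deliberately tiny stand-in written
for this example and is not the Bio.Application code; it only mimics the
three ways of setting options discussed here (keyword arguments, a
set_parameter() call, and property style assignment), and the "-name=value"
output format is an assumption:

```python
class ToyCommandline(object):
    """Hypothetical stand-in for a command line wrapper; not Biopython API."""

    def __init__(self, cmd, **kwargs):
        # Bypass our own __setattr__ for the two internal attributes.
        object.__setattr__(self, "_cmd", cmd)
        object.__setattr__(self, "_params", {})
        for name, value in kwargs.items():
            self.set_parameter(name, value)

    def set_parameter(self, name, value=True):
        """Record an option; True means a switch that takes no argument."""
        self._params[name] = value

    def __setattr__(self, name, value):
        # Property style: cmd.option = value behaves like set_parameter.
        self.set_parameter(name, value)

    def __str__(self):
        parts = [self._cmd]
        for name, value in self._params.items():
            if value is True:
                parts.append("-%s" % name)
            elif value is not False and value is not None:
                parts.append("-%s=%s" % (name, value))
        return " ".join(parts)


cmd = ToyCommandline("prank", d="in.fasta")  # keyword argument
cmd.set_parameter("o", "aligned")            # explicit method call
cmd.realbranches = True                      # property style switch
command_string = str(cmd)
```

Because the check is only on the generated string, such a test can run on
machines where none of the alignment tools are installed.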



From bugzilla-daemon at portal.open-bio.org  Fri May  8 08:07:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 08:07:03 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905081207.n48C73cT012732@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #28 from cymon.cox at gmail.com  2009-05-08 08:07 EST -------
(In reply to comment #27)
> (In reply to comment #26)
> > > I think the final bit of new code is _Dialign.py which still needs to be
> > > updated for the new style __init__ method.
> > 
> > Done - pushed to applic-int (Note windows path stuff absent from _Dialign)
> > 
> 
> OK, that is in CVS now.
> 
> > > Then there are your unit tests...
> > 
> > As they are at present, unittests for Muscle, Mafft, Dialign and Prank all
> > pass. They could of course be made arbitrarily more complex... they should
> > probably have at least one test that uses the properties style parameter
> > setting rather than just set_parameter()
> > C.
> 
> I've added test_Dialign_tool.py to CVS, and then switched a few to using
> keyword arguments and properties.  As far as I can see from here, the tool
> isn't expected to work on Windows (although it might still be possible with
> cygwin):
> http://bibiserv.techfak.uni-bielefeld.de/download/tools/DIALIGN_221.html
> 
> Is that everything?

That's everything currently written. I still want to add interfaces to ProbCons
and T-Coffee.

> You'd mentioned a more general test which just builds the
> strings, but doesn't actually need to run any of the tools themselves.

Yes, I'll do that.
C.



From bugzilla-daemon at portal.open-bio.org  Fri May  8 08:23:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 08:23:03 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905081223.n48CN3nV013977@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #29 from chapmanb at 50mail.com  2009-05-08 08:23 EST -------
Created an attachment (id=1296)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1296&action=view)
Start of TCoffee command line

Cymon;
Here is the start of a TCoffee command line object. It's not up to date with
the latest changes y'all have been making and doesn't have all the options, but
should save some typing.

Brad



From bugzilla-daemon at portal.open-bio.org  Fri May  8 15:14:27 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 15:14:27 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905081914.n48JERYx012798@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


eric.talevich at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1293 is|0                           |1
           obsolete|                            |
Attachment #1294 is|0                           |1
           obsolete|                            |


------- Comment #11 from eric.talevich at gmail.com  2009-05-08 15:14 EST -------
Created an attachment (id=1297)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1297&action=view)
Py2.6-only unit test of PDB warnings

I pushed a branch called bug2820 to github containing just this commit, if
that's easier:

http://github.com/etal/biopython/tree/bug2820

Any suggestions for naming the new file?



From bugzilla-daemon at portal.open-bio.org  Fri May  8 17:45:53 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 17:45:53 -0400
Subject: [Biopython-dev] [Bug 2817] Meta-bug for cleanup once we drop Python
	2.3 support
In-Reply-To: <bug-2817-42@http.bugzilla.open-bio.org/>
Message-ID: <200905082145.n48Ljr4L023802@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2817


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-08 17:45 EST -------
I've started removing support for Python 2.3 in CVS, including removing all the
sets and subprocess special case code.



From bugzilla-daemon at portal.open-bio.org  Fri May  8 18:14:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 18:14:36 -0400
Subject: [Biopython-dev] [Bug 2825] New: SeqIO does not successfully parse
	Genbank records related to whole genome sequencing deposits,
	as Did not recognise the LOCUS line layout
Message-ID: <bug-2825-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2825

           Summary: SeqIO does not successfully parse Genbank records
                    related to whole genome sequencing deposits, as Did not
                    recognise the LOCUS line layout
           Product: Biopython
           Version: 1.49
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: david.wyllie at ndm.ox.ac.uk


Hi

I'm using the BioPython distribution 1.49 obtained as a Package using the
Ubuntu 9 synaptic package manager.  The below describes the problem:

NCBI has a record type which describes the contents of whole-genome sequencing
projects.  The record doesn't itself contain sequence, by contrast to most
GenBank records.

This URL gives an example:
http://www.ncbi.nlm.nih.gov/nuccore/162285818
Should the SeqIO parser be able to read this? It cannot.  Here is an example:

# import modules
from Bio import Entrez
from Bio import SeqIO

# read the record from NCBI, print out the contents.
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
masterrecord=handle.readlines()
for line in masterrecord:
        print line
handle.close()

# let's read it again, and try to parse it with SeqIO.
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")

# this line causes the crash
seq_record = SeqIO.read(handle, "genbank")

handle.close()

# fails.  the traceback reads
"""
Traceback (most recent call last):
  File "bugreport.py", line 25, in <module>
    seq_record = SeqIO.read(handle, "genbank")
  File "/var/lib/python-support/python2.6/Bio/SeqIO/__init__.py", line 435, in
read
    first = iterator.next()
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 410, in
parse_records
    record = self.parse(handle)
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 393, in
parse
    if self.feed(handle, consumer) :
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 360, in
feed
    self._feed_first_line(consumer, self.line)
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 907, in
_feed_first_line
    raise ValueError('Did not recognise the LOCUS line layout:\n' + line)
ValueError: Did not recognise the LOCUS line layout:
LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
"""

# by contrast, reading one of the constituent genbank records, like this one
# http://www.ncbi.nlm.nih.gov/nuccore/162285817
# works correctly;

handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285817")
seq_record = SeqIO.read(handle, "genbank")
handle.close()
print "Successfully loaded record GI=162285817"
print seq_record.description



From bugzilla-daemon at portal.open-bio.org  Fri May  8 18:37:47 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 18:37:47 -0400
Subject: [Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS)
	Genbank records
In-Reply-To: <bug-2825-42@http.bugzilla.open-bio.org/>
Message-ID: <200905082237.n48MbleU027475@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2825


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
            Summary|SeqIO does not successfully |Parsing whole genome
                   |parse Genbank records       |sequencing (WGS) Genbank
                   |related to whole genome     |records
                   |sequencing deposits, as Did |
                   |not recognise the LOCUS line|
                   |layout                      |


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-08 18:37 EST -------
Hi David,

This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment.  For
the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
nucleotides.  Here you have "353 rc" (rc for record count), which as our error
message says, is unexpected.  At the end of the record, there are also WGS
and/or WGS_SCAFLD lines to worry about:

http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html
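
To make the LOCUS line issue concrete, here is a small standalone sketch
that picks out the count and its unit. It is not the Bio.GenBank scanner
(which, as far as this report shows, rejects the "rc" unit); the regular
expression and the parse_locus_size name are assumptions for illustration
only:

```python
import re

# Name, then a number, then one of the units discussed above.
LOCUS_PATTERN = re.compile(r"^LOCUS\s+(\S+)\s+(\d+)\s+(bp|aa|rc)\b")


def parse_locus_size(line):
    """Return (name, count, unit) from a LOCUS line, or raise ValueError."""
    match = LOCUS_PATTERN.match(line)
    if match is None:
        raise ValueError("Did not recognise the LOCUS line layout:\n" + line)
    name, count, unit = match.groups()
    return name, int(count), unit


wgs_line = ("LOCUS       ABIN01000000             353 rc"
            "    DNA     linear   BCT 10-DEC-2007")
name, count, unit = parse_locus_size(wgs_line)
```

A unit of "rc" could then be used to treat the entry as a WGS master record
rather than as a sequence.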

Given these WGS files have no sequence, and no real sequence associated
features either, it strikes me that supporting this in Bio.SeqIO is a stretch
(these records are not really sequences, nor are they about a sequence).

However, Bio.GenBank should perhaps be updated to cope... so I'll leave this
bug open for that as a possible enhancement.  Note I have changed the bug title
from "SeqIO does not successfully parse Genbank records related to whole genome
sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing
whole genome sequencing (WGS) Genbank records", and changed the bug severity to
enhancement.

What information do you want from this file?  In the meantime, I suggest you
fetch the record as XML, which you can parse using Bio.Entrez.read() or your
XML parser of choice.

Peter

P.S. This is a shorter way to dump the file to screen in python:

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
>>> print handle.read()
LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
DEFINITION  Mycobacterium intracellulare ATCC 13950, whole genome shotgun
            sequencing project.
ACCESSION   ABIN00000000
VERSION     ABIN00000000.1  GI:162285818
DBLINK      Project:27955
KEYWORDS    WGS.
SOURCE      Mycobacterium intracellulare ATCC 13950
  ORGANISM  Mycobacterium intracellulare ATCC 13950
            Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
            Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium
            avium complex (MAC).
REFERENCE   1  (bases 1 to 353)
  AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
            Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
  TITLE     Mycobacterium intracellulare Genome Project
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 353)
  AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
            Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-NOV-2007) McGill University and Genome Quebec
            Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec
            H3A 1A4, Canada
COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
            (WGS) project has the project accession ABIN00000000.  This version
            of the project (01) has the accession number ABIN01000000, and
            consists of sequences ABIN01000001-ABIN01000353.
            The whole genome shotgun sequence was generated by the McGill
            University and Genome Quebec Innovation Centre using the GS De Novo
            Assembler from GS-FLX reads.  This strain is available from the
            American Type Culture Collection (www.atcc.org).
FEATURES             Location/Qualifiers
     source          1..353
                     /organism="Mycobacterium intracellulare ATCC 13950"
                     /mol_type="genomic DNA"
                     /strain="ATCC 13950"
                     /serovar="16"
                     /isolation_source="human lymph node"
                     /db_xref="taxon:487521"
                     /note="type strain of Mycobacterium intracellulare ATCC
                     13950
                     associated with disease"
WGS         ABIN01000001-ABIN01000353
//



From bugzilla-daemon at portal.open-bio.org  Fri May  8 19:12:43 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 19:12:43 -0400
Subject: [Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS)
	Genbank records
In-Reply-To: <bug-2825-42@http.bugzilla.open-bio.org/>
Message-ID: <200905082312.n48NChKL030485@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2825


------- Comment #2 from david.wyllie at ndm.ox.ac.uk  2009-05-08 19:12 EST -------
Thank you for your help.  
I just wanted to extract the WGS line, which I'm able to do.
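
For anyone with the same need, one way to pull out the WGS line from the
raw record text is sketched below. The helper names are made up for this
example, and it assumes a single "WGS" line holding a "first-last"
accession range as in the record shown in this report:

```python
import re


def wgs_range(genbank_text):
    """Return (first, last) accessions from the WGS line, or None."""
    for line in genbank_text.splitlines():
        # "WGS " with a space, so WGS_SCAFLD lines are not picked up.
        if line.startswith("WGS "):
            value = line.split(None, 1)[1].strip()
            first, _, last = value.partition("-")
            return first, last or first
    return None


def accession_count(first, last):
    """Number of accessions in the range, from the trailing digits."""
    pattern = re.compile(r"^([A-Z]+)(\d+)$")
    start = int(pattern.match(first).group(2))
    end = int(pattern.match(last).group(2))
    return end - start + 1


record_tail = (
    "KEYWORDS    WGS.\n"
    "WGS         ABIN01000001-ABIN01000353\n"
    "//\n"
)
first, last = wgs_range(record_tail)
total = accession_count(first, last)
```

For this record the range covers 353 accessions, which agrees with the
"353 rc" record count on the LOCUS line.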


(In reply to comment #1)
> Hi David,
> 
> This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment.  For
> the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
> nucleotides.  Here you have "353 rc" (rc for record count), which as our error
> message says, is unexpected.  At the end of the record, there are also WGS
> and/or WGS_SCAFLD lines to worry about:
> 
> http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
> http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html
> 
> Given these WGS files have no sequence, and no real sequence associated
> features either, it strikes me that supporting this in Bio.SeqIO is a stretch
> (these records are not really sequences, nor are they about a sequence).
> 
> However, Bio.GenBank should perhaps be updated to cope... so I'll leave this
> bug open for that as a possible enhancement.  Note I have changed the bug title
> from "SeqIO does not successfully parse Genbank records related to whole genome
> sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing
> whole genome sequencing (WGS) Genbank records", and changed the bug severity to
> enhancement.
> 
> What information do you want from this file?  In the meantime, I suggest you
> fetch the record as XML, which you can parse using Bio.Entrez.read() or your
> XML parser of choice.
> 
> Peter
> 
> P.S. This is a shorter way to dump the file to screen in python:
> 
> >>> from Bio import Entrez
> >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
> >>> print handle.read()
> LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
> DEFINITION  Mycobacterium intracellulare ATCC 13950, whole genome shotgun
>             sequencing project.
> ACCESSION   ABIN00000000
> VERSION     ABIN00000000.1  GI:162285818
> DBLINK      Project:27955
> KEYWORDS    WGS.
> SOURCE      Mycobacterium intracellulare ATCC 13950
>   ORGANISM  Mycobacterium intracellulare ATCC 13950
>             Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
>             Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium
>             avium complex (MAC).
> REFERENCE   1  (bases 1 to 353)
>   AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
>             Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
>   TITLE     Mycobacterium intracellulare Genome Project
>   JOURNAL   Unpublished
> REFERENCE   2  (bases 1 to 353)
>   AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
>             Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
>   TITLE     Direct Submission
>   JOURNAL   Submitted (30-NOV-2007) McGill University and Genome Quebec
>             Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec
>             H3A 1A4, Canada
> COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
>             (WGS) project has the project accession ABIN00000000.  This version
>             of the project (01) has the accession number ABIN01000000, and
>             consists of sequences ABIN01000001-ABIN01000353.
>             The whole genome shotgun sequence was generated by the McGill
>             University and Genome Quebec Innovation Centre using the GS De Novo
>             Assembler from GS-FLX reads.  This strain is available from the
>             American Type Culture Collection (www.atcc.org).
> FEATURES             Location/Qualifiers
>      source          1..353
>                      /organism="Mycobacterium intracellulare ATCC 13950"
>                      /mol_type="genomic DNA"
>                      /strain="ATCC 13950"
>                      /serovar="16"
>                      /isolation_source="human lymph node"
>                      /db_xref="taxon:487521"
>                      /note="type strain of Mycobacterium intracellulare ATCC
>                      13950
>                      associated with disease"
> WGS         ABIN01000001-ABIN01000353
> //
> 



-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Sat May  9 07:59:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 9 May 2009 07:59:32 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905091159.n49BxWpM015484@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #30 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-09 07:59 EST -------
I've got test_Mafft_tool.py working on one Linux machine using MAFFT v6.626b
(2009/03/16) installed from source.

However, test_Mafft_tool.py fails on another Linux machine using MAFFT v6.240
(2007/04/04) installed using the distribution's package, in this case Ubuntu
Jaunty:
http://packages.ubuntu.com/jaunty/mafft

Note that the next version of Ubuntu currently also uses the same old package:
http://packages.ubuntu.com/karmic/mafft

As does Debian unstable:
http://packages.debian.org/unstable/science/mafft

From trying mafft v6.240 by hand at the command line, it never seems to
actually print anything to the console.  Either the MAFFT API changed (which
doesn't seem to be the case), or the version Ubuntu installed on this machine
is broken.  This could be due to something else like the version of awk or gcc
(guesses based on the MAFFT change log):
http://align.bmr.kyushu-u.ac.jp/mafft/software/

Note that the latest version is now MAFFT 6.704, so we should try that too. If
I am right about the current Ubuntu/Debian package being broken, we should get
in touch with them about updating it... otherwise we can look forward to bug
reports about our wrapper and/or test_Mafft_tool.py failing.



From bugzilla-daemon at portal.open-bio.org  Sat May  9 08:31:55 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 9 May 2009 08:31:55 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905091231.n49CVtUj017919@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-09 08:31 EST -------
(In reply to comment #8)
> I have something that works on both Py2.5 and Py2.6 now:
> http://github.com/etal/biopython/tree/pdbtidy

Would it be easy for you to test your code on Python 2.4?  I can probably do
that but not right now...

I would prefer to avoid the extra file by writing this test as part of
test_PDB_unit.py - but the "with" statement isn't valid syntax on Python 2.4,
although it can be used on Python 2.5 via:
from __future__ import with_statement

Could you re-write this to avoid the with statement?

> Also, apparently tests are run in alphabetical order, ...

Yes, that is expected.

> ... and Exposure was jumping ahead of PDBExceptionTest. I renamed
> PDBExceptionTest to ExceptionTest to restore the natural order of
> things and stop setting off the warnings prematurely. Maybe test
> suites with multiple TestCase classes should be arranged alphabetically
> in the code to avoid confusion in the future.

Ideally the unit tests should work in any order - and this is generally a
reasonable assumption, as they should be independent.  Having some carefully
named unit tests will only hide the ordering problem (which is due to the 
global state information in the warnings module).  At the very least, we should
probably have comments in the code about this (to avoid issues in the future)
and maybe use an eye-catching name like AAAAA which should always come first.
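
One way to keep such tests independent of run order is for each test to
set up and restore the warnings state itself.  A minimal sketch, assuming
Python 3 and the standard library's warnings.catch_warnings (available
from Python 2.6, so not usable on Python 2.4 as written); the function
and class names are made up for illustration:

```python
import io
import unittest
import warnings


def deprecated_call():
    """Hypothetical function that emits a warning every time it is called."""
    warnings.warn("old API", DeprecationWarning)
    return "ok"


class WarningTests(unittest.TestCase):
    def test_warning_is_raised(self):
        # Record warnings locally; filters are restored when the block exits.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            deprecated_call()
        self.assertEqual(len(caught), 1)

    def test_result_with_warnings_silenced(self):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            self.assertEqual(deprecated_call(), "ok")


def run_in_order(names):
    """Run the named tests in the given order and report overall success."""
    suite = unittest.TestSuite(WarningTests(name) for name in names)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite).wasSuccessful()


names = ["test_warning_is_raised", "test_result_with_warnings_silenced"]
forward = run_in_order(names)
backward = run_in_order(list(reversed(names)))
```

Because neither test relies on filters left behind by the other, both
orders should pass without any special naming.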



From biopython at maubp.freeserve.co.uk  Sat May  9 09:06:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 9 May 2009 14:06:15 +0100
Subject: [Biopython-dev] PhyloXML read/parse functions and handles
Message-ID: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com>

Hi Eric,

Are you happy to have feedback on your PhyloXML code in public?  In
this case I wanted to make a fairly general observation about parsing
files using handles, so I have cc'd the dev list.

I just had a look at the stub in Bio/PhyloXML/__init__.py and
Bio/PhyloXML/Parser.py on your github branch,
http://github.com/etal/biopython/tree/phyloxml

The convention we are following in Biopython for parsing functions is
as follows:
read(handle, ...) - returns a single object (e.g. a tree in your case)
parse(handle, ...) - returns an iterator (e.g. returning multiple trees)

[This naming convention is arbitrary, but we should try to stick to it
in all new parsers for consistency.]

In Bio/PhyloXML/Parser.py you have a parse() sub function which
according to the comment appears to return a single tree.  If so, this
should be a read() function instead of a parse() function.
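
As an illustration of the convention, here is a minimal sketch in
Python 3, assuming a toy format with one record per non-blank line.  The
function bodies are hypothetical and are not taken from the PhyloXML
branch or from any existing Biopython parser:

```python
import io


def parse(handle):
    """Yield one record per non-blank line of the handle (toy format)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def read(handle):
    """Return the single record in the handle, or raise ValueError."""
    records = parse(handle)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    # Any further record means the caller should have used parse().
    for _extra in records:
        raise ValueError("More than one record found in handle")
    return first


single = read(io.StringIO("treeA\n"))
several = list(parse(io.StringIO("treeA\ntreeB\n")))
```

Building read() on top of parse() keeps the two functions consistent: the
same record logic is used whether the file holds one record or many.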

You seem to have a read() stub function in Bio/PhyloXML/__init__.py
which returns a single tree (good), but takes a (zip) filename (not a
handle - bad). Taking just a filename prevents using a whole range of
handle objects as input - e.g. StringIO handles, URL handles, piped
output from a command line tool etc.  This flexibility is why we focus
on dealing with handles for parsers.

On a related point, you should leave unzipping the file to the user -
this is not specific to dealing with XML tree files.  Plus, in
addition to zip files (i.e. pkzip/winzip format), there are other
compressed file formats to consider, such as tarballs.  They too can be
opened and decompressed on the fly as a handle (e.g. see the gzip python
library).  By taking a handle as the input your parser can then be
used with any of these input sources.
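
To illustrate why a handle-based function is flexible, the sketch below
(Python 3, standard library only) passes the same hypothetical
count_records() function an in-memory handle and a gzip handle that
decompresses on the fly.  The function name and the toy data are
assumptions made for the example:

```python
import gzip
import io
import os
import tempfile


def count_records(handle):
    """Count non-blank lines; works with any text-mode file-like object."""
    return sum(1 for line in handle if line.strip())


text = "first\nsecond\n"

# 1. An in-memory handle.
from_memory = count_records(io.StringIO(text))

# 2. A gzip-compressed file, decompressed on the fly as it is read.
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "example.txt.gz")
    with gzip.open(gz_path, "wt") as out_handle:
        out_handle.write(text)
    with gzip.open(gz_path, "rt") as in_handle:
        from_gzip = count_records(in_handle)
```

The function never needs to know where the data came from; choosing and
opening the source stays with the caller.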

Peter

P.S. Finally, a more general note about a possible "Bio.TreeIO"
module. For simple Newick trees, a single file can contain one or more
trees (e.g. from bootstrapping).  A tree can be split over multiple
lines (but may be one long line), but multiple trees can be split up
because they should all have a semicolon terminator.  For Nexus files,
I'm not sure off hand if there can be more than one tree.  If you are
going to use the Tree objects from Bio.Nexus, then we could provide a
"Bio.TreeIO" module with read/parse/write methods coping with
"newick", "nexus", "phyloxml" formats, all using the same tree
objects.
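
A rough sketch of how a multi-tree Newick file could be split on the
semicolon terminator (Python 3).  This is only an illustration with a
made-up function name; it assumes semicolons never appear inside quoted
labels or comments, which a real parser would have to handle:

```python
import io


def iter_newick_strings(handle):
    """Yield each semicolon-terminated tree string found in the handle."""
    buffer = ""
    for line in handle:
        # A tree may span several lines, so accumulate until a terminator.
        buffer += line.strip()
        while ";" in buffer:
            tree, buffer = buffer.split(";", 1)
            if tree:
                yield tree + ";"
    if buffer:
        raise ValueError("Trailing text without a semicolon terminator")


data = "(A,B,\n(C,D));\n((A,B),C,D);(A,(B,C),D);\n"
trees = list(iter_newick_strings(io.StringIO(data)))
```

Here the first tree is split over two lines and the last two share one
line, yet each is returned as a separate string.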

From bugzilla-daemon at portal.open-bio.org  Sat May  9 12:40:27 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 9 May 2009 12:40:27 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905091640.n49GeRvY002521@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #31 from cymon.cox at gmail.com  2009-05-09 12:40 EST -------
(In reply to comment #30)
> I've got test_Mafft_tool.py working on one Linux machine using MAFFT v6.626b
> (2009/03/16) installed from source.

That was my reference installation when writing the command line tool (on
Jaunty/RHE 5.3).

> However, test_Mafft_tool.py fails on another Linux machine using MAFFT v6.240
> (2007/04/04) installed using the distribution's package, in this case Ubuntu
> Jaunty:
> http://packages.ubuntu.com/jaunty/mafft
> 
> Note that the next version of Ubuntu currently also uses the same old package:
> http://packages.ubuntu.com/karmic/mafft
> 
> As does Debian unstable:
> http://packages.debian.org/unstable/science/mafft
> 
> From trying mafft v6.240 by hand at the command line, it never seems to
> actually print anything to the console.  Either the MAFFT API changed (which
> doesn't seem to be the case), or the version Ubuntu installed on this machine
> is broken.  This could be due to something else like the version of awk or gcc
> (guesses based on the MAFFT change log):
> http://align.bmr.kyushu-u.ac.jp/mafft/software/

Hadn't tried the Ubuntu package...

On the upside, the Muscle3.7 package installed from Ubuntu passes our tests,
whereas the source compiles but core-dumps. Similarly, ProbCons1.2 won't
compile but the Ubuntu package looks good (haven't written the tests yet).

> Note that the latest version is now MAFFT 6.704, so we should try that too. If
> I am right about the current Ubuntu/Debian package being broken, we should get
> in touch with them about updating it... otherwise we can look forward to bug
> reports about our wrapper and/or test_Mafft_tool.py failing.

Built from source on Jaunty; it passes our tests.
C.



From eric.talevich at gmail.com  Sun May 10 01:22:46 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 9 May 2009 22:22:46 -0700
Subject: [Biopython-dev] PhyloXML read/parse functions and handles
In-Reply-To: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com>
References: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com>
Message-ID: <3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com>

On Sat, May 9, 2009 at 6:06 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi Eric,
>
> Are you happy to have feedback on your PhyloXML code in public?


Sure am! I was just getting around to drafting up some questions for
biopython-dev, but I'm glad to receive some preemptive advice.

> I just had a look at the stub in Bio/PhyloXML/__init__.py and
> Bio/PhyloXML/Parser.py on your github branch,
> http://github.com/etal/biopython/tree/phyloxml
>
> The convention we are following in Biopython for parsing functions is
> as follows:
> read(handle, ...) - returns a single object (e.g. a tree in your case)
> parse(handle, ...) - returns an iterator (e.g. returning multiple trees)
>
>
I noticed that; I'll change the Bio.PhyloXML.Parser.parse() stub to read()
and have it behave as expected.

The function currently allows either filenames or file handles as the source
because ElementTree.iterparse() also accepts either object as a source. The
read() function could "assert not isinstance(infile, str)", I guess...

The existing Java implementation in Forester/ATV has even more magic,
automatically performing Zip extraction if the given filename ends with
'.zip'. Since this looks like it will be a pretty common use case, at least
for big files, I thought it would be nice to also offer a wrapper function
that takes a filename and does the Right Thing -- that's what
__init__.read() does currently. Is there a precedent for this in Biopython?
The name should probably be something different; in the pdbtidy branch I
used load(), to match the Pickle module, since the wrapper function does
more than just parse or read a file.

So how about:

from Bio import PhyloXML
handle = open('somefile', 'r') # file-like object from any source
tree = PhyloXML.read(handle)

Equivalent to:

from Bio import PhyloXML
tree = PhyloXML.load('somefile') # DTRT for xml, zip, gz, ...?

Or, to be explicit, offer a read_zip or load_zip function. I'd leave well
enough alone, but the incantation to extract a character stream from a
single zipped file is kind of unintuitive, and one of the three example
files on phyloxml.org is already zipped. (I should really ask Christian
Zmasek about this to see if that's a real convention or not.)
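
For reference, the incantation in question looks roughly like this with
the standard library in Python 3.  The archive is built in memory so the
example is self-contained; the member name and XML content are made up:

```python
import io
import zipfile

# Build a small zip archive in memory so the example is self-contained.
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as archive:
    archive.writestr("example.xml", "<phyloxml></phyloxml>\n")
archive_bytes.seek(0)

# Open the archive, open its only member as a binary stream, then wrap
# that stream to get a text handle a parser could consume.
with zipfile.ZipFile(archive_bytes) as archive:
    member_name = archive.namelist()[0]
    with archive.open(member_name) as binary_handle:
        text_handle = io.TextIOWrapper(binary_handle, encoding="utf-8")
        content = text_handle.read()
```

The three nested steps (archive, member, text wrapper) are what make
this less obvious than opening a plain or gzipped file.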

> P.S. Finally, a more general note about a possible "Bio.TreeIO"
> module. For simple Newick trees, a single file can contain one or more
> trees (e.g. from bootstrapping).  A tree can be split over multiple
> lines (but may be one long line), but multiple trees can be split up
> because they should all have a semicolon terminator.  For Nexus files,
> I'm not sure off hand if there can be more than one tree.  If you are
> going to use the Tree objects from Bio.Nexus, then we could provide a
> "Bio.TreeIO" module with read/parse/write methods coping with
> "newick", "nexus", "phyloxml" formats, all using the same tree
> objects.
>

OK, I'll give it a try. Brad recommended that I just get a simple PhyloXML
parser working first before attempting integration, but if some of Bio.Nexus
can be reused in that process, great. I'm about to go dark from the end of
this week until 3/31 (getting married, yaknow), but I'll fix all this code
when I get back and have access to git again.

Thanks for your help,
Eric

From biopython at maubp.freeserve.co.uk  Sun May 10 05:22:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 10 May 2009 10:22:21 +0100
Subject: [Biopython-dev] PhyloXML read/parse functions and handles
In-Reply-To: <3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com>
References: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com>
	<3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com>
Message-ID: <320fb6e00905100222n22b7670dre26f9368726fce68@mail.gmail.com>

On Sun, May 10, 2009 at 6:22 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> The function currently allows either filenames or file handles as the source
> because ElementTree.iterparse() also accepts either object as a source. The
> read() function could "assert not isinstance(infile, str)", I guess...

Interesting - ReportLab also allows filenames or handles.  If this truly is a
widespread or growing trend in Python libraries, maybe we should do this
as well.
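
A sketch of how one function could accept either a filename or a handle
(Python 3).  The helper name handle_from() is hypothetical and chosen
only for this illustration:

```python
import io
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def handle_from(source, mode="r"):
    """Yield a handle: open `source` if it is a filename, else pass it on."""
    if isinstance(source, str):
        with open(source, mode) as handle:
            yield handle  # closed for the caller on exit
    else:
        yield source  # caller owns this handle, so leave it open


def first_line(source):
    """Toy 'parser' that accepts either a filename or a handle."""
    with handle_from(source) as handle:
        return handle.readline().rstrip("\n")


from_handle = first_line(io.StringIO("hello\nworld\n"))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    with open(path, "w") as out_handle:
        out_handle.write("hello\nworld\n")
    from_filename = first_line(path)
```

The parser itself still works purely on handles; only the thin wrapper
knows about filenames, so the flexibility of handles is not lost.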

> The existing Java implementation in Forester/ATV has even more magic,
> automatically performing Zip extraction if the given filename ends with
> '.zip'. Since this looks like it will be a pretty common use case, at least
> for big files, I thought it would be nice to also offer a wrapper function
> that takes a filename and does the Right Thing -- that's what
> __init__.read() does currently. Is there a precedent for this in Biopython?

Note that Bio.Nexus does this already, making it a bit inconsistent with the
rest of Biopython.  I guess no one noticed or commented back when it was
added.

> The name should probably be something different; in the pdbtidy branch I
> used load(), to match the Pickle module, since the wrapper function does
> more than just parse or read a file.
>
> So how about:
>
> from Bio import PhyloXML
> handle = open('somefile', 'r') # file-like object from any source
> tree = PhyloXML.read(handle)
>
> Equivalent to:
>
> from Bio import PhyloXML
> tree = PhyloXML.load('somefile') # DTRT for xml, zip, gz, ...?
>
> Or, to be explicit, offer a read_zip or load_zip function.

I prefer the more explicit read_zip idea; you would also have an optional
argument for the filename within the zip file.  However, I'm not yet
convinced we need this function.

> I'd leave well enough alone, but the incantation to extract a character
> stream from a single zipped file is kind of unintuitive, and one of the
> three example files on phyloxml.org is already zipped. (I should really
> ask Christian Zmasek about this to see if that's a real convention or
> not.)

Do you want to find out if this really is a phyloxml.org convention first?

If this is their convention, it surprises me they didn't go for .gz files,
which in my experience are more widely used in Bioinformatics (e.g.
at the NCBI and PDB).  These are supported cross platform and hold
one single file (often a tarred file containing multiple files).  A zip file
can hold multiple files, which means you have to make extra
assumptions (e.g. you are using the first file in your code).
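
One way to avoid silently assuming "the first file" is to require an
explicit member name whenever the archive is ambiguous.  A minimal
sketch in Python 3, where open_zip_member() and its member argument are
hypothetical names rather than an agreed interface:

```python
import io
import zipfile


def open_zip_member(source, member=None):
    """Return a text handle for one file inside a zip archive.

    `source` is a filename or a binary file-like object.  If `member` is
    None the archive must hold exactly one file, else ValueError is raised.
    """
    archive = zipfile.ZipFile(source)
    names = archive.namelist()
    if member is None:
        if len(names) != 1:
            raise ValueError(
                "Archive holds %d files; specify which one to read" % len(names)
            )
        member = names[0]
    return io.TextIOWrapper(archive.open(member), encoding="utf-8")


# Build a two-file archive in memory to show the ambiguous case.
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as archive:
    archive.writestr("a.xml", "<a/>\n")
    archive.writestr("b.xml", "<b/>\n")

archive_bytes.seek(0)
content_b = open_zip_member(archive_bytes, member="b.xml").read()
```

Calling open_zip_member(archive_bytes) without a member name raises
ValueError here, because the archive holds two files.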

>> P.S. Finally, a more general note about a possible "Bio.TreeIO"
>> module. For simple Newick trees, a single file can contain one or more
>> trees (e.g. from bootstrapping).  A tree can be split over multiple
>> lines (but may be one long line), but multiple trees can be split up
>> because they should all have a semicolon terminator.  For Nexus files,
>> I'm not sure off hand if there can be more than one tree.  If you are
>> going to use the Tree objects from Bio.Nexus, then we could provide a
>> "Bio.TreeIO" module with read/parse/write methods coping with
>> "newick", "nexus", "phyloxml" formats, all using the same tree
>> objects.
>>
>
> OK, I'll give it a try. Brad recommended that I just get a simple PhyloXML
> parser working first before attempting integration, but if some of Bio.Nexus
> can be reused in that process, great.

Brad is right - getting a simple PhyloXML parser working is the first step.
It would be sensible to look at the Bio.Nexus tree structure though.

> I'm about to go dark from the end of this week until 3/31 (getting
> married, yaknow), but I'll fix all this code when I get back and have
> access to git again.

Congratulations - it looks like you've got a proper break scheduled as well :)

Peter

From bugzilla-daemon at portal.open-bio.org  Sun May 10 09:50:50 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 10 May 2009 09:50:50 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905101350.n4ADoo7x001186@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


------- Comment #13 from eric.talevich at gmail.com  2009-05-10 09:50 EST -------
(In reply to comment #12)
> Would it be easy for you to test your code on Python 2.4?  I can probably do
> that but not right now...

Yes, I can do that, but only on Linux. I don't think there's anything
platform-specific here, though.

> I would prefer to avoid the extra file by writing this test as part of
> test_PDB_unit.py - but the "with" statement isn't valid syntax on Python 2.4,
> although it can be used on Python 2.5 via:
> from __future__ import with_statement
> 
> Could you re-write this to avoid the with statement?

I think the with statement is isomorphic to a try-except-finally arrangement,
calling the context manager's __enter__ method in the try block and __exit__ in
the finally block. I'll look at the source code of the warnings module and
maybe just copy a substantial chunk of it into this unit test (assuming it's
pure Python). That might make it possible to support Py2.4, too.
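
For illustration, a hand-written expansion of a with block might look
like the sketch below (Python 3).  In the usual expansion __enter__ is
called just before the try block rather than inside it.  The example
uses warnings.catch_warnings, which was only added in Python 2.6, so on
Python 2.4 an equivalent context manager would still have to be copied
or reimplemented; noisy() is a made-up function:

```python
import warnings


def noisy():
    """Hypothetical function that warns and returns a value."""
    warnings.warn("example warning", UserWarning)
    return 42


# Hand-written expansion of:
#     with warnings.catch_warnings(record=True) as caught:
#         warnings.simplefilter("always")
#         result = noisy()
manager = warnings.catch_warnings(record=True)
caught = manager.__enter__()  # entered before the try block
try:
    warnings.simplefilter("always")
    result = noisy()
finally:
    # Simplified: a real "with" passes any exception details to __exit__.
    manager.__exit__(None, None, None)

messages = [str(w.message) for w in caught]
```

The finally clause guarantees the previous warning filters are restored
even if the code under test raises.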

> Ideally the unit tests should work in any order - and this is generally a
> reasonable assumption, as they should be independent.  Having some carefully
> named unit tests will only hide the ordering problem (which is due to the 
> global state information in the warnings module).  At the very least, we should
> probably have comments in the code about this (to avoid issues in the future)
> and maybe use an eye-catching name like AAAAA which should always come first.
> 

Agreed. I'll tinker with it some more to see what can be improved here.



From bugzilla-daemon at portal.open-bio.org  Mon May 11 08:40:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 11 May 2009 08:40:49 -0400
Subject: [Biopython-dev] [Bug 2783] Using alternative start codons in
	Bio.Seq translate method/function
In-Reply-To: <bug-2783-42@http.bugzilla.open-bio.org/>
Message-ID: <200905111240.n4BCenqD006754@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2783


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-11 08:40 EST -------
Created an attachment (id=1298)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1298&action=view)
Patch for Bio/Seq.py to support complete CDS translation with non-standard
start codons

I've recently been doing CDS translations for viral/bacterial genes with
alternative start codons - and would like to fix this limitation in Biopython,
rather than having to hack around it.

On Bug 2381, comment #14, I wrote:
> For comparison, the following is copied from the BioPerl documentation about
> their sequence object's translate method.  It would be nice to follow some of
> the same naming conventions for any optional arguments.
> 
> http://www.bioperl.org/Core/Latest/bptutorial.html#iii_3_1_manipulating_sequence_data_with_seq_methods
> 
> If we want to translate full coding regions (CDS) the way major nucleotide
> databanks EMBL, GenBank and DDBJ do it, the translate() method has to perform
> more checks. Specifically, translate() needs to confirm that the sequence has
> appropriate start and terminator codons at the very beginning and the very end
> of the sequence and that there are no terminator codons present within the
> sequence in frame 0. In addition, if the genetic code being used has an
> atypical (non-ATG) start codon, the translate() method needs to convert the
> initial amino acid to methionine. These checks and conversions are triggered
> by setting ``complete'' to 1:
> 
>   $prot_obj = $my_seq_object->translate(-complete => 1);
> 

On Bug 2381, comment #51, Leighton wrote:
> In terms of nomenclature:
> 
> The default behaviour of translate() as Peter proposed: read through in-frame
> and translate with the appropriate codon table - is fine in nearly all
> circumstances.  Most other circumstances are covered by stopping at the first
> in-frame stop codon, which Peter has implemented, and is an option we all seem
> to agree on.
> 
> Biologically-speaking, this behaviour is not always correct for CDS in
> prokaryotes, where alternative start codons may occur a significant minority
> of the time.  These will be mistranslated if no provision is made for them.  I
> think a useful biological sequence object should at least try to mimic actual
> biology, so we should provide an option to handle this.
> 
> We should not assume that a sequence is a CDS unless it is specified by the
> user.  It seems reasonable to me that the term 'cds' should occur in any such
> argument from the user.
> 
> We have at least two options for how to proceed with a CDS: i) we can provide
> a strict CDS-type translation, which requires confirmation that the sequence
> is, in fact, a CDS; ii) we can provide a weak CDS-type translation, which only
> modifies the way the start codon is translated.  In both cases, behaviour is
> specific to CDS, and so having 'cds' in the argument name *somewhere* seems
> obvious, and entirely reasonable.

Leighton's option (ii) is start codon only modification.  This is what I
implemented in the patch on comment 1 (attachment 1259).  We haven't agreed on
a good name for this - which is partly why I went back to revisit the
alternative:

Leighton's option (i) is strict CDS-type translation.  As Leighton suggests,
having "cds" in the argument name here makes sense.  Regarding the BioPerl
argument name for this functionality, "complete", on Bug 2381 comment 19,
Martin wrote:
> The "complete" is a cryptic naming, I wouldn't be fond of it.
>

I think you are both right about the naming.  Would complete_cds=True be
clear?  In fact, I quite like the idea of using cds=True which is short and
also fairly clear.  This patch adds a complete_cds=Boolean argument to the
Bio.Seq translate methods and function, which should act like the BioPerl
equivalent.  It includes doctests showing the new functionality.

I would like to use either of these approaches in Biopython - but not both ;)
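
To make the strict option concrete, here is a standalone sketch in
Python 3 of the checks described in the BioPerl text quoted above.  It
is not the attached patch: the function names are hypothetical, only the
standard genetic code is built in, and the set of accepted start codons
is an illustrative assumption.  Dropping the terminal stop from the
result is a choice made for this sketch:

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Assumed, illustrative set of start codons (not a full table).
START_CODONS = ("ATG", "GTG", "TTG")


def translate(dna):
    """Plain in-frame translation; stop codons become '*'."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def translate_cds(dna, start_codons=START_CODONS):
    """Translate a complete CDS, checking start and stop codons."""
    dna = dna.upper()
    if len(dna) % 3:
        raise ValueError("Length is not a multiple of three")
    if dna[:3] not in start_codons:
        raise ValueError("First codon %r is not a start codon" % dna[:3])
    protein = translate(dna)
    if not protein.endswith("*"):
        raise ValueError("Final codon is not a stop codon")
    if "*" in protein[:-1]:
        raise ValueError("Extra in-frame stop codon found")
    # An alternative start codon is still read as methionine.
    return "M" + protein[1:-1]


plain = translate("GTGAAATAA")
as_cds = translate_cds("GTGAAATAA")
```

With the GTG start codon, the plain translation begins with V while the
CDS translation begins with M.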



From bugzilla-daemon at portal.open-bio.org  Mon May 11 16:00:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 11 May 2009 16:00:29 -0400
Subject: [Biopython-dev] [Bug 2826] New: when creating a de-novo SeqRecord,
	the dbxrefs are not written by SeqIO.write
Message-ID: <bug-2826-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2826

           Summary: when creating a de-novo SeqRecord, the dbxrefs are not
                    written by SeqIO.write
           Product: Biopython
           Version: 1.49
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: david.wyllie at ndm.ox.ac.uk


Hi

when creating a SeqRecord de novo, the dbxrefs are not written by SeqIO.write.
Is this the intended behaviour?

here is an example:

# example script
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# list to hold output records
outlist=[]

# ofh is the output file handle
ofh = open("/home/dwyllie/temporary.gbk","w")

# example of de novo creation of SeqRecord object from url:
# http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html
rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT",
                    generic_protein),
                id="NP_418483.1", name="b4059",
                description="ssDNA-binding protein",
                dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"])

print rec

outlist.append(rec)
count = SeqIO.write(outlist, ofh, "genbank")
ofh.close()

# end of script

OUTPUT:
ID: NP_418483.1
Name: b4059
Description: ssDNA-binding protein
Database cross-references: ASAP:13298, GI:16131885, GeneID:948570
Number of features: 0
Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT', ProteinAlphabet())

Contents of temporary.gbk:
LOCUS       b4059                     46 bp                     UNK 01-JAN-1980
DEFINITION  ssDNA-binding protein
ACCESSION   NP_418483
VERSION     NP_418483.1
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 MASRGVNKVI LVGNLGQDPE VRYMPNGGAV ANITLATSES WRDKAT
//



From bugzilla-daemon at portal.open-bio.org  Mon May 11 16:29:02 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 11 May 2009 16:29:02 -0400
Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank
	by SeqIO
In-Reply-To: <bug-2826-42@http.bugzilla.open-bio.org/>
Message-ID: <200905112029.n4BKT2x0024871@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2826


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|when creating a de-novo     |SeqRecord dbxrefs not
                   |SeqRecord, the dbxrefs are  |written to GenBank by SeqIO
                   |not written by SeqIO.write  |


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-11 16:29 EST -------
Hi David,

Thank you for another interesting bug report. See here for what the NCBI uses
in a GenPept file for this example protein, NP_418483.1
http://www.ncbi.nlm.nih.gov/protein/16131885

The ASAP and GeneID numbers are not recorded at the sequence level - there is
nowhere in the GenBank file format to put them.  They are however recorded
within a CDS feature on the link above.  So, if you want these recorded, you'd
have to create a SeqFeature with the information (you can't use the SeqRecord's
dbxrefs list).

The GI number would get written, but due to an anomaly in the GenBank parser
this is currently stored in the annotations dictionary under the key "gi", so
this is where the GenBank writer looks for this.  We should probably switch to
recording this in the dbxrefs as "gi:12345" as well/instead, and look for this
GI number there instead/as well.
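
As a rough illustration of the "look in both places" idea, the writer could
check the annotations first and then fall back to the dbxrefs list. This is
only a sketch: find_gi and version_line are hypothetical helpers (not existing
Biopython functions), and the case-insensitive "GI:" prefix match is an
assumption.

```python
def find_gi(annotations, dbxrefs):
    """Return the GI number as a string, or None if it is not recorded."""
    # Behaviour described above: the parser stores the GI under "gi".
    gi = annotations.get("gi")
    if gi:
        return str(gi)
    # Proposed fallback: a "GI:12345" style entry in the dbxrefs list.
    for xref in dbxrefs:
        db, sep, value = xref.partition(":")
        if sep and db.upper() == "GI" and value:
            return value
    return None


def version_line(record_id, gi=None):
    """Build a GenBank style VERSION line, adding the GI if known."""
    line = "VERSION     %s" % record_id
    if gi is not None:
        line += "  GI:%s" % gi
    return line


dbxrefs = ["ASAP:13298", "GI:16131885", "GeneID:948570"]
print(version_line("NP_418483.1", find_gi({}, dbxrefs)))
```

The ASAP and GeneID entries are deliberately ignored here, since (as noted
above) they belong in a feature rather than at the sequence level.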

Currently when parsing GenBank files, the only thing stored in the SeqRecord's
dbxrefs list is a PROJECT line cross reference (see Bug 2225).  Looking at the
code, the GenBank writer doesn't currently write that out - it should.

Peter



From bugzilla-daemon at portal.open-bio.org  Mon May 11 18:55:21 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 11 May 2009 18:55:21 -0400
Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank
	by SeqIO
In-Reply-To: <bug-2826-42@http.bugzilla.open-bio.org/>
Message-ID: <200905112255.n4BMtLFc004295@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2826


------- Comment #2 from david.wyllie at ndm.ox.ac.uk  2009-05-11 18:55 EST -------
Thank you. I'm new to BioPython.

The goal was to take some whole-genome sequence (which isn't in Genbank) and
attach a taxon to it, in order that it be written to a BioSQL database.

Other records in the BioSQL database derive from NCBI and so have taxon_ids, so
the additional WGS being in a similar format would make things simpler.

Thank you very much for all your assistance.



From cy at cymon.org  Tue May 12 07:07:59 2009
From: cy at cymon.org (Cymon Cox)
Date: Tue, 12 May 2009 12:07:59 +0100
Subject: [Biopython-dev] Clustal alignment format header line
Message-ID: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>

Both Muscle (-clw) and Probcons (-clustalw)  output a programme specific
header line for the clustal format alignment:

"MUSCLE (3.7) multiple sequence alignment


AK1H_ECOLI/1-378      CPDSINAALICRGEKMSIAIMAGVLEAR etc"

"PROBCONS version 1.12 multiple sequence alignment

AK1H_ECOLI/1-378    CPDSINAALICRGEKMSIAIMA

"

Bio.AlignIO will not read these alignments
Bio/AlignIO/ClustalIO.py:94
 if line[:7] != 'CLUSTAL':
       raise ValueError("Did not find CLUSTAL header")

Muscle does have a -clwstrict flag but ProbCons doesn't.

Would it be a good idea to relax the header parsing?

C.
--

From biopython at maubp.freeserve.co.uk  Tue May 12 11:28:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 16:28:35 +0100
Subject: [Biopython-dev] Clustal alignment format header line
In-Reply-To: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>
References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>
Message-ID: <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com>

On Tue, May 12, 2009 at 12:07 PM, Cymon Cox <cy at cymon.org> wrote:
> Both Muscle (-clw) and Probcons (-clustalw) output a programme specific
> header line for the clustal format alignment:
>
> "MUSCLE (3.7) multiple sequence alignment
>
>
> AK1H_ECOLI/1-378      CPDSINAALICRGEKMSIAIMAGVLEAR etc"
>
> "PROBCONS version 1.12 multiple sequence alignment
>
> AK1H_ECOLI/1-378    CPDSINAALICRGEKMSIAIMA
>
> "
>
> Bio.AlignIO will not read these alignments
> Bio/AlignIO/ClustalIO.py:94
>  if line[:7] != 'CLUSTAL':
>        raise ValueError("Did not find CLUSTAL header")
>
> Muscle does have a -clwstrict flag but ProbCons doesn't.
>
> Would it be a good idea to relax the header parsing?
>
> C.

Maybe.  Up until now the only example of this I had personally come
across was MUSCLE, but they helpfully provide the -clwstrict argument
so the issue wasn't important.

There are also of course the official variants like:

CLUSTAL W (1.81) multiple sequence alignment
CLUSTAL 2.0.9 multiple sequence alignment

How would you code this?  A flexible option would be to take anything
where the first line ends with "multiple sequence alignment", but this
risks letting a lot of non-clustal files through which will then
(hopefully) fail, but probably with a much more cryptic error message.
A white list of safe variants like "MUSCLE" and "PROBCONS" would be
safest.
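
To make the trade-off concrete, here is a minimal sketch of the two checks
side by side. The function names are made up for illustration, and the last
test header is an invented non-Clustal line, included to show what the
flexible check would let through.

```python
KNOWN_PREFIXES = ("CLUSTAL", "MUSCLE", "PROBCONS")
SUFFIX = "multiple sequence alignment"


def accept_by_suffix(header):
    """Flexible check: any first line ending with the usual phrase."""
    return header.rstrip().endswith(SUFFIX)


def accept_by_whitelist(header):
    """Strict check: the first line must start with a known tool name."""
    return header.startswith(KNOWN_PREFIXES)


headers = [
    "CLUSTAL W (1.81) multiple sequence alignment",
    "MUSCLE (3.7) multiple sequence alignment",
    "PROBCONS version 1.12 multiple sequence alignment",
    "Notes on multiple sequence alignment",  # invented, not a Clustal file
]
for header in headers:
    # Columns: suffix check, whitelist check, header text.
    print("%-5s %-5s %s" % (accept_by_suffix(header),
                            accept_by_whitelist(header), header))
```

Only the whitelist rejects the invented header up front; with the suffix
check it would be handed on to the parser and fail later.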

Also I have a vague memory of some tool using something like "CLUSTAL
... from ToolX" but I don't recall the details.

Peter


From cy at cymon.org  Tue May 12 11:43:47 2009
From: cy at cymon.org (Cymon Cox)
Date: Tue, 12 May 2009 16:43:47 +0100
Subject: [Biopython-dev] Clustal alignment format header line
In-Reply-To: <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com>
References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com> 
	<320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com>
Message-ID: <7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com>

2009/5/12 Peter <biopython at maubp.freeserve.co.uk>

> On Tue, May 12, 2009 at 12:07 PM, Cymon Cox <cy at cymon.org> wrote:
> > Both Muscle (-clw) and Probcons (-clustalw)  output a programme specific
> > header line for the clustal format alignment:
> >
> > "MUSCLE (3.7) multiple sequence alignment
> >
> >
> > AK1H_ECOLI/1-378      CPDSINAALICRGEKMSIAIMAGVLEAR etc"
> >
> > "PROBCONS version 1.12 multiple sequence alignment
> >
> > AK1H_ECOLI/1-378    CPDSINAALICRGEKMSIAIMA
> >
> > "
> >
> > Bio.AlignIO will not read these alignments
> > Bio/AlignIO/ClustalIO.py:94
> >  if line[:7] != 'CLUSTAL':
> >       raise ValueError("Did not find CLUSTAL header")
> >
> > Muscle does have a -clwstrict flag but ProbCons doesn't.
> >
> > Would it be a good idea to relax the header parsing?
> >
> > C.
>
> Maybe.  Up until now the only example of this I had personally come
> across was MUSCLE, but they helpfully provide the -clwstrict argument
> so the issue wasn't important.
>
> There are also of course the official variants like:
>
> CLUSTAL W (1.81) multiple sequence alignment
> CLUSTAL 2.0.9 multiple sequence alignment
>
> How would you code this?  A flexible option would be to take anything
> where the first line ends with "multiple sequence alignment", but this
> risks letting a lot of non-clustal files through which will then
> (hopefully) fail, but probably with a much more cryptic error message.
> A white list of safe variants like "MUSCLE" and "PROBCONS" would be
> safest.
>
> Also I have a vague memory of some tool using something like "CLUSTAL
> ... from ToolX" but I don't recall the details.


T-COFFEE for one:
"CLUSTAL FORMAT for T-COFFEE Version_6.92 [http://www.tcoffee.org] [MODE:
], CPU=0.00 sec, SCORE=100, Nseq=2, Len=601"

Is it so bad to let it fail on the structure of the data - effectively
ignore the header? Maybe have a general "this doesn't look like clustal
formatted data" error based on the data structure...
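
Something along these lines is what I have in mind, as a rough sketch only.
The two regular expressions are my assumption of what Clustal style data
lines look like, looks_like_clustal is a made-up name, and the second
sequence in the example is invented.

```python
import re

# Assumed shape of a sequence line: an identifier, whitespace, then residues
# or gaps, optionally followed by a running residue count.
SEQ_LINE = re.compile(r"^\S+\s+[A-Za-z*.\-]+(\s+\d+)?\s*$")
# Consensus lines are indented and use only these conservation symbols.
CONSENSUS_LINE = re.compile(r"^\s+[*:. ]*$")


def looks_like_clustal(lines):
    """Ignore the header line and judge the file by its data lines."""
    body = [line.rstrip("\n") for line in lines[1:] if line.strip()]
    if not body:
        return False
    return all(SEQ_LINE.match(line) or CONSENSUS_LINE.match(line)
               for line in body)


example = [
    "PROBCONS version 1.12 multiple sequence alignment\n",
    "\n",
    "AK1H_ECOLI/1-378    CPDSINAALICRGEKMSIAIMA\n",
    "EXAMPLE_B/1-22      CPDSLNAALICRGEKMSIAIMA\n",  # invented sequence
    "                    ****:*****************\n",
]
if not looks_like_clustal(example):
    raise ValueError("This doesn't look like clustal formatted data")
```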

C.

--

From biopython at maubp.freeserve.co.uk  Tue May 12 12:05:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 17:05:15 +0100
Subject: [Biopython-dev] Loading SeqRecords into BioSQL with NCBI taxon ID
Message-ID: <320fb6e00905120905l3d0e31b2p2a3d92f61096cbd5@mail.gmail.com>

Over on Bug 2826, David wrote:
http://bugzilla.open-bio.org/show_bug.cgi?id=2826#c2

> Thank you. I'm new to BioPython.
>
> The goal was to take some whole-genome sequence (which isn't in Genbank) and
> attach a taxon to it, in order that it be written to a BioSQL database.

You've talked about trying to parse WGS GenBank files on Bug 2825 but
presumably if this new data isn't in GenBank, it is in another format.

What format is your whole-genome sequence?  FASTA or something simple?

> Other records in the BioSQL database derive from NCBI and so have taxon_ids,
> so the additional WGS being in a similar format would make things simpler.

I see. Basically you need to import a SeqRecord into BioSQL with an
NCBI taxon ID.  You don't need to write out a GenBank file to do this.

First create the SeqRecord, e.g.

from Bio import SeqIO
record = SeqIO.read(handle, format, alphabet)

There are now two options - because the BioSQL loader will look for
the NCBI taxon ID in two places:

(Option 1) Record the NCBI taxon ID in the SeqRecord's annotation
dictionary under the "ncbi_taxid" key.  This should work (untested):

record.annotations["ncbi_taxid"] = 12345 #or single element list, [12345]

(Option 2) Mimic a SeqRecord from parsing a GenBank file with a source
feature containing the taxon ID. This should work (untested):

#Create the SeqRecord:
record = SeqIO.read(handle, format, alphabet)
#Create the source features:
from Bio.SeqFeature import SeqFeature, FeatureLocation
f = SeqFeature(FeatureLocation(0, len(record)), strand=+1, type="source")
f.qualifiers["db_xref"] = ["taxon:12345"]
record.features = [f] #or insert at start

If you don't really have a sequence, this second approach doesn't make
so much sense.

[Arguably there could be a third option via the dbxrefs list]

Then in either case, load the modified SeqRecord into the database.
You may want to pre-load the NCBI taxonomy, see
http://www.biopython.org/wiki/BioSQL

Alternatively, using Biopython 1.49+ you can have this fetched from
Entrez on demand with the fetch_NCBI_taxonomy=True option.  The BioSQL
wiki page needs updating on this topic.

Peter

From bugzilla-daemon at portal.open-bio.org  Tue May 12 12:11:43 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 12 May 2009 12:11:43 -0400
Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank
	by SeqIO
In-Reply-To: <bug-2826-42@http.bugzilla.open-bio.org/>
Message-ID: <200905121611.n4CGBhrY001864@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2826


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-12 12:11 EST -------
(In reply to comment #2)
> Thank you. I'm new to BioPython.
> 
> The goal was to take some whole-genome sequence (which isn't in Genbank) and
> attach a taxon to it, in order that it be written to a BioSQL database.

For this example you don't need to write out a GenBank file at all (which is
what this bug was about).  See my email on the mailing list for details:

http://lists.open-bio.org/pipermail/biopython/2009-May/005154.html
and sent in error to the dev list:
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006028.html

I am leaving this bug open for relevant dbxrefs entries not currently recorded
when writing GenBank files with Bio.SeqIO (GI number which goes on the VERSION
line, and genome projects on the PROJECT / DBLINK line).

Peter



From biopython at maubp.freeserve.co.uk  Tue May 12 12:16:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 17:16:35 +0100
Subject: [Biopython-dev] Clustal alignment format header line
In-Reply-To: <7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com>
References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>
	<320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com>
	<7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com>
Message-ID: <320fb6e00905120916p3db7c003kf6eef581cbb4c93b@mail.gmail.com>

On Tue, May 12, 2009 at 4:43 PM, Cymon Cox <cy at cymon.org> wrote:
>Peter wrote:
>> Also I have a vague memory of some tool using something like "CLUSTAL
>> ... from ToolX" but I don't recall the details.
>
> T-COFFEE for one:
> "CLUSTAL FORMAT for T-COFFEE Version_6.92 [http://www.tcoffee.org] [MODE:
> ], CPU=0.00 sec, SCORE=100, Nseq=2, Len=601"

Yes - that is almost certainly the example I was thinking of.

> Is it so bad to let it fail on the structure of the data - effectively
> ignore the header? Maybe have a general "this doesn't look like clustal
> formatted data" error based on the data structure...

Some of the current error messages are a little cryptic to an end
user, I guess they could have "Are you sure this is a Clustal format
file?" appended to them.

I'd be happy with a whitelist of variant headers, i.e. must start with
"CLUSTAL", "MUSCLE" or "PROBCONS" (assuming these tools don't write
their own file formats which also start that way!).  If people find
new cases and report them, it also gives us notice about another tool
we may want to include in our command line wrappers, and/or obtain
sample output files for the unit tests.

Peter

From biopython at maubp.freeserve.co.uk  Tue May 12 13:14:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 18:14:27 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
	<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
	<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
	<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
	<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
	<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
	<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
	<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
	<8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
Message-ID: <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>

On Tue, Apr 28, 2009 at 6:50 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:
> On Tue, Apr 28, 2009 at 7:45 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> I take that back - I added an email address of just "peterc" to my
>> github account (it seems they don't do any validation, perhaps for
>> this very reason?).  This had no immediate effect, but one day later
>> and all my CVS commits are now shown with my photo in github.  Neat -
>
> great

That seems to have stopped working now - no idea why, "peterc" is
still listed as one of my email addresses on my github account, but my
github account is no longer linked to commits in Biopython.  Odd.

Do you think it would be straightforward for your CVS to git
conversion to map the CVS usernames to github usernames for future
commits (so as not to alter the currently published history)?

Peter


From bugzilla-daemon at portal.open-bio.org  Tue May 12 13:33:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 12 May 2009 13:33:03 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905121733.n4CHX3jK009739@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #32 from cymon.cox at gmail.com  2009-05-12 13:33 EST -------
Added PROBCONS and TCOFFEE command line interfaces and unittests.

The TCOFFEE command line implements a very restricted set of options (just those
Brad attached).

Also added white list of known headers to AlignIO/ClustalwIO.py:97 - the
PROBCONS unittest will fail without this alteration.

On http://github.com/cymon/biopython-github-master/tree/applic-int
C.



From bartek at rezolwenta.eu.org  Tue May 12 14:23:18 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 12 May 2009 20:23:18 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
	<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
	<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
	<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
	<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
	<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
	<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
	<8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
	<320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
Message-ID: <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>

On Tue, May 12, 2009 at 7:14 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> That seems to have stopped working now - no idea why, "peterc" is
> still listed as one of my email addresses on my github account, but my
> github account is no longer linked to commits in Biopython.  Odd.

It seems to be OK again. Maybe it was temporary ?

>
> Do you think it would be straightforward for your CVS to git
> conversion to map the CVS usernames to github usernames for future
> commits (so as not to alter the currently published history)?
>

It would be straightforward to add a mapping to the conversion, but I
think it would affect the whole history...

I was thinking that the mapping was going to change when we finally
switch to git. Then it would be a natural course of events...
Otherwise, we would have another step in our transition. Whether it's
worth doing it, depends on how long we expect to be in the transition
between CVS and git.

cheers
Bartek


From bugzilla-daemon at portal.open-bio.org  Tue May 12 14:44:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 12 May 2009 14:44:09 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905121844.n4CIi9sb017010@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1296 is|0                           |1
           obsolete|                            |


------- Comment #33 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-12 14:44 EST -------
(From update of attachment 1296)
(In reply to comment #32)
> Added PROBCONS and TCOFFEE command line interfaces and unittests.
> 
> The TCOFFEE command line implements a very restricted set of options
> (just those Brad attached).
>
> Also added white list of known headers to AlignIO/ClustalwIO.py:97 - the
> PROBCONS unittest will fail without this alteration.
> 
> On http://github.com/cymon/biopython-github-master/tree/applic-int

Thank you Cymon and Brad - those are now checked in, more or less as is.
I did tweak Bio/AlignIO/ClustalwIO.py a little bit.  Also, TCoffee says it can
be installed on Windows using Cygwin - we should try that at some point ;)

Note for the TCoffee suite we could also consider adding xpresso, 3dcoffee,
mcoffee and rcoffee as well - hopefully they have similar interfaces so with
some subclassing we won't have to duplicate a lot of the code.

One other thought - do you think the EMBOSS water and needle wrappers (and any
other alignment tools in EMBOSS) should be made available under
Bio.Align.Applications (via an import in Bio/Align/Applications/__init__.py so
no code duplication)?

Peter



From biopython at maubp.freeserve.co.uk  Tue May 12 14:57:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 19:57:24 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
	<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
	<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
	<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
	<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
	<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
	<8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
	<320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
	<8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>
Message-ID: <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com>

On Tue, May 12, 2009 at 7:23 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:
>
> I was thinking that the mapping was going to change when we finally
> switch to git. Then it would be a natural course of events...
> Otherwise, we would have another step in our transition. Whether it's
> worth doing it, depends on how long we expect to be in the transition
> between CVS and git.

I'm happy that git will work, and that I personally know enough about
the basics to manage.

I'm not happy with the current github repository due to the history
tag issue - but we know we can fix that now.  Are you going to try
removing the old tags and re-doing them on github?

Does anyone know how the git provided "ViewCVS" equivalent shows tags
in a file's history?

I think we should now have a chat with the OBF (off list) about how we
might go about installing git on their server.  Commits can then be
pushed out to github automatically (or pulled from github if we go the
other way round).  This would make several things easier:

(1) Seamless continuation of existing user accounts
(2) Keeping the snapshot code up to date: http://biopython.org/SRC/biopython/
(3) Having our own commit RSS feeds (not essential as this could be
done on github)
(4) Having automatic builds of the documentation (previously discussed
as nice to have)

Plus of course giving redundancy with the code mirrored on both OBF
servers and GitHub :)

Peter

From bugzilla-daemon at portal.open-bio.org  Tue May 12 15:45:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 12 May 2009 15:45:12 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905121945.n4CJjCFj023070@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #34 from cymon.cox at gmail.com  2009-05-12 15:45 EST -------
(In reply to comment #33)
> Note for the TCoffee suite we could also consider adding xpresso, 3dcoffee,
> mcoffee and rcoffee as well - hopefully they have similar interfaces so with
> some subclassing we won't have to duplicate a lot of the code.

With the latest version of t_coffee (and not the currently available Jaunty
package!), these (ie the meta calls like mcoffee etc) are all covered by the
"-mode" option. I just installed t_coffee from source and this appears to be
the case. There are so many options and interdependencies in TCOFFEE, and its
command line is clearly a moving target, that the interface may require more
work before being released.

> One other thought - do you think the EMBOSS water and needle wrappers (and any
> other alignment tools in EMBOSS) should be made available under
> Bio.Align.Applications (via an import in Bio/Align/Applications/__init__.py so
> no code duplication)?

Sounds good to me.
C.



From biopython at maubp.freeserve.co.uk  Tue May 12 19:04:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 May 2009 00:04:53 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
	<20090413123219.GB5429@sobchak.mgh.harvard.edu>
	<320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
	<20090413134429.GE5429@sobchak.mgh.harvard.edu>
	<320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com>
Message-ID: <320fb6e00905121604q4c70d69ck35fb16210fb0efe2@mail.gmail.com>

On Mon, Apr 13, 2009 at 2:49 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>>> > ... Feel free to add away.
>>>
>>> I need to work on my delegation skills - that seems to have backfired ;)
>>
>> Oops. I honestly read that as "do I have your permission?" I can of
>> course tackle this, but am a bit underwater now.
>
> Looking back, I was a bit ambiguous.  I don't mind who does it - let's
> see who has time free first.

That's done in CVS now - plus a few other things like -die and -stdout.
I've also done -outfile via the new base Emboss wrapper, as all the
tools (so far at least) include this option.

>>> Regarding adding -auto support, I have a question about the needle
>>> wrapper and the gap parameters.  Using the needle tool at the command
>>> line will prompt for the gap parameters UNLESS the -auto argument has
>>> been used.  i.e. Without -auto, it makes sense to insist on the gap
>>> parameters being included, which is what the current wrapper does.
>>> However, if we add support for -auto, then these parameters can be
>>> optional.  We could handle this in the wrapper, but it would be messy
>>> (and there may be similar questions with other EMBOSS tools).  What do
>>> you think - stick with the simple option of insisting the Biopython
>>> user set the gap parameters, even if they are using -auto?
>>
>> I think we should stick with the simple option. These were meant to
>> be pretty dumb specifiers that help users write more modular code than
>> simply pasting in a raw string for the command line. Trying to get
>> too fancy is probably overkill.
>
> Agreed.

By putting the outfile argument on the base EMBOSS wrapper class,
together with the related -filter and -stdout options, I was able to
enforce a simple check that at least one of these is used, applicable
to all the wrappers.  This preserves the old safety check that the
output file is required (unless using standard out via -filter and/or
-stdout instead).
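
In outline the check looks something like the following. This is a simplified
sketch rather than the code in CVS: the class names, the error message and the
"-outfile=..." formatting are all placeholders for illustration.

```python
class BaseEmbossSketch(object):
    """Toy stand-in for a shared base wrapper (not the real Bio.Emboss code)."""

    program = "emboss_tool"

    def __init__(self, outfile=None, filter=False, stdout=False):
        self.outfile = outfile
        self.filter = filter
        self.stdout = stdout

    def validate(self):
        # One check, inherited by every tool specific wrapper.
        if not (self.outfile or self.filter or self.stdout):
            raise ValueError("You must either set outfile (output filename), "
                             "or enable filter or stdout (output to stdout).")

    def __str__(self):
        self.validate()
        parts = [self.program]
        if self.filter:
            parts.append("-filter")
        if self.stdout:
            parts.append("-stdout")
        if self.outfile:
            parts.append("-outfile=%s" % self.outfile)
        return " ".join(parts)


class WaterSketch(BaseEmbossSketch):
    # A subclass only needs to name its program; the check comes for free.
    program = "water"


print(WaterSketch(outfile="align.txt"))
```

Because the validation lives on the base class, a wrapper built with none of
the three options set raises the ValueError as soon as the command line
string is generated.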

Something similar could be done so that using -auto overrides any
"required" flags we have set (e.g. for gapopen in water), but this
seems unnecessary to me (as discussed above).

Peter

From biopython at maubp.freeserve.co.uk  Wed May 13 05:55:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 May 2009 10:55:06 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <20090505123656.GB15113@sobchak.mgh.harvard.edu>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
	<20090505123656.GB15113@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>

On Mon, May 4, 2009, Peter  wrote:
>>> ... The (hardly used) existing blastall wrapper in
>>> Bio/Blast/Applications.py gives the "-a" argument a human
>>> readable name of "nprocessors", and "-A" gets "window_size".
>>> With the old set_parameter call either alias could be used.
>>> However, with a python property we need to pick one as a
>>> preferred name - and I'm not 100% sure being helpful and
>>> using "nprocessors" (e.g. cline.nprocessors=4) is actually
>>> better than using the actual argument name (e.g. cline.a = 4).

On Tue, May 5, 2009, Brad wrote:
>> Could we support both the original argument and optional human
>> readable arguments? I know the code in Application is a bit
>> hard coded for the first argument as the real name and the last
>> argument as the readable name; the cleanest solution would be to
>> generalize this to have multiple names where it makes sense.
>> ...

On Tue, May 5, 2009, Peter wrote:
> ...
> I favour using only a single property for each parameter, with the
> name as similar as possible to the actual command line switch (i.e.
> property name "a" for "-a", not "nprocessors").  Note each property
> would have a docstring which will say what it is for ("Number of
> processors to use.").

I still favour only using a single python property for each parameter,
but after some work on the blastall wrapper last night, I am
beginning to come round to your point of view.

If a command line tool provides a long parameter name (some tools
provide both short and long names for important parameters) we
should use that rather than inventing our own [so no change here].

However, for tools like BLAST which *only* have cryptic single letter
command line options (case sensitive), maybe we should be using
a sensible human readable name for the associated property in the
Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size"
for "-A").  Having actually now tried using properties "a" and "A",
the resulting python code is very cryptic - and only makes sense
if you are familiar with the blastall arguments (and given there are
so many of them, this is difficult!).

It should be trivial to extend the documentation strings automatically
to include something like "This maps onto the XXX command line
argument" so that the mapping is clear to the user without having to
look at our source code.

Hopefully this gets the balance right between giving nice python
code, and staying faithful to the actual command line tool API.
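
A minimal sketch of the idea, assuming a toy wrapper class rather than the
real Bio.Application machinery. The names option_property and BlastallSketch,
and the "Window size." description, are placeholders for illustration.

```python
def option_property(switch, description):
    """Build a property for one switch, noting the mapping in its docstring."""
    def getter(self):
        return self._values.get(switch)

    def setter(self, value):
        self._values[switch] = value

    doc = "%s\n\nThis maps onto the %s command line argument." % (description,
                                                                  switch)
    return property(getter, setter, doc=doc)


class BlastallSketch(object):
    """Toy wrapper with two options only (not the real Bio.Blast code)."""

    nprocessors = option_property("-a", "Number of processors to use.")
    window_size = option_property("-A", "Window size.")

    def __init__(self):
        self._values = {}

    def __str__(self):
        parts = ["blastall"]
        for switch in sorted(self._values):
            parts.append("%s %s" % (switch, self._values[switch]))
        return " ".join(parts)


cline = BlastallSketch()
cline.nprocessors = 4
print(cline)
print(BlastallSketch.nprocessors.__doc__)
```

The user writes cline.nprocessors = 4, and help() on the property still shows
which switch it corresponds to.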

Peter

From biopython at maubp.freeserve.co.uk  Wed May 13 07:15:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 May 2009 12:15:35 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
	<20090505123656.GB15113@sobchak.mgh.harvard.edu>
	<320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>
	<7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com>
Message-ID: <320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com>

On Wed, May 13, 2009 at 11:50 AM, Cymon Cox <cy at cymon.org> wrote:
>> On Tue, May 5, 2009, Peter wrote:
>> > ...
>> > I favour using only a single property for each parameter, with the
>> > name as similar as possible to the actual command line switch (i.e.
>> > property name "a" for "-a", not "nprocessors").  Note each property
>> > would have a docstring which will say what is it for ("Number of
>> > processors to use.").
>>
>> I still favour only using a single python property for each parameter,
>
> A confusing issue arises where we have alternative names for options.
> Take the following example from _Probcons.py:
>
>             _Option(["-c", "c", "--consistency", "consistency" ], ["input"],
>                     lambda x: x in range(0,6),
>                     0,
>                     "Use 0 <= REPS <= 5 (default: 2) passes of consistency transformation",
>                     0),
>
>>>> cmd = cmdline = ProbconsCommandline("probcons", input="blah")
>>>> cmd.c = 1
>>>> str(cmd)
> 'probcons blah '
>>>> cmd.set_parameter("c", 1)
>>>> str(cmd)
> 'probcons -c 1 blah '
>>>> cmd.consistency = 2
>>>> str(cmd)
> 'probcons -c 2 blah '
>>>> cmd.c = 5
>>>> str(cmd)
> 'probcons -c 2 blah '
>
> That is, the user needs to look at the code to figure out what the correct
> name is to use when assigning to the property. Is it possible to restrict
> the binding of attributes to the cmdline to only valid property names? An
> alternative would be to restrict all parameters to only one name and
> document the alternatives it covers (dont like this idea - see below).

Yes, you can use any of the defined aliases with set_parameter, and
they are all equally valid, and all do exactly the same thing.  e.g.

cmd = ProbconsCommandline("probcons", input="blah")
cmd.set_parameter("c", 1)
cmd.set_parameter("-c", 1)
cmd.set_parameter("--consistency", 1)
cmd.set_parameter("consistency", 1)

I would however regard set_parameter as a legacy method and
push the (single) keyword argument or property alternative, for
which there is only one name (here "consistency" ):

cmd = ProbconsCommandline("probcons", input="blah")
cmd.consistency = 1

or,

cmd = ProbconsCommandline("probcons", input="blah", consistency=1)

[And yes, we should have some error checking code in the base
class __init__ method to make sure the string used is a valid python
identifier.]
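
A minimal sketch of such a check, assuming it would be called once per
property name when the wrapper is set up. The function name
check_property_name is hypothetical, not an existing Biopython helper:

```python
import keyword
import re

# Letters, digits and underscores only, not starting with a digit.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\Z")


def check_property_name(name):
    """Return name unchanged, or raise ValueError if it cannot be an attribute."""
    if not _IDENTIFIER.match(name) or keyword.iskeyword(name):
        raise ValueError("%r is not a valid Python identifier" % name)
    return name


check_property_name("consistency")      # accepted
for bad in ["-c", "--consistency", "lambda"]:
    try:
        check_property_name(bad)
    except ValueError as err:
        print(err)
```

This would reject the "-c" and "--consistency" aliases as property names
while still allowing them to be used as switches on the command line.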

The user does NOT have to look at the source code to find this out -
just the docstrings or properties - try help(cmd) or dir(cmd) in python.

>> but after some work on the blastall wrapper last night, I am
>> beginning to come round to your point of view.
>>
>> If a command line tool provides a long parameter name (some tools
>> provide both short and long names for important parameters) we
>> should use that rather than inventing our own [so no change here].
>>
>> However, for tools like BLAST which *only* have cryptic single letter
>> command line options (case sensitive), maybe we should be using
>> a sensible human readable name for the associated property in the
>> Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size"
>> for "-A").  Having actually now tried using properties "a" and "A",
>> the resulting python code is very cryptic - and only makes sense
>> if you are familiar with the blastall arguments (and given there are
>> so many of them, this is difficult!).
>
> I don't agree. If you want to make your python code legible to people
> who are not familiar with the command line options, you can just
> comment it. I think the interfaces should stick as close as possible
> to the application documentation. I see these interfaces being used
> mostly by people who are familiar with the applications, in which case
> the command line construction should be fairly intuitive.

Well, I am on the fence here.  The trouble is that sometimes (e.g. BLAST)
the command line parameters themselves are just so cryptic.  Yes, we
could just use "a" and "A", and leave it up to the user to document their
code.  If we use "nprocessors" and "window_size" the code becomes
self documenting (although you have to know Biopython's mapping).

Brad's suggestion to support both in the property and keyword
arguments brings us back to having multiple choices on how to
set a parameter (as with set_parameter and its aliases), which is
confusing and unpythonic.

Peter


From cy at cymon.org  Wed May 13 06:50:54 2009
From: cy at cymon.org (Cymon Cox)
Date: Wed, 13 May 2009 11:50:54 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> 
	<20090505123656.GB15113@sobchak.mgh.harvard.edu>
	<320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>
Message-ID: <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com>

2009/5/13 Peter <biopython at maubp.freeserve.co.uk>

> On Mon, May 4, 2009, Peter  wrote:
> >>> ... The (hardly used) existing blastall wrapper in
> >>> Bio/Blast/Applications.py gives the "-a" argument a human
> >>> readable name of "nprocessors", and "-A" gets "window_size".
> >>> With the old set_parameter call either alias could be used.
> >>> However, with a python property we need to pick one as a
> >>> preferred name - and I'm not 100% sure being helpful and
> >>> using "nprocessors" (e.g. cline.nprocessors=4) is actually
> >>> better than using the actual argument name (e.g. cline.a = 4).
>
> On Tue, May 5, 2009, Brad wrote:
> >> Could we support both the original argument and optional human
> >> readable arguments? I know the code in Application is a bit
> >> hard coded for the first argument as the real name and the last
> >> argument as the readable name; the cleanest solution would be to
> >> generalize this to have multiple names where it makes sense.
> >> ...
>
> On Tue, May 5, 2009, Peter wrote:
> > ...
> > I favour using only a single property for each parameter, with the
> > name as similar as possible to the actual command line switch (i.e.
> > property name "a" for "-a", not "nprocessors").  Note each property
> > would have a docstring which will say what is it for ("Number of
> > processors to use.").
>
> I still favour only using a single python property for each parameter,


A confusing issue arises where we have alternative names for options. Take
the following example from _Probcons.py:


            _Option(["-c", "c", "--consistency", "consistency" ], ["input"],
                    lambda x: x in range(0,6),
                    0,
                    "Use 0 <= REPS <= 5 (default: 2) passes of consistency transformation",
                    0),

>>> cmd = cmdline = ProbconsCommandline("probcons", input="blah")
>>> cmd.c = 1
>>> str(cmd)
'probcons blah '
>>> cmd.set_parameter("c", 1)
>>> str(cmd)
'probcons -c 1 blah '
>>> cmd.consistency = 2
>>> str(cmd)
'probcons -c 2 blah '
>>> cmd.c = 5
>>> str(cmd)
'probcons -c 2 blah '

That is, the user needs to look at the code to figure out what the correct
name is to use when assigning to the property. Is it possible to restrict
the binding of attributes to the cmdline to only valid property names? An
alternative would be to restrict all parameters to only one name and
document the alternatives it covers (I don't like this idea - see below).
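
One way such a restriction might look, sketched with a toy class rather than
the real Bio.Application code. StrictCommandline and its _switches table are
hypothetical, and the input filename is left out for brevity:

```python
class StrictCommandline(object):
    """Toy wrapper that rejects attribute names it does not know about."""

    # Hypothetical mapping of property name -> command line switch.
    _switches = {"consistency": "-c"}

    def __init__(self, cmd, **kwargs):
        # Internal state bypasses our own __setattr__ check.
        object.__setattr__(self, "_cmd", cmd)
        object.__setattr__(self, "_values", {})
        for name, value in kwargs.items():
            setattr(self, name, value)

    def __setattr__(self, name, value):
        if name not in self._switches:
            raise AttributeError("Option name %r was not found." % name)
        self._values[name] = value

    def __str__(self):
        parts = [self._cmd]
        for name in sorted(self._values):
            parts.append("%s %s" % (self._switches[name], self._values[name]))
        return " ".join(parts)


cmd = StrictCommandline("probcons", consistency=2)
print(str(cmd))  # probcons -c 2
try:
    cmd.c = 5  # an alias, not the property name, so this is refused
except AttributeError as err:
    print(err)
```

With a check like this, the silent "cmd.c = 1" case above would fail
immediately instead of being ignored when the command line is built.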

> but after some work on the blastall wrapper last night, I am
> beginning to come round to your point of view.
>
> If a command line tool provides a long parameter name (some tools
> provide both short and long names for important parameters) we
> should use that rather than inventing our own [so no change here].
>
> However, for tools like BLAST which *only* have cryptic single letter
> command line options (case sensitive), maybe we should be using
> a sensible human readable name for the associated property in the
> Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size"
> for "-A").  Having actually now tried using properties "a" and "A",
> the resulting python code is very cryptic - and only makes sense
> if you are familiar with the blastall arguments (and given there are
> so many of them, this is difficult!).


I don't agree. If you want to make your python code legible to people who are
not familiar with the command line options, you can just comment it. I think
the interfaces should stick as close as possible to the application
documentation. I see these interfaces being used mostly by people who are
familiar with the applications, in which case the command line construction
should be fairly intuitive.

Cheers, C.
--

From biopython at maubp.freeserve.co.uk  Wed May 13 09:10:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 May 2009 14:10:59 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
	<20090505123656.GB15113@sobchak.mgh.harvard.edu>
	<320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>
	<7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com>
	<320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com>
Message-ID: <320fb6e00905130610g3eb8edb4q99913b8b0ae14bf9@mail.gmail.com>

On Wed, May 13, 2009 at 12:15 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> The user does NOT have to look at the source code to find this out -
> just the docstrings or properties - try help(cmd) or dir(cmd) in python.
>

I've just updated the automatically generated docstrings for each property
so that they include the actual parameter name which will be used to build
the command line string.

Peter

From bugzilla-daemon at portal.open-bio.org  Wed May 13 11:01:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 13 May 2009 11:01:33 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905131501.n4DF1XYv019413@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #35 from cymon.cox at gmail.com  2009-05-13 11:01 EST -------
I've added some very basic unittests for the command line interfaces, which
don't require the applications to be installed.

test_Application_Commandlines.py - currently it only includes
Bio/Align/Applications but Bio/Emboss tests could be added.

Note that the _Mafft.py command line interface is currently broken due to the
restriction of only having a single instance of a parameter on the command
line. Mafft uses the following option:

--seed alignment1 [--seed alignment2 --seed alignment3 ...]

We could remove support for this option in Mafft.

C.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Wed May 13 11:23:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 13 May 2009 11:23:34 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905131523.n4DFNYX7021233@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #36 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-13 11:23 EST -------
(In reply to comment #35)
> 
> Note that the _Mafft.py command line interface is currently broken due the
> restriction only having a single instance of a parameter on the command line.
> Mafft uses the following option: 
> 
> --seed alignment1 [--seed alignment2 --seed alignment3 ...]
> 
> We could remove support this option in Mafft.

Removing the --seed argument might be a pragmatic short term solution.

I'd considered this type of thing as a possible corner case - but hadn't
mentioned it as I didn't have a concrete example.  I would suggest setting the
parameter value to a list could work:

i.e. Support any of:

cline = MafftCommandline(seed=["alignment1", "alignment2", "alignment3"])
cline.set_parameter("seed", ["alignment1", "alignment2", "alignment3"])
cline.seed = ["alignment1", "alignment2", "alignment3"]

giving:

mafft --seed alignment1 --seed alignment2 --seed alignment3

We'd need to introduce a new _Option subclass for this.  A similar situation
applies to optional argument lists, like the Unix zip command:

zip zipfile file1 file2 file3 ...

where there is a single output filename (here zipfile), and then one or more
input files or filespecifiers (here three entries).
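
A rough sketch of the list-valued idea, assuming a standalone class rather
than a real _Option subclass. RepeatedOption is a hypothetical name and the
sketch does no quoting or validation of the values:

```python
class RepeatedOption(object):
    """Hypothetical option whose value is a list; the switch is repeated."""

    def __init__(self, switch):
        self.switch = switch
        self.values = []

    def __str__(self):
        # Emit "switch value" once for every entry in the list.
        return " ".join("%s %s" % (self.switch, v) for v in self.values)


seed = RepeatedOption("--seed")
seed.values = ["alignment1", "alignment2", "alignment3"]
print("mafft " + str(seed))
# mafft --seed alignment1 --seed alignment2 --seed alignment3
```

The zip-style case would be the same loop without the switch in front of
each value.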

Peter



From winda002 at student.otago.ac.nz  Thu May 14 00:53:42 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 14 May 2009 16:53:42 +1200
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
Message-ID: <4A0BA3D6.5070207@student.otago.ac.nz>


I have been slowly adding some of the scripts I use most commonly to the 
cookbook section of the wiki 
(http://biopython.org/wiki/Category:Cookbook). Since I'm very much a
dilettante at this programming business, and the cookbook is meant as
supplementary documentation for Biopython, it's probably a good idea for
someone who knows what they are doing to look at these things (Peter
has been really helpful with this thus far, but it seems unfair to
saddle one man with so much bad programming :)

I've just added a recipe that uses the nexus class to concatenate 
multiple nexus files and provide some feedback if the taxa are not the 
same in each one: http://biopython.org/wiki/Concatenate_nexus

Any thoughts? If you think you can make it clearer/quicker/better then 
you can edit it on the wiki or provide comments here or there.

Cheers,
David

From biopython at maubp.freeserve.co.uk  Thu May 14 05:27:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 10:27:12 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <4A0BA3D6.5070207@student.otago.ac.nz>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
Message-ID: <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>

On Thu, May 14, 2009 at 5:53 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
>
> I have been slowly adding some of the scripts I use most commonly to the
> cookbook section of the wiki (http://biopython.org/wiki/Category:Cookbook).
> Since I'm very much a dilettante at this programming business as the
> cookbook is meant as supplementary documentation for Biopython it's probably
> a good idea for someone that knows what they are doing to look at these
> things (Peter has been really helpful with this thus far, but is seems
> unfair to saddle one man with so much bad programming :)
>
> I've just added a recipe that uses the nexus class to concatenate multiple
> nexus files and provide some feedback if the taxa are not the same in each
> one: http://biopython.org/wiki/Concatenate_nexus
>
> Any thoughts? If you think you can make it clearer/quicker/better then you
> can edit it on the wiki or provide comments here of there.

What exactly are you trying to achieve?  A big Nexus file with lots
of alignments (and trees) in it?

When I talked to Frank about Nexus files, he said they should only
ever hold one alignment matrix, hence Bio.AlignIO does not allow
writing multiple alignments to a single Nexus file.  If you have some
real world examples of Nexus files holding more than one alignment
matrix, please share them - then we can try and get Bio.AlignIO (and
if need be Bio.Nexus) to cope with them directly!

Peter


From cy at cymon.org  Thu May 14 05:59:51 2009
From: cy at cymon.org (Cymon Cox)
Date: Thu, 14 May 2009 10:59:51 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>
Message-ID: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>

2009/5/14 Peter <biopython at maubp.freeserve.co.uk>

> On Thu, May 14, 2009 at 5:53 AM, David Winter
> <winda002 at student.otago.ac.nz> wrote:
> >
> > I have been slowly adding some of the scripts I use most commonly to the
> > cookbook section of the wiki (
> http://biopython.org/wiki/Category:Cookbook).
> > Since I'm very much a  dilettante at this programming business as the
> > cookbook is meant as supplementary documentation for Biopython it's
> probably
> > a good idea for someone that knows what they are doing to look at these
> > things (Peter has been really helpful with this thus far, but is seems
> > unfair to saddle one man with so much bad programming :)
> >
> > I've just added a recipe that uses the nexus class to concatenate
> multiple
> > nexus files and provide some feedback if the taxa are not the same in
> each
> > one: http://biopython.org/wiki/Concatenate_nexus
> >
> > Any thoughts? If you think you can make it clearer/quicker/better then
> you
> > can edit it on the wiki or provide comments here of there.
>
> What exactly are you trying to achieve?  A big Nexus files with lots
> of alignments (and trees) in it?


The example David has given is very useful and a common procedure for
phylogeneticists. Single genes/proteins tend to be aligned in separate
alignment files and then concatenated into a so-called 'supermatrix'.

One thing I would question is the first line:

"It's a good idea, if possible, to make species-level phylogenetic
inferences bases on multiple genes because a) demographic processes can lead
gene-trees to diverge from species trees and b) journal editors now this."

Yes, it is a good idea to make inferences based upon the largest amount of
data, but if demographic processes have led to some gene(s) that have diverged
from the species tree, then this is a reason not to combine them.
Phylogenetic inference assumes all data evolved on the same tree - typically
one would analyse gene partitions individually to look for incongruence
among partitions before combining the data.


> When I talked to Frank about Nexus files, he said they should only
> ever hold one alignment matrix,


Well, that was my understanding as well. But, it may be wrong. I just tried
it - p4 will read both matrices no problem, PAUP* (the de facto standard
here) will execute both matrices ok presumably leaving just the last as the
data in memory.

I'll look into this further...

Cheers C.
-- 
____________________________________________________________________

Cymon J. Cox

Centro de Ciencias do Mar
Faculdade de Ciencias do Mar e Ambiente (FCMA)
Universidade do Algarve
Campus de Gambelas
8005-139 Faro
Portugal

Phone: +0351 289800909 ext 7909
Fax: +0351 289800051
Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com
HomePage : http://biology.duke.edu/bryology/cymon.html
-8.63/-6.77

From biopython at maubp.freeserve.co.uk  Thu May 14 07:02:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 12:02:03 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>
	<7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
Message-ID: <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com>

On Thu, May 14, 2009 at 10:59 AM, Cymon Cox <cy at cymon.org> wrote:
>> What exactly are you trying to achieve?  A big Nexus files with lots
>> of alignments (and trees) in it?
>
> The example David has given is very useful and a common procedure for
> phylogeneticists. Single gene/proteins tend to be aligned in separate
> alignment files and the concatenated into a so-called 'supermatrix'.

Oh right - I hadn't looked at David's example carefully enough earlier
to work out which concatenation he was doing (by row or by column).
It does make sense on re-reading.

Concatenation to give a single supermatrix (same number of taxa,
longer sequences) would be most elegantly done by sorting the three
alignments (so the taxa are in the same order) and then concatenating
them (by column).  See Bug 2552,
http://bugzilla.open-bio.org/show_bug.cgi?id=2552

Note that this procedure isn't specific to NEXUS files - you could do
this with any alignment format.  It is just fairly straightforward
with the Bio.Nexus module at the moment (at least, until we fix Bug
2552).
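
As a format-independent illustration of concatenation by column, here is a
small sketch using plain dictionaries instead of Biopython alignment objects.
The function name concatenate_by_column is hypothetical, and padding absent
taxa with "?" is an assumption added for the example, not part of the
procedure described above:

```python
def concatenate_by_column(alignments, missing="?"):
    """Join alignments (dicts of taxon -> aligned sequence) column-wise.

    Taxa absent from an alignment are padded with the missing symbol.
    """
    # Sorting gives every alignment the same taxon order.
    taxa = sorted(set(taxon for aln in alignments for taxon in aln))
    supermatrix = dict((taxon, "") for taxon in taxa)
    for aln in alignments:
        lengths = set(len(seq) for seq in aln.values())
        if len(lengths) != 1:
            raise ValueError("Sequences in an alignment must share a length")
        width = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon] += aln.get(taxon, missing * width)
    return supermatrix


gene1 = {"taxonA": "ACGT", "taxonB": "ACGA"}
gene2 = {"taxonB": "TT-", "taxonA": "TTG"}
gene3 = {"taxonA": "CC"}
combined = concatenate_by_column([gene1, gene2, gene3])
for taxon in sorted(combined):
    print(taxon, combined[taxon])
```

Each taxon ends up with one long row, which is the supermatrix shape; the
input order of taxa within each gene does not matter because rows are
matched by name.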

Peter


From biopython at maubp.freeserve.co.uk  Thu May 14 07:11:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 12:11:30 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>
	<7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
	<320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com>
Message-ID: <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com>

On Thu, May 14, 2009 at 12:02 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Oh right - I hadn't looked at David's example carefully enough earlier
> to work out which concatenation he was doing (by row or by column).
> It does make sense on re-reading.

I'd rephrase this bit of the intro:

<start>
It's a good idea, if possible, to make species-level phylogenetic
inferences bases on multiple genes because a) demographic processes
can lead gene-trees to diverge from species trees and b) journal
editors now this. Most of the alignment files supported by Biopython
allow you to write multiple alignments to the same file which makes
this easy. However, the nexus file format (used by PAUP* and Mr Bayes)
does not. In nexus files multiple alignments need to be represented as
different 'character partitions' within a data matrix that contains
one long sequence for each taxon.
<end>

Bio.AlignIO will in general write out one or more alignments to a
file.  It does NOT do any concatenation by column, required to give
the "supermatrix" which you want (which is why I got confused on the
first reading).  How about:

<start>
It's a good idea, if possible, to make species-level phylogenetic
inferences based on multiple genes because (a) demographic processes
can lead gene-trees to diverge from species trees and (b) journal
editors know this.  [add stuff from Cymon's comment here?]

This is usually handled by creating a single "supermatrix" from
separate alignments for each gene.  i.e. You need a single alignment
containing one row for each taxon where the rows are the concatenated
pre-aligned sequences.  In NEXUS files (used by PAUP* and Mr Bayes)
multiple alignments can be explicitly represented as different
'character partitions' within a data matrix that contains one long
sequence for each taxon.  The Bio.Nexus module makes this relatively
straightforward.
<end>

Peter

From cy at cymon.org  Thu May 14 07:30:20 2009
From: cy at cymon.org (Cymon Cox)
Date: Thu, 14 May 2009 12:30:20 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> 
	<7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
Message-ID: <7265d4f0905140430j47b0a661jd58dbe5749e4a1f7@mail.gmail.com>

2009/5/14 Cymon Cox <cy at cymon.org>

> 2009/5/14 Peter <biopython at maubp.freeserve.co.uk>
>
>> When I talked to Frank about Nexus files, he said they should only
>> ever hold one alignment matrix,
>
>
> Well, that was my understanding as well. But, it may be wrong. I just tried
> it - p4 will read both matrices no problem, PAUP* (the de facto standard
> here) will execute both matrices ok presumably leaving just the last as the
> data in memory.
>
> I'll look into this further...
>

After a quick scan of the spec, there appears to be only one oblique
reference to this issue:

"Although the NEXUS standard does not impose constraints on the number of
blocks, particular programs will. For example, MacClade 3.07 does not allow
more than one TAXA block in a file."

So I read that to mean, you can have any number of similarly named blocks in
a NEXUS file, i.e. multiple DATA, TAXA, CHARACTERS, TREES etc., and it's up to
an individual application to decide how to deal with them.

This seems to be in practice what happens: PAUP* will read multiple blocks
of the same name but only the last block of a particular name will remain in
memory after the file has been parsed. On the other hand, P4 will read
multiple DATA blocks and store the different alignments as separate objects,
and read multiple TREES blocks and store all the trees.
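
To check whether a given file actually repeats a block name, a quick scan
along these lines might help. It is only a sketch: the regular expression
looks for "BEGIN name;" at the start of a line and does not handle NEXUS
bracketed comments, and the example file content is made up:

```python
import re
from collections import Counter

# Matches the opening of a block, e.g. "BEGIN DATA;" (case insensitive).
_BLOCK_START = re.compile(r"^\s*begin\s+(\w+)\s*;", re.IGNORECASE | re.MULTILINE)


def block_names(nexus_text):
    """Return the upper-cased name of every block, in file order."""
    return [name.upper() for name in _BLOCK_START.findall(nexus_text)]


example = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=2 NCHAR=4;
  MATRIX
    taxonA ACGT
    taxonB ACGA
  ;
END;
begin data;
  DIMENSIONS NTAX=2 NCHAR=3;
  MATRIX
    taxonA TTG
    taxonB TT-
  ;
END;
BEGIN TREES;
END;
"""

counts = Counter(block_names(example))
repeated = sorted(name for name, n in counts.items() if n > 1)
print(repeated)
```

A parser could then decide explicitly whether to keep only the last block of
each name or to keep them all.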

C.

From biopython at maubp.freeserve.co.uk  Thu May 14 14:20:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 19:20:47 +0100
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field
	in BioSQL
Message-ID: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>

Hi,

This is cross-posted between biopython-dev and biosql-l as it regards
parsing the description (DE) lines in SwissProt files and how they are
stored in BioSQL.  This follows from an earlier discussion on
biopython-dev.

Older SwissProt files just had one or two DE lines, and it made sense
to treat this as a simple string mapped onto the description field in
the bioentry table in BioSQL.  This appears to be what happens with
BioPerl 1.5.x and in Biopython (although the details regarding white
space differ).  However, newer SwissProt files have many DE lines with
additional structure.  The example Michiel gave earlier on the
biopython-dev list was:

http://www.uniprot.org/uniprot/Q9XHP0.txt

This has the following DE lines:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

I had to fight with perl to get my old copy of BioPerl working again
(some weak reference thing), but I managed, and then loaded this file
into my test BioSQL database with:

$ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass
XXX --namespace biosql_test --format swiss Q9XHP0.txt

Then I looked at the resulting description in the main bioentry table:

$ mysql --user=root -p biosql_test -e 'SELECT description FROM
bioentry WHERE accession="Q9XHP0";'

This is stored as one huge long string (without the newlines, I'm not
sure if BioPerl strips those in parsing the file, or when loading it
into the database):

RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S
globulin seed storage protein II; AltName: Full=Alpha-globulin;
Contains: RecName: Full=11S globulin seed storage protein 2 acidic
chain; AltName: Full=11S globulin seed storage protein II acidic
chain; Contains: RecName: Full=11S globulin seed storage protein 2
basic chain; AltName: Full=11S globulin seed storage protein II basic
chain; Flags: Precursor;

For Biopython, I emptied the database then did:

>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd = "XXX", host = "localhost", db="biosql_test")
>>> db = server["biosql-test"] #namespace
>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
1
>>> server.commit()

As before, I looked in the table with mysql.  Again - this stores the
full description from the DE line, although with the newlines
embedded.  So, Biopython is consistent with my old copy of BioPerl
(1.5.x) if we ignore the white space.

However, how does this look in BioPerl 1.6?  If this is the same, are
there any plans to change this?  For Biopython we have discussed
recording most of the DE information under the annotations instead
(keyed off RecName, AltName, Contains, Flags), but I would like to be
consistent with BioPerl+BioSQL.
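
To make the "keyed off RecName, AltName, Contains, Flags" idea concrete, here
is a minimal sketch that groups the DE lines shown above. It is not
Biopython's SwissProt parser: the function name parse_de_lines and the
returned structure are hypothetical, and the rule that entries indented
beyond three spaces belong to the preceding Contains: section is inferred
from this one example rather than taken from the UniProt documentation:

```python
def parse_de_lines(lines):
    """Group structured DE lines by category.

    Returns (top_level, contains): top_level maps a category such as
    RecName to a list of values, and contains holds one such dict per
    Contains: section.
    """
    top_level = {}
    contains = []
    target = top_level
    category = None
    for line in lines:
        body = line[2:]  # drop the "DE" line code
        indent = len(body) - len(body.lstrip())
        text = body.strip().rstrip(";")
        if text == "Contains:":
            contains.append({})
            category = None
            continue
        if ":" in text.split("=", 1)[0]:
            # A new category; deeper indentation means it sits under Contains.
            category, text = [part.strip() for part in text.split(":", 1)]
            target = contains[-1] if (indent > 3 and contains) else top_level
        elif category is None:
            raise ValueError("Continuation line without a category: %r" % line)
        target.setdefault(category, []).append(text)
    return top_level, contains


de_lines = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]
top_level, contains = parse_de_lines(de_lines)
print(top_level["Flags"])
print(len(contains))
```

Whether the bioentry.description field should then hold only the RecName
value, or the full string as BioPerl 1.5.x stores it, is exactly the open
question above.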

Thanks

Peter

From winda002 at student.otago.ac.nz  Thu May 14 18:39:34 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Fri, 15 May 2009 10:39:34 +1200
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com>
References: <4A0BA3D6.5070207@student.otago.ac.nz>	
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>	
	<7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>	
	<320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com>
	<320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com>
Message-ID: <4A0C9DA6.9060403@student.otago.ac.nz>

Peter wrote:
> On Thu, May 14, 2009 at 12:02 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>   
>> Oh right - I hadn't looked at David's example carefully enough earlier
>> to work out which concatenation he was doing (by row or by column).
>> It does make sense on re-reading.
>>     
Well, just about ;)
>
> I'd rephrase this bit of the intro:
>   
Yep, that's much better. Thanks Peter and Cymon for your feedback on 
this, I've updated the intro to include it and a couple of specific 
examples of how you'd use the character partitions.

(Have you guys seen  this:  
doi.wiley.com/10.1111/j.1755-0998.2008.02164.x , you could write a paper 
from one function in your nexus module!)

cheers,
david

From biopython at maubp.freeserve.co.uk  Fri May 15 05:05:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 15 May 2009 10:05:59 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <4A0C9DA6.9060403@student.otago.ac.nz>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>
	<7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
	<320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com>
	<320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com>
	<4A0C9DA6.9060403@student.otago.ac.nz>
Message-ID: <320fb6e00905150205k31d95c84naac1fa7873461263@mail.gmail.com>

On Thu, May 14, 2009 at 11:39 PM, David Winter
<winda002 at student.otago.ac.nz> wrote:
>>
>> I'd rephrase this bit of the intro:
>>
>
> Yep, that's much better. Thanks Peter and Cymon for your feedback on this,
> I've updated the intro to include it and a couple of specific examples of
> how you'd use the character partitions.

That does look much clearer now :) Could you include the three original
alignments in the text?  It would help to let the reader see what is going
on (and could be used to reproduce the example).

> (Have you guys seen this: doi.wiley.com/10.1111/j.1755-0998.2008.02164.x ,
> you could write a paper from one function in your nexus module!)

From the abstract that does sound pretty trivial, but I guess that tool would be
useful for non-programmers - even if you could probably rewrite it as one
short python script using Biopython (or indeed a Perl script using BioPerl etc).

Peter


From bugzilla-daemon at portal.open-bio.org  Fri May 15 20:24:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 15 May 2009 20:24:29 -0400
Subject: [Biopython-dev] [Bug 2829] New: Biosequence.alphabet can be set to
	unknown after loading a nucleotide SeqRecord
Message-ID: <bug-2829-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2829

           Summary: Biosequence.alphabet can be set to unknown after loading
                    a nucleotide SeqRecord
           Product: Biopython
           Version: 1.49
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: david.wyllie at ndm.ox.ac.uk


Hi

I have done the following
1 loaded a small nucleotide fasta file with SeqIO, setting the alphabet
successfully 
2 written it to a test database with BioSQL
3 reloaded it, at which point the reloaded object has a "SingleLetterAlphabet"
alphabet and biosequence.alphabet is set to unknown.

Is this expected?

The overall object was to add some SeqFeatures to the loaded SeqRecord, but it
doesn't seem to store correctly even without any manipulations.

Below demonstrates the problem. The system is Ubuntu 9 x64/ Python 2.6/
Biopython 1.49.

#!/usr/bin/env python

from BioSQL import BioSeqDatabase
from Bio.Alphabet import generic_nucleotide
from Bio import SeqIO
from Bio import Seq

# define variables needed for testing
username="myusername"
password="mypassword"
hostname="localhost"

# we are going to try to load a nucleotide fasta file into a BioSQL database
# need a test file, with inputfile the file name;
#>test_sequence
#ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcggtcgtctccgaactt
inputfile="/home/dwyllie/test.faa"

# we want to create a new BioSQL database, called test
dbname="test"
dbdescription="test of alphabet storage"

# we also want to remove one if it exists, for the purposes of testing
server = BioSeqDatabase.open_database(driver="MySQLdb", db="bioseqdb",
user=username, passwd=password, host=hostname)   
# if the database doesn't exist, we get an error, so we trap for that
try:
        server.remove_database(dbname)
        server.adaptor.commit()
except KeyError:
        print "Attempt to remove ",dbname," failed; going on to create a new one"

server = BioSeqDatabase.open_database(driver="MySQLdb", db="bioseqdb",
user=username, passwd=password, host=hostname)   
db = server.new_database(dbname, description=dbdescription)
server.adaptor.commit()

# set up a list to hold the mycobacterial sequences
selectedrecords = [] # Setup an empty list which we'll later write

# ifh is the input file handle;
ifh = open(inputfile, "rU")

# set a counter
recordsread=0

for record in SeqIO.parse(ifh, "fasta", generic_nucleotide):

        # increment counter
        recordsread=recordsread+1

        # just so we can reload it easily, we'll assign an id to this record
        # however, the problem does not depend on this,
        # nor on the nature of the defline, as far as I can tell
        record.id="IDENTIFIER_"+str(recordsread)

        print "** Note the sequence type of the Seq ** "
        print record

        # note that to this point it does appear to work, and the alphabet is
        # correct.
        selectedrecords.append(record)

print inputfile, "total found ", recordsread
ifh.close()

# write it to the bioSQL database
print "Writing sequences to database"
db.load(selectedrecords)
server.adaptor.commit()

# subsequent attempts to write the re-loaded object fail because no alphabet is
# defined
print "However, the alphabet hasn't been stored."
loadedrecord=db.lookup(gi="IDENTIFIER_1")
print "Displaying re-loaded record"
print loadedrecord

# this can be confirmed by running
sqlcmd="""
select * from 
bioseqdb.biosequence,
bioseqdb.bioentry, 
bioseqdb.biodatabase
where 
biodatabase.biodatabase_id= bioentry.biodatabase_id  and
biosequence.bioentry_id=bioentry.bioentry_id and
biodatabase.name="test"

"""

print "This can be confirmed by examining bioseqdb.biosequence.alphabet, which is set to unknown; ", sqlcmd


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Sat May 16 07:37:52 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 16 May 2009 07:37:52 -0400
Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic
	nucleotide alphabet
In-Reply-To: <bug-2829-42@http.bugzilla.open-bio.org/>
Message-ID: <200905161137.n4GBbqKe018688@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2829


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
            Summary|Biosequence.alphabet can be |BioSQL does not record a
                   |set to unknown after loading|generic nucleotide alphabet
                   |a nucleotide SeqRecord      |


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-16 07:37 EST -------
Biopython has a relatively rich range of alphabets, including IUPAC ambiguous
and unambiguous alphabets, plus ways to indicate gap characters and stop
symbols.  The BioSQL range is much simpler, so some information is inevitably
lost.

In BioSQL, all we store is a simple string, "dna", "rna", "protein" or
"unknown" (although BioJava used uppercase, so that is effectively allowed
too). See:
http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet

This means if your sequence was using "IUPAC extended protein with a * stop
codon", all we can record is "protein". i.e. On retrieval from a BioSQL
database, the alphabet is simply a generic protein.  Likewise "ambiguous IUPAC
DNA with minus as the gap character" just becomes generic DNA.

Note that as far as I know, currently none of the Bio* languages attempt to
record "nucleotide" (i.e. "dna" or "rna").  This is something we should discuss
on the BioSQL mailing list as a possible enhancement.

So in answer to your question "Is this expected?", yes, a generic nucleotide
alphabet isn't "dna", "rna" or "protein" so is currently recorded in the BioSQL
database as "unknown".  This gets turned into the SingleLetterAlphabet on
retrieval.
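
As a rough illustration of the lossy mapping described above, here is a
minimal sketch in plain Python.  It does not use Biopython's own alphabet
classes or the real BioSQL loader; the category names and both helper
functions are hypothetical and only model the behaviour described in this
comment:

```python
# Hypothetical sketch, not Biopython's actual BioSQL code.
# BioSQL's biosequence.alphabet column only holds one of these strings.
BIOSQL_ALPHABETS = ("dna", "rna", "protein", "unknown")


def to_biosql_alphabet(kind):
    """Collapse a (hypothetical) alphabet category name to a BioSQL string.

    Details such as IUPAC ambiguity, gap characters or stop symbols are
    assumed to have been dropped already; only the molecule type is left.
    """
    kind = kind.lower()
    if kind in ("dna", "rna", "protein"):
        return kind
    # A generic nucleotide is neither "dna" nor "rna", so nothing fits.
    return "unknown"


def from_biosql_alphabet(label):
    """Name the generic alphabet a stored BioSQL string is read back as."""
    generic = {
        "dna": "generic DNA",
        "rna": "generic RNA",
        "protein": "generic protein",
    }
    # BioJava wrote uppercase values, so compare case-insensitively.
    return generic.get(label.lower(), "single letter alphabet")


stored = to_biosql_alphabet("nucleotide")
restored = from_biosql_alphabet(stored)
```

With this model a generic nucleotide record is stored as "unknown" and comes
back as a single letter alphabet, which matches what the report shows.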

Changing title to "BioSQL does not record a generic nucleotide alphabet" and
marking this as an enhancement.

Peter

P.S. Are you just testing here, or do you really not know if your sequence is
DNA or RNA?



From bugzilla-daemon at portal.open-bio.org  Sat May 16 07:54:11 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 16 May 2009 07:54:11 -0400
Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic
	nucleotide alphabet
In-Reply-To: <bug-2829-42@http.bugzilla.open-bio.org/>
Message-ID: <200905161154.n4GBsBWZ019474@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2829


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-16 07:54 EST -------
See:
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001515.html



From bartek at rezolwenta.eu.org  Sat May 16 13:39:18 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Sat, 16 May 2009 19:39:18 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
	<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
	<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
	<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
	<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
	<8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
	<320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
	<8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>
	<320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com>
Message-ID: <8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com>

On Tue, May 12, 2009 at 8:57 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> I'm not happy with the current github repository due to the history
> tag issue - but we know we can fix that now.  Are you going to try
> removing the old tags and re-doing them on github?

I've finally found some time for it and fixed the tags in the main repository.
I was able to run the update and it ran OK. I was also able to clone the repo
from the official branch and see that the tags are OK in gitx. If anyone has
problems with the tags, please let me know.

>
> Does anyone know how the git provided "ViewCVS" equivalent shows tags
> in a file's history?

If you are talking about gitweb, you can see it (for example: Makefile
for linux 2.6.17) here:

http://git.kernel.org/?p=linux/kernel/git/chrisw/linux-2.6.17.y.git;a=history;f=Makefile;h=79072d86297e78406791f0fc5764c35eb04fd07d;hb=78ace17e51d4968ed2355e8f708d233d1cc37f6d

I've also installed gitweb on a copy of biopython repo on my server
(not a permanent URL, not updated from trunk)
http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=tree;hb=HEAD

It shows the tags, but (as usual with git), the tags are only shown
for the files which were affected by the particular commit marked with
the tag. So this behavior is consistent with kernel.org and github.


cheers
Bartek


From biopython at maubp.freeserve.co.uk  Sat May 16 16:35:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 21:35:36 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
	<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
	<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
	<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
	<8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
	<320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
	<8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>
	<320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com>
	<8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com>
Message-ID: <320fb6e00905161335i28be05fay848dc18f86e728cf@mail.gmail.com>

On 5/16/09, Bartek Wilczynski <bartek at rezolwenta.eu.org> wrote:
> On Tue, May 12, 2009 at 8:57 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>  >
>  > I'm not happy with the current github repository due to the history
>  > tag issue - but we know we can fix that now.  Are you going to try
>  > removing the old tags and re-doing them on github?
>
>  I've finally found some time for it and fixed the tags in the main repository.

Great :)

>  I was able to run the update and it ran ok,  I was also able to clone the repo
>  from the official branch and see that they are OK in gitx. If anyone
>  has problems with the tags, please let me know.

I'll check with my Mac on Monday.

>  > Does anyone know how the git provided "ViewCVS" equivalent shows
>  > tags in a file's history?
>
> If you are talking about gitweb, you can see it (for example: Makefile
> for linux 2.6.17) here:
>
>  http://git.kernel.org/?p=linux/kernel/git/chrisw/linux-2.6.17.y.git;a=history;f=Makefile;h=79072d86297e78406791f0fc5764c35eb04fd07d;hb=78ace17e51d4968ed2355e8f708d233d1cc37f6d
>
>  I've also installed gitweb on a copy of biopython repo on my server
>  (not a permanent URL, not updated from trunk)
>  http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=tree;hb=HEAD
>
>  It shows the tags, but (as usual with git), the tags are only shown
>  for the files which were affected by the particular commit marked with
>  the tag. So this behavior is consistent with kernel.org and github.

Thanks for those examples.

I see what you mean, looking at Bio/Blast/NCBIXML.py in gitweb for
example, no tags show up at all.  On the other hand, for the NEWS
file, some tags show up.  Basically for what I want to use the tags
for (identifying changes to a single file between two releases),
gitweb doesn't work.  Nor does github's history. This is a shame.

I think the reason CVS (or SVN) seems to work better in this regard is
that, like Python, they care about individual files, while git works in
terms of changes (which may affect multiple files).

I'll see how I get on with the command line or graphical git history
viewers and get back to you...
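
For what it's worth, the command line can answer the per-file question
directly, because "git log" accepts a tag range followed by a path.  Here is
a minimal sketch wrapping that in Python; the two tag names are hypothetical
placeholders for whatever two release tags you want to compare, and the
helper names are made up for this example:

```python
import subprocess


def file_history_between_tags(path, old_tag, new_tag):
    """Build the git command listing commits that touch path between tags."""
    tag_range = "%s..%s" % (old_tag, new_tag)
    return ["git", "log", "--oneline", tag_range, "--", path]


def run_in_repo(command, repo_dir):
    """Run the command inside a local clone and return its output as text."""
    return subprocess.check_output(command, cwd=repo_dir).decode()


# "release-A" and "release-B" are placeholders, not real Biopython tags.
command = file_history_between_tags(
    "Bio/Blast/NCBIXML.py", "release-A", "release-B"
)
```

Calling run_in_repo(command, "/path/to/clone") with real tag names should
list only the commits between the two releases that changed that one file.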

Cheers,

Peter

From hlapp at gmx.net  Sat May 16 18:34:57 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 18:34:57 -0400
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
Message-ID: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>

Don't you love SwissProt (or UniProt as we must call it now I  
suppose). They (understandably) try to squeeze ever more annotation  
into the existing tags, rather than adding new tags.

So, of the following structure:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

really only the first line, with the 'RecName: Full=' removed, is the  
description line as we know it. The rest, I would say, is annotation,  
such as two alternative names, amino acid chains contained in the full  
record (shouldn't this be feature annotation, really? and indeed it is  
- why it needs to be repeated here is beyond me) and their names as  
well as alternative names, and the fact that the sequence is a  
precursor form.

Leaving all this in one string has the advantage that we can round- 
trip it (and there is probably hardly any other way to accomplish  
that), but clearly in terms of semantics this isn't the sequence  
description as we know it anymore.

Does anyone else think too that completely changing the semantics of  
sequence annotation fields is a bad idea? <sigh/>

My inclination from a BioPerl perspective is to extract the part  
following 'RecName: Full=' as the description, and attach the rest as  
annotation. We could in fact use the TagTree class for this. I'm cross- 
posting to BioPerl too to gather what other BioPerl'ers think about  
this.

	-hilmar

On May 14, 2009, at 2:20 PM, Peter wrote:

> Hi,
>
> This is cross-posted between biopython-dev and biosql-l as it regards
> parsing the description (DE) lines in SwissProt files and how they are
> stored in BioSQL.  This follows from an earlier discussion on
> biopython-dev
>
> Older SwissProt files just had one or two DE lines, and it made sense
> to treat this as a simple string mapped onto the description field in
> the bioentry table in BioSQL.  This appears to be what happens with
> BioPerl 1.5.x and in Biopython (although the details regarding white
> space differ).  However, newer SwissProt files have many DE lines with
> additional structure.  The example Michiel gave earlier on the
> biopython-dev list was:
>
> http://www.uniprot.org/uniprot/Q9XHP0.txt
>
> This has the following DE lines:
>
> DE   RecName: Full=11S globulin seed storage protein 2;
> DE   AltName: Full=11S globulin seed storage protein II;
> DE   AltName: Full=Alpha-globulin;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE     AltName: Full=11S globulin seed storage protein II acidic chain;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE     AltName: Full=11S globulin seed storage protein II basic chain;
> DE   Flags: Precursor;
>
> I had to fight with perl to get my old copy of BioPerl working again
> (some weak reference thing), but I managed, and then loaded this file
> into my test BioSQL database with:
>
> $ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass
> XXX --namespace biosql_test --format swiss Q9XHP0.txt
>
> Then I looked at the resulting description in the main bioentry table:
>
> $ mysql --user=root -p biosql_test -e 'SELECT description FROM
> bioentry WHERE accession="Q9XHP0";'
>
> This is stored as one huge long string (without the newlines, I'm not
> sure if BioPerl strips those in parsing the file, or when loading it
> into the database):
>
> RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S
> globulin seed storage protein II; AltName: Full=Alpha-globulin;
> Contains: RecName: Full=11S globulin seed storage protein 2 acidic
> chain; AltName: Full=11S globulin seed storage protein II acidic
> chain; Contains: RecName: Full=11S globulin seed storage protein 2
> basic chain; AltName: Full=11S globulin seed storage protein II basic
> chain; Flags: Precursor;
>
> For Biopython, I emptied the database then did:
>
>>>> from Bio import SeqIO
>>>> from BioSQL import BioSeqDatabase
>>>> server = BioSeqDatabase.open_database(driver="MySQLdb",  
>>>> user="root", passwd = "XXX", host = "localhost", db="biosql_test")
>>>> db = server["biosql-test"] #namespace
>>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
> 1
>>>> server.commit()
>
> As before, I looked in the table with mysql.  Again - this stores the
> full description from the DE line, although with the newlines
> embedded.  So, Biopython is consistent with my old copy of BioPerl
> (1.5.x) if we ignore the white space.
>
> However, how does this look in BioPerl 1.6?  If this is the same, are
> there any plans to change this?  For Biopython we have discussed
> recording most of the DE information under the annotations instead
> (keyed off RecName, AltName, Contains, Flags), but I would like to be
> consistent with BioPerl+BioSQL.
>
> Thanks
>
> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sat May 16 19:14:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:14:54 +0100
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
Message-ID: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>

On 5/16/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> Don't you love SwissProt (or UniProt as we must call it now I suppose).
> They (understandably) try to squeeze ever more annotation into the existing
> tags, rather than adding new tags.
>
>  So, of the following structure:
>
>  DE   RecName: Full=11S globulin seed storage protein 2;
>  DE   AltName: Full=11S globulin seed storage protein II;
>  DE   AltName: Full=Alpha-globulin;
>  DE   Contains:
>  DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
>  DE     AltName: Full=11S globulin seed storage protein II acidic chain;
>  DE   Contains:
>  DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
>  DE     AltName: Full=11S globulin seed storage protein II basic chain;
>  DE   Flags: Precursor;
>
>  really only the first line, with the 'RecName: Full=' removed, is the
> description line as we know it. The rest, I would say, is annotation, such
> as two alternative names, amino acid chains contained in the full record
> (shouldn't this be feature annotation, really? and indeed it is - why it
> needs to be repeated here is beyond me) and their names as well as
> alternative names, and the fact that the sequence is a precursor form.
>
>  Leaving all this in one string has the advantage that we can round-trip it
> (and there is probably hardly any other way to accomplish that), but clearly
> in terms of semantics this isn't the sequence description as we know it
> anymore.
>
>  Does anyone else think too that completely changing the semantics of
> sequence annotation fields is a bad idea? <sigh/>

+1
That's pretty much what I thought on seeing this the first time.

>  My inclination from a BioPerl perspective is to extract the part following
> 'RecName: Full=' as the description, and attach the rest as annotation. We
> could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> too to gather what other BioPerl'ers think about this.

Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x just
treats the DE lines as one big long string?

Could you translate your idea about the TagTree class into something
concrete with BioSQL tables and fields for me? I'm not familiar with
the TagTree (or Perl).

Over on the Biopython list we'd talked about storing this annotation in
a nested structure.  However, in order to use the BioSQL annotations
mechanisms, I think a simple flat structure is required :(

Peter

From biopython at maubp.freeserve.co.uk  Sat May 16 19:28:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:28:43 +0100
Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and
	bioentry.description field in BioSQL
In-Reply-To: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
Message-ID: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>

On 5/17/09, Chris Fields <cjfields at illinois.edu> wrote:
>
> On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote:
> > My inclination from a BioPerl perspective is to extract the part following
> > 'RecName: Full=' as the description, and attach the rest as annotation. We
> > could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> > too to gather what other BioPerl'ers think about this.
> >
> >        -hilmar
> >
>
> This is much like the GN issues we've run into before, and we *could* set
> this up using TagTree or similar.  In the latter case of gene name the data
> is stored in a text tree as follows:
>
>  gene_names:
>   gene_name:
>     Name: GC1QBP
>     Synonyms: HABP1
>     Synonyms: SF2P32
>     Synonyms: C1QBP
>
>  That could be changed to an XML string:
>
>  <?xml version="1.0" encoding="UTF-8"?>
>  <gene_names>
>   <gene_name>
>     <Name>GC1QBP</Name>
>     <Synonyms>HABP1</Synonyms>
>     <Synonyms>SF2P32</Synonyms>
>     <Synonyms>C1QBP</Synonyms>
>   </gene_name>
>  </gene_names>
>
> Thinking about this we should attempt to coalesce around a standard instead
> of forcing the other Bio*  to a specific format.

How would you record this in BioSQL?  As an XML string for an annotation value?

Brad has suggested JSON might be useful for this kind of thing (see
also per-letter-annotation discussion).
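
To make the JSON idea concrete, here is a minimal sketch using Python's
standard json module.  The layout and key names are hypothetical; they
simply mirror the gene name example above and are not an agreed Bio*
standard:

```python
import json

# Hypothetical layout; the key names mirror the gene name example.
gene_names = {
    "gene_names": [
        {"Name": "GC1QBP", "Synonyms": ["HABP1", "SF2P32", "C1QBP"]},
    ]
}

# One string that could be stored as a single annotation value ...
json_string = json.dumps(gene_names, sort_keys=True)

# ... and read back into the same nested structure.
restored = json.loads(json_string)
```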

Peter

From hlapp at gmx.net  Sat May 16 19:37:14 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 19:37:14 -0400
Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and
	bioentry.description field in BioSQL
In-Reply-To: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
Message-ID: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>


On May 16, 2009, at 7:28 PM, Peter wrote:

>> That could be changed to an XML string:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <gene_names>
>>  <gene_name>
>>    <Name>GC1QBP</Name>
>>    <Synonyms>HABP1</Synonyms>
>>    <Synonyms>SF2P32</Synonyms>
>>    <Synonyms>C1QBP</Synonyms>
>>  </gene_name>
>> </gene_names>
>>
>> Thinking about this we should attempt to coalesce around a standard
>> instead of forcing the other Bio*  to a specific format.
>
> How would you record this in BioSQL?  As an XML string for an  
> annotation value?

Yes. A TagTree object can be serialized to XML, and the XML can be  
stored as the annotation value in BioSQL. As the XML can be read back  
in, it allows full round-tripping.
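
The round trip can be sketched in a few lines.  The following uses Python and
its standard xml.etree.ElementTree module purely for illustration; the nested
tuples are a hypothetical stand-in for a TagTree, not BioPerl's actual data
model, and the helper names are made up:

```python
import xml.etree.ElementTree as ET

# Nested annotation as (tag, children-or-text) pairs; a hypothetical
# stand-in for a TagTree.
gene_names = ("gene_names", [
    ("gene_name", [
        ("Name", "GC1QBP"),
        ("Synonyms", "HABP1"),
        ("Synonyms", "SF2P32"),
        ("Synonyms", "C1QBP"),
    ]),
])


def to_element(node):
    """Turn a (tag, content) pair into an XML element, recursively."""
    tag, content = node
    element = ET.Element(tag)
    if isinstance(content, list):
        for child in content:
            element.append(to_element(child))
    else:
        element.text = content
    return element


def from_element(element):
    """Rebuild the (tag, content) pair from an XML element."""
    children = list(element)
    if children:
        return (element.tag, [from_element(child) for child in children])
    return (element.tag, element.text)


# Flat string that could be stored as one annotation value.
xml_string = ET.tostring(to_element(gene_names), encoding="unicode")
restored = from_element(ET.fromstring(xml_string))
```

Here restored compares equal to gene_names, which is what is meant by full
round-tripping.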

> Brad has suggested JSON might be useful for this kind of thing (see
> also per-letter-annotation discussion).

JSON could be another serialization format, but XML is equally or  
better supported in all languages except JavaScript. Furthermore, you  
could just send the XML to the browser and have an XSLT (either  
directly, or indirectly through JavaScript doing the transformation)  
do the rendering.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sat May 16 19:42:17 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 19:42:17 -0400
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>
Message-ID: <8CD4EED1-A689-447F-8F6E-8D2204DD4E86@gmx.net>


On May 16, 2009, at 7:14 PM, Peter wrote:

> Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x
> just treats the DE lines as one big long string?

Yes.

> Could you translate your idea about the TagTree class into something
> concrete with BioSQL tables and fields for me? [...] Over on the
> Biopython list we'd talked about storing this annotation in a nested
> structure.

That's more or less what TagTree is.

>  However, in order to use the BioSQL annotations mechanisms, I think  
> a simple flat structure is required :(

Not necessarily. If you have a flat serialization (such as XML) the  
nested structure isn't needed. Of course that's not a fully normalized  
relational representation, but if you had one, how often would it be  
used, how efficient would those queries be (SQL is poor at nested or  
recursive data structures), and how much pain would it be to write the  
object-relational mappings?

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sun May 17 08:40:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 13:40:47 +0100
Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and
	bioentry.description field in BioSQL
In-Reply-To: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
Message-ID: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>

On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  On May 16, 2009, at 7:28 PM, Peter wrote:
> > > That could be changed to an XML string:
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <gene_names>
> > >  <gene_name>
> > >   <Name>GC1QBP</Name>
> > >   <Synonyms>HABP1</Synonyms>
> > >   <Synonyms>SF2P32</Synonyms>
> > >   <Synonyms>C1QBP</Synonyms>
> > >  </gene_name>
> > > </gene_names>
> > >
> > > Thinking about this we should attempt to coalesce around a standard
> > > instead of forcing the other Bio*  to a specific format.

Absolutely - some common standard should be agreed.

Would you envision doing this for other structured fields, inventing a
new mini XML format each time?  That seems open ended and likely to
cause a lot of work keeping all the Bio* projects synchronised.

Here you have mapped RecName and AltName fields in the DE lines to
Name and Synonyms (shouldn't that be Synonym singular?).  I also don't
get why you have used a gene_name entry inside a gene_names list.
Would you hold the contains information and the flags information from
the DE lines in separate XML entries?

I would have gone for something much closer to the original DE line
markup i.e. using the field names UniProt use, RecName and AltName,
rather than mapping these to Name and Synonym.

> > How would you record this in BioSQL?  As an XML string for an annotation
> > value?
>
> Yes. A TagTree object can be serialized to XML, and the XML can be stored
> as the annotation value in BioSQL. As the XML can be read back in, it allows
> full round-tripping.

Assuming you stored all the DE markup, then yes, a round trip back to
the SwissProt file could be possible.  And, depending on the details
of the XML structure used, it would be possible to represent this in a
python structure too.

> > Brad has suggested JSON might be useful for this kind of thing (see
> > also per-letter-annotation discussion).
>
> JSON could be another serialization format, but XML is equally or better
> supported in all languages except JavaScript. Furthermore, you could just
> send the XML to the browser and have an XSLT (either directly, or indirectly
> through JavaScript doing the transformation) do the rendering.

I have no strong preference for either XML or JSON (but would rather
avoid them if they are not really needed).  For other types of
annotation there may be a clearer advantage for one over the other,
e.g. per letter annotation like the secondary structure of a protein
sequence, or the quality scores of a nucleotide contig.

On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
> Not necessarily. If you have a flat serialization (such as XML) the nested
> structure isn't needed. Of course that's not a fully normalized relational
> representation, but if you had one, how often would it be used, how
> efficient would those queries be (SQL is poor at nested or recursive data
> structures), and how much pain would it be to write the object-relational
> mappings?

In this example, searching the database using one of the SwissProt
AltNames (synonyms), or filtering on the Flags sounds like a
reasonable request - but this would be very difficult if the data is
stored inside XML strings.

Of course, because the RecName and AltName entries are top level, we
could just record them as normal - simple strings in the annotations
table.  This seems much nicer.  Likewise the "Flags: Precursor;" line.
 i.e. listing the tag/value pairs which could be used in the
bioentry_qualifier_value table:

AltName = "Full=11S globulin seed storage protein II"
AltName = "Full=Alpha-globulin"
Flags = "Precursor"

(the RecName field, "Full=11S globulin seed storage protein 2", could
be used for the bioentry.description instead)

The above are all pretty easy.  We only need to consider nesting (or
something like XML or JSON) for some of the DE information, in the
example discussed the Contains lines.  Even this could be done
by storing each contains entry as a single long string (holding both
the name and synonyms) directly from the DE line itself, something
like this:

Contains = "RecName: Full=11S globulin seed storage protein 2 acidic
chain;\nAltName: Full=11S globulin seed storage protein II acidic
chain;"
Contains = "RecName: Full=11S globulin seed storage protein 2 basic
chain;\nAltName: Full=11S globulin seed storage protein II basic
chain;"
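
As a rough illustration of the single-string idea above, the sketch below shows how such a stored Contains value could be split back into its name and synonym parts when needed. The function name split_contains is hypothetical, and it assumes the entries are separated by a newline and terminated by a semicolon, as in the example strings above.

```python
def split_contains(value):
    """Split one stored Contains string back into (field, text) pairs."""
    pairs = []
    for entry in value.split("\n"):
        # Drop surrounding whitespace and the trailing semicolon.
        entry = entry.strip().rstrip(";")
        if not entry:
            continue
        # "RecName: Full=..." -> ("RecName", "Full=...")
        field, _, text = entry.partition(": ")
        pairs.append((field, text))
    return pairs


# Example value taken from the message above (email line wrapping removed).
contains = (
    "RecName: Full=11S globulin seed storage protein 2 acidic chain;\n"
    "AltName: Full=11S globulin seed storage protein II acidic chain;"
)
pairs = split_contains(contains)
```

The point is only that nothing is lost by storing the flat string: the name and its synonyms can still be recovered with a few lines of code.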

Peter

From hlapp at gmx.net  Sun May 17 11:21:59 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 17 May 2009 11:21:59 -0400
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
	<320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
Message-ID: <A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>


On May 17, 2009, at 8:40 AM, Peter wrote:

> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>>
>> On May 16, 2009, at 7:28 PM, Peter wrote:
>>>> That could be changed to an XML string:
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <gene_names>
>>>> <gene_name>
>>>>  <Name>GC1QBP</Name>
>>>>  <Synonyms>HABP1</Synonyms>
>>>>  <Synonyms>SF2P32</Synonyms>
>>>>  <Synonyms>C1QBP</Synonyms>
>>>> </gene_name>
>>>> </gene_names>
>>>>
>>>> Thinking about this we should attempt to coalesce around a standard
>>>> instead of forcing the other Bio*  to a specific format.
>
> [...] Here you have mapped RecName and AltName fields in the DE lines to
> Name and Synonyms (shouldn't that be Synonym singular?).

The example is for the GN lines in SwissProt, not the DE lines.

> [...]
> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>> Not necessarily. If you have a flat serialization (such as XML) the nested
>> structure isn't needed. Of course that's not a fully normalized relational
>> representation, but if you had one, how often would it be used, how
>> efficient would those queries be (SQL is poor at nested or recursive data
>> structures), and how much pain would it be to write the object-relational
>> mappings?
>
> In this example, searching the database using one of the SwissProt
> AltNames (synonyms), or filtering on the Flags sounds like a
> reasonable request - but this would be very difficult if the data is
> stored inside XML strings.

Actually no. Modern full-text indexers (inside or outside the  
database) can index XML text columns right away and very well. In  
fact, for the last project that I built a full-text search for (on top  
of a BioSQL database) I did that by writing custom XML documents to a  
separate table for each record I wanted indexed. Oracle's full text  
indexer did the rest. I also built a separate identifier/name/ 
accession index that pulled all the gene names, symbols, accession  
numbers, identifiers etc into a single table for indexing.

What I mean is, a fully normalized relational representation,  
especially if nested, is often not the most efficient data structure  
for efficient searching and filtering.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From bugzilla-daemon at portal.open-bio.org  Sun May 17 18:53:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 17 May 2009 18:53:13 -0400
Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic
	nucleotide alphabet
In-Reply-To: <bug-2829-42@http.bugzilla.open-bio.org/>
Message-ID: <200905172253.n4HMrDIX006938@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2829


------- Comment #3 from david.wyllie at ndm.ox.ac.uk  2009-05-17 18:53 EST -------
(In reply to comment #2)
> See:
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001515.html
> 

Hi

thank you very much for explaining.

I'm not sure this is a bug, it's a design feature due to my not understanding
the implications of generic_nucleotide.  

I know it's DNA, and if one uses generic_dna instead in the testcase, all is
well.

Alphabets are explained clearly in the documentation.  Thank you again.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Mon May 18 06:08:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 18 May 2009 06:08:45 -0400
Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic
	nucleotide alphabet
In-Reply-To: <bug-2829-42@http.bugzilla.open-bio.org/>
Message-ID: <200905181008.n4IA8j0J015956@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2829


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-18 06:08 EST -------
(In reply to comment #3)
> Hi
> 
> thank you very much for explaining.
> 
> I'm not sure this is a bug, it's a design feature due to my
> not understanding the implications of generic_nucleotide.  

As I argued on the BioSQL mailing list, generic nucleotide
sequences are a valid case not catered to at the moment.
However, they are a corner case, and have no equivalent in
BioPerl (which is happy to guess at DNA or RNA).

Marking this bug as WON'T FIX.

> I know it's DNA, and if one uses generic_dna instead in
> the testcase, all is well.

Good - if you know you have DNA, then specifying a DNA
alphabet would be my recommended course of action.

> Alphabets are explained clearly in the documentation.
> Thank you again.

Let us know if you find anything that needs further
clarification in the documentation.

Thanks,

Peter



From biopython at maubp.freeserve.co.uk  Mon May 18 09:38:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 May 2009 14:38:03 +0100
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
	<320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
	<A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>
Message-ID: <320fb6e00905180638q29de63c4if0627eff416c4481@mail.gmail.com>

On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 17, 2009, at 8:40 AM, Peter wrote:
>>
>> [...] Here you have mapped RecName and AltName fields in the DE lines to
>> Name and Synonyms (shouldn't that be Synonym singular?).
>
> The example is for the GN lines in SwissProt, not the DE lines.

Ah, that probably explains some of my confusion.

>> In this example, searching the database using one of the SwissProt
>> AltNames (synonyms), or filtering on the Flags sounds like a
>> reasonable request - but this would be very difficult if the data is
>> stored inside XML strings.
>
> Actually no. Modern full-text indexers (inside or outside the database) can
> index XML text columns right away and very well. In fact, for the last
> project that I built a full-text search for (on top of a BioSQL database) I
> did that by writing custom XML documents to a separate table for each
> record I wanted indexed. Oracle's full text indexer did the rest. I also built a
> separate identifier/name/accession index that pulled all the gene names,
> symbols, accession numbers, identifiers etc into a single table for
> indexing.

OK, when I said searching "would be very difficult if the data is
stored inside XML strings", maybe it wasn't so difficult for you - but
that still sounds complicated!

Sticking with the GN lines and the synonym, if this was stored as a
simple tag/value as usual in BioSQL, I would write my SQL statement to
search the annotation table where the term id was that associated with
a GN synonym, and the annotation value was "HABP1".  Simple.
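
A minimal sketch of that tag/value lookup, using an in-memory SQLite database in place of a real BioSQL server. The term and bioentry_qualifier_value tables are cut down to the columns needed here, and the term name "gene_synonym" and the bioentry_id 10 are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the BioSQL tables (most columns omitted).
conn.executescript("""
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE bioentry_qualifier_value (
    bioentry_id INTEGER, term_id INTEGER, value TEXT, "rank" INTEGER);
""")
# Hypothetical term for GN synonyms, plus the synonyms from the example.
conn.execute("INSERT INTO term (term_id, name) VALUES (1, 'gene_synonym')")
conn.executemany(
    "INSERT INTO bioentry_qualifier_value VALUES (?, ?, ?, ?)",
    [(10, 1, "HABP1", 1), (10, 1, "SF2P32", 2), (10, 1, "C1QBP", 3)],
)


def find_by_synonym(conn, synonym):
    """Return the bioentry_ids annotated with the given GN synonym."""
    rows = conn.execute(
        "SELECT DISTINCT q.bioentry_id "
        "FROM bioentry_qualifier_value q "
        "JOIN term t ON t.term_id = q.term_id "
        "WHERE t.name = ? AND q.value = ?",
        ("gene_synonym", synonym),
    ).fetchall()
    return [row[0] for row in rows]


matches = find_by_synonym(conn, "HABP1")
```

This is an exact match on an ordinary column, so a normal index on (term_id, value) would be enough to make it fast.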

Using the XML approach, are you suggesting you could do a full text
search on the annotation value field, looking for any rows where the
field contains "<Synonyms>HABP1</Synonyms>", where the term id matches
the GN lines' XML string? This sounds simplistic and probably rather
slow - presumably why you resorted to the more complicated indexing
scheme described above?
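
For comparison, the simplistic substring search described above might look like the following sketch. It again uses SQLite with a cut-down bioentry_qualifier_value table. The term_id 2 for the GN XML string and the bioentry_id 10 are assumptions, and a plain LIKE pattern stands in for a real full-text indexer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry_qualifier_value "
    "(bioentry_id INTEGER, term_id INTEGER, value TEXT)"
)
# The GN example from earlier in the thread, stored as one XML string.
gn_xml = (
    "<gene_names><gene_name><Name>GC1QBP</Name>"
    "<Synonyms>HABP1</Synonyms><Synonyms>SF2P32</Synonyms>"
    "<Synonyms>C1QBP</Synonyms></gene_name></gene_names>"
)
conn.execute(
    "INSERT INTO bioentry_qualifier_value VALUES (?, ?, ?)", (10, 2, gn_xml)
)


def find_by_xml_synonym(conn, term_id, synonym):
    """Substring search inside the XML value; needs a scan of every row."""
    # Note: % and _ inside the synonym would act as LIKE wildcards, and
    # SQLite's LIKE is case-insensitive for ASCII by default.
    pattern = "%<Synonyms>" + synonym + "</Synonyms>%"
    rows = conn.execute(
        "SELECT bioentry_id FROM bioentry_qualifier_value "
        "WHERE term_id = ? AND value LIKE ?",
        (term_id, pattern),
    ).fetchall()
    return [row[0] for row in rows]


xml_matches = find_by_xml_synonym(conn, 2, "HABP1")
```

A pattern with a leading wildcard generally cannot use an ordinary index, which is one reason to expect this approach to be slow on a large table.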

> What I mean is, a fully normalized relational representation, especially if
> nested, is often not the most efficient data structure for efficient
> searching and filtering.

OK.  But do we really need to worry about complex nested structures
for the SwissProt annotation (or in general)?

Peter

From biopython at maubp.freeserve.co.uk  Tue May 19 10:23:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 May 2009 15:23:58 +0100
Subject: [Biopython-dev] [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>
	<290052.25369.qm@web62407.mail.re1.yahoo.com>
	<320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
Message-ID: <320fb6e00905190723u2eca08e6o3f70bf37be79e4bf@mail.gmail.com>

Last month on this thread we started talking about the BLAST
command line wrappers:
http://lists.open-bio.org/pipermail/biopython/2009-April/005134.html

On Wed, Apr 29, 2009, Peter wrote:
> On Wed, Apr 29, 2009, Michiel de Hoon wrote:
>>
>> How would users typically use Bio.Blast.Applications?
>
> In the next release, I would aim to have Bio.Blast.Applications
> updated to cover blastall (fully), plus blastpgp and rpsblast
> (currently not covered) and for the three helper functions
> Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use
> Bio.Blast.Applications internally.

That should be done now in CVS - it turned out to be a lot more
tedious than I had expected, but I think we are OK.

I would be very grateful to have a couple of people test this out.
At the very least, just update your copy of Biopython and confirm
any existing scripts using the Bio.Blast.NCBIStandalone
blastall, blastpgp or rpsblast functions still work as expected.

Note we still need to agree on the preferred name for each
parameter (i.e. what do we use for the python properties) as
discussed on this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005976.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006039.html

Peter

From biopython at maubp.freeserve.co.uk  Tue May 19 13:00:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 May 2009 18:00:41 +0100
Subject: [Biopython-dev] Repeated options in command line interfaces
Message-ID: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com>

Hello all,

Yes - it's another thread about command line wrappers!

One of the Roche 454 off-instrument applications is runMapping,
which in the most general situation allows you to map one or
more SFF files onto one or more FASTA files, e.g.

runMapping -o ~/test -ref example1.fasta example2.fasta -read
data1.sff data2.sff

Notice that "-ref" and "-read" are not repeated, so we could
treat this via the current application wrapper system as follows:

#These modules don't exist (yet):
from Bio.Sequencing.Applications import RunMappingCommandline
cline = RunMappingCommandline()
cline.ref = "example1.fasta example2.fasta"
cline.read = "data1.sff data2.sff"

This isn't very elegant, but would work.  Over on Bug 2815,
Cymon and I have briefly discussed the --seed parameter in
Mafft, which is used to specify one or more alignment files, e.g.

mafft ... --seed alignment1 --seed alignment2 --seed alignment3 ...

Notice that "--seed" is repeated before each value.

I was thinking it would be nice to treat this as a single
property (seed) which takes a list of strings as its value:

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline()
cline.seed = ["alignment1", "alignment2", ...]

or, equivalently:

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline(seed=["alignment1", "alignment2", ...])

or, using the old set_parameter approach,

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline()
cline.set_parameter("seed", ["alignment1", "alignment2", ...])

and similarly for a Roche wrapper, e.g.

#These modules don't exist (yet):
from Bio.Sequencing.Applications import RunMappingCommandline
cline = RunMappingCommandline()
cline.ref = ["example1.fasta", "example2.fasta"]
cline.read = ["data1.sff", "data2.sff"]

Doing this nicely would require two _Option subclasses in
Bio.Application, one for repeated options like "seed" in
Mafft, and one for multiple valued options like "ref" and
"read" in the Roche tools.

Does this sound sensible?
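
A rough sketch of the two option kinds described above, to make the difference concrete. These classes are hypothetical and stand alone; they are not the existing _Option class in Bio.Application, and the names RepeatedOption and MultiValueOption are made up for illustration.

```python
class ListOption:
    """Hypothetical base class: a switch plus a list of string values."""

    def __init__(self, switch, values=()):
        self.switch = switch
        self.values = list(values)


class RepeatedOption(ListOption):
    """Switch repeated before each value, like "--seed" in Mafft."""

    def args(self):
        out = []
        for value in self.values:
            out.extend([self.switch, value])
        return out


class MultiValueOption(ListOption):
    """Switch given once, followed by all values, like "-ref" in runMapping."""

    def args(self):
        if not self.values:
            return []
        return [self.switch] + self.values


seed = RepeatedOption("--seed", ["alignment1", "alignment2"])
ref = MultiValueOption("-ref", ["example1.fasta", "example2.fasta"])
seed_args = seed.args()
ref_args = ref.args()
```

Returning a list of arguments rather than one joined string leaves any quoting of filenames to whatever finally builds the command line.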

Does anyone have any more examples?

Peter

From bugzilla-daemon at portal.open-bio.org  Wed May 20 12:31:24 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 20 May 2009 12:31:24 -0400
Subject: [Biopython-dev] [Bug 2833] New: Features insertion on previous
	bioentry_id
Message-ID: <bug-2833-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

           Summary: Features insertion on previous bioentry_id
           Product: Biopython
           Version: 1.50
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P1
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: andrea at biodec.com


Biopython 1.50 (also 1.50b it's the same code)
python2.4 or python2.5
postgresql 8.3
BioSQL Schema 1.0.1

Problem: 
 Suppose you have 3 seqrecords (s1, s2, s3), where
  - s1 == s3 (but from different sources), in other words
    s1 and s3 are equal but not the same object
  - s2 != s1 and s2 != s3

 Now load them into a BioSQL db in this order:
 - db.load([s1])
 - db.load([s2])
 - db.load([s3])

 At the end of the loading I have only 2 bioentry IDs,
 BUT the s3 features have been inserted on the s2 seqrecord.

---------------------------------------------------------------------------------------
More in details (documented behaviour):

print s1
ID: ENST00000334859
Name: ENST00000334859
Description: Leucine-rich repeat and calponin homology domain-containing
protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8]
Number of features: 24
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000334859']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000334859
Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA',
IUPACAmbiguousDNA())

print s2
ID: ENST00000391466
Name: ENST00000391466
Description: CDNA FLJ44976 fis, clone BRAWH3001833.
[Source:Uniprot/SPTREMBL;Acc:Q6ZQT1]
Number of features: 8
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000391466']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000391466
Seq('ATGACAGTGATTCTCTTTACCCAACTCACCGCACCCATGGCAGTGATTCTCTTT...TAG',
IUPACAmbiguousDNA())

print s3
ID: ENST00000334859
Name: ENST00000334859
Description: Leucine-rich repeat and calponin homology domain-containing
protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8]
Number of features: 24
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000334859']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000334859
Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA',
IUPACAmbiguousDNA())

As you can see: 
 - s1 and s3 are identical and s2 differs from them.
 - s1 and s3 have 24 features
 - s2 has 8 features

STEP 1 (biosql insertion of s1)
  - db.load([s1])
  - looking into the db:
 select bioentry_id, name, accession, identifier  from bioentry;
 bioentry_id |      name       |    accession    |   identifier    |
-------------+-----------------+-----------------+-----------------+
          39 | ENST00000334859 | ENST00000334859 | ENST00000334859 |
(1 row)

  select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
(24 rows)


STEP 2 (biosql insertion of s2)
  - db.load([s2])
  - looking into the db:
 select bioentry_id, name, accession, identifier  from bioentry;
 bioentry_id |      name       |    accession    |   identifier
-------------+-----------------+-----------------+-----------------
          39 | ENST00000334859 | ENST00000334859 | ENST00000334859
          40 | ENST00000391466 | ENST00000391466 | ENST00000391466
(2 rows)

  select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
           315 |          40 |           28 |             15 |              |    1
           316 |          40 |           28 |             15 |              |    2
           317 |          40 |           28 |             15 |              |    3
           318 |          40 |           28 |             15 |              |    4
           319 |          40 |           28 |             15 |              |    5
           320 |          40 |           28 |             15 |              |    6
           321 |          40 |           28 |             15 |              |    7
           322 |          40 |           28 |             15 |              |    8
(32 rows)

STEP 3 (biosql insertion of s3)
  - db.load([s3])
  - looking into the db:
select bioentry_id, name, accession, identifier  from bioentry;
 bioentry_id |      name       |    accession    |   identifier
-------------+-----------------+-----------------+-----------------
          39 | ENST00000334859 | ENST00000334859 | ENST00000334859
          40 | ENST00000391466 | ENST00000391466 | ENST00000391466
(2 rows)

select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
           315 |          40 |           28 |             15 |              |    1
           316 |          40 |           28 |             15 |              |    2
           317 |          40 |           28 |             15 |              |    3
           318 |          40 |           28 |             15 |              |    4
           319 |          40 |           28 |             15 |              |    5
           320 |          40 |           28 |             15 |              |    6
           321 |          40 |           28 |             15 |              |    7
           322 |          40 |           28 |             15 |              |    8
           323 |          40 |           27 |             15 |              |    1
           324 |          40 |           27 |             15 |              |    2
           325 |          40 |           27 |             15 |              |    3
           326 |          40 |           27 |             15 |              |    4
           327 |          40 |           27 |             15 |              |    5
           328 |          40 |           14 |             15 |              |    6
           329 |          40 |           14 |             15 |              |    7
           330 |          40 |           30 |             15 |              |    8
           331 |          40 |           30 |             15 |              |    9
           332 |          40 |           30 |             15 |              |   10
           333 |          40 |           30 |             15 |              |   11
           334 |          40 |           30 |             15 |              |   12
           335 |          40 |           30 |             15 |              |   13
           336 |          40 |           30 |             15 |              |   14
           337 |          40 |           30 |             15 |              |   15
           338 |          40 |           30 |             15 |              |   16
           339 |          40 |           30 |             15 |              |   17
           340 |          40 |           25 |             15 |              |   18
           341 |          40 |           25 |             15 |              |   19
           342 |          40 |           25 |             15 |              |   20
           343 |          40 |           25 |             15 |              |   21
           344 |          40 |           25 |             15 |              |   22
           345 |          40 |           26 |             15 |              |   23
           346 |          40 |           26 |             15 |              |   24
(56 rows)

As you can see, the 24 features of the s3 seqrecord have been added to
bioentry_id 40 (which was s2).
------------------------------------------------------------------------------------

The problem is not so easy to understand. I had a look at the code of
Loader.py and found the following. The code works this way:
  1) it tries to load the seqrecord using:
          load_seqrecord(self, record)
          The first thing this method does is try to load the bioentry
          table with the method:
                _load_bioentry_table(self, record)
                The last thing this method does is try to get the
                bioentry_id of the "just inserted" record with the db
                method:
                self.adaptor.last_id('bioentry')

  2) then, with the bioentry_id recovered by the first method,
     it tries to fill the other tables, including seqfeature.

  3) In BioSQL (the schema), if you try to insert a record into
     the bioentry table that has the same Identifier or Accession
     as an existing record, it doesn't do anything
     and it tells you "INSERT 0 0".

  4) So, if you try to insert the s3 record, which has the same
     Accession and Identifier as s1, the bioentry_id used by
     the load_seqrecord(self, record) method will be
     the bioentry_id of the s2 record (it will be the
     self.adaptor.last_id('bioentry') output).

Other information may also be transferred to s2 (not only
the features). For example, "dbxrefs" could suffer
from the same problem.
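
The failure mode described in the steps above can be reproduced in miniature. This is only a sketch under stated assumptions: it uses SQLite instead of PostgreSQL, "INSERT OR IGNORE" stands in for the schema silently skipping a duplicate ("INSERT 0 0"), and SELECT MAX(bioentry_id) stands in for self.adaptor.last_id('bioentry'). The function naive_load is hypothetical and is not the code in Loader.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY, accession TEXT UNIQUE);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY, bioentry_id INTEGER);
""")


def naive_load(conn, accession, n_features):
    """Insert a record, then trust the newest id to be the one just added."""
    # A duplicate accession is silently skipped (stand-in for "INSERT 0 0").
    conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession) VALUES (?)", (accession,)
    )
    # Stand-in for last_id('bioentry'): the most recent id in the table.
    bioentry_id = conn.execute(
        "SELECT MAX(bioentry_id) FROM bioentry"
    ).fetchone()[0]
    # The features are attached to whatever id came back.
    conn.executemany(
        "INSERT INTO seqfeature (bioentry_id) VALUES (?)",
        [(bioentry_id,)] * n_features,
    )
    return bioentry_id


ids = [
    naive_load(conn, "ENST00000334859", 24),  # s1
    naive_load(conn, "ENST00000391466", 8),   # s2
    naive_load(conn, "ENST00000334859", 24),  # s3, same accession as s1
]
counts = dict(
    conn.execute(
        "SELECT bioentry_id, COUNT(*) FROM seqfeature GROUP BY bioentry_id"
    )
)
```

With these stand-ins the third load gets the id of the second record, which ends up with 8 + 24 = 32 features, matching the 56 rows seen in the report.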

I think the solution depends on what we expect from the code:
  - if we expect a behaviour like "don't do anything with identical
    Accession/Identifier", it is better to check the last_id before and
    after insertion and return None if it is unchanged,
    then treat a "None" bioentry_id as a signal to skip the other
    BioSQL insertions.

  - if we expect a "merge" behaviour, it is better to
    retrieve the bioentry_id of the object with the same Accession/Identifier,
    then verify that the 2 seqrecords have identical sequences, and
    then merge features/annotations/dbxrefs etc.

  - other behaviours... other solutions...
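
The first option (compare the last id before and after the insert, and return None when nothing changed) might be sketched as follows. As before this uses SQLite as a stand-in: "INSERT OR IGNORE" mimics the silently skipped duplicate, SELECT MAX(bioentry_id) mimics last_id('bioentry'), and the name checked_insert is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry "
    "(bioentry_id INTEGER PRIMARY KEY, accession TEXT UNIQUE)"
)


def newest_id(conn):
    """Stand-in for last_id('bioentry'); None while the table is empty."""
    return conn.execute("SELECT MAX(bioentry_id) FROM bioentry").fetchone()[0]


def checked_insert(conn, accession):
    """Return the new bioentry_id, or None if the insert was skipped."""
    before = newest_id(conn)
    conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession) VALUES (?)", (accession,)
    )
    after = newest_id(conn)
    if after == before:
        # Nothing was inserted, so there is no id to attach features to.
        return None
    return after


id_s1 = checked_insert(conn, "ENST00000334859")
id_s2 = checked_insert(conn, "ENST00000391466")
id_s3 = checked_insert(conn, "ENST00000334859")  # duplicate of s1
```

A caller would then skip the feature, annotation and dbxref inserts (or raise an error) whenever the returned id is None. The before/after comparison is only safe when no other connection is inserting at the same time.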

Andrea



From bugzilla-daemon at portal.open-bio.org  Wed May 20 16:25:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 20 May 2009 16:25:39 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905202025.n4KKPdYT020904@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-20 16:25 EST -------
(In reply to comment #0)
> Biopython 1.50 (also 1.50b it's the same code)
> python2.4 or python2.5
> postgresql 8.3
> BioSQL Schema 1.0.1
> 
> Problem: 
>  imagine to have 3 seqrecord (s1,s2,s3), ... load a Biosql db in this order:
>  - db.load([s1])
>  - db.load([s2])
>  - db.load([s3])
> 
>  At the end of the loading i will have only 2 bioentry ID 
>  BUT the s3.features will be inserted on s2 seqrecord.

BioSQL will allow you to have multiple versions of the same record but they
must have different versions (e.g. s1.id="ENST00000334859.0" and
s3.id="ENST00000334859.1" should work). The problem with your data is s1.id ==
s3.id, so I would expect them to get the same accession and version (taken as
zero).  Therefore s3 should *fail* to load.
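
To illustrate the accession/version point, here is a small sketch of how a record id could be split, with the version taken as zero when there is no numeric suffix. The helper split_accession_version is hypothetical and is not necessarily how the BioSQL loader does it.

```python
def split_accession_version(record_id):
    """Split "ACCESSION.N" into (accession, N); default the version to 0."""
    accession, sep, version = record_id.rpartition(".")
    if sep and version.isdigit():
        return accession, int(version)
    # No dot, or a non-numeric suffix: keep the whole id, version zero.
    return record_id, 0


key_s1 = split_accession_version("ENST00000334859")
key_s3 = split_accession_version("ENST00000334859")
key_v1 = split_accession_version("ENST00000334859.1")
```

Here key_s1 and key_s3 are the same (accession, version) pair, which is why the second insert collides, while key_v1 differs only in its version.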

I can try and reproduce this using the information given, but it would help if
you could attach the original sequence files to this bug.

Thanks,

Peter



From bugzilla-daemon at portal.open-bio.org  Wed May 20 17:07:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 20 May 2009 17:07:08 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905202107.n4KL78te024053@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-20 17:07 EST -------
(In reply to comment #0)
> Biopython 1.50 (also 1.50b it's the same code)
> python2.4 or python2.5
> postgresql 8.3
> BioSQL Schema 1.0.1

What version of psycopg are you using? i.e. The python library for talking to
PostgreSQL.

Have you tried running Biopython's BioSQL unit tests?  You'll need to configure
your settings in setup_BioSQL.py first.

If that looks good could you try updating to the latest Biopython from CVS and
retesting? I've added a basic check in test_BioSQL.py for duplicated entries
(using a GenBank file) which works on my machine using MySQL.



From bugzilla-daemon at portal.open-bio.org  Thu May 21 06:31:42 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:31:42 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211031.n4LAVgvW019852@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #3 from andrea at biodec.com  2009-05-21 06:31 EST -------
Created an attachment (id=1299)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1299&action=view)
Pickled Seqrecord s1



From bugzilla-daemon at portal.open-bio.org  Thu May 21 06:32:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:32:12 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211032.n4LAWBXC019888@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #4 from andrea at biodec.com  2009-05-21 06:32 EST -------
Created an attachment (id=1300)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1300&action=view)
Pickled Seqrecord s2



From bugzilla-daemon at portal.open-bio.org  Thu May 21 06:32:28 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:32:28 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211032.n4LAWSlA019903@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #5 from andrea at biodec.com  2009-05-21 06:32 EST -------
Created an attachment (id=1301)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1301&action=view)
Pickled Seqrecord s3



From bugzilla-daemon at portal.open-bio.org  Thu May 21 06:34:46 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:34:46 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211034.n4LAYkhC020056@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #6 from andrea at biodec.com  2009-05-21 06:34 EST -------
Hi Peter,
I did 4 tests: [python2.4,python2.5]*[psycopg,psycopg2]
with 
 - biopython from "this morning" cvs.
 - psycopg.__version__  '1.1.21'
 - psycopg2.__version__ '2.0.7 (dec mx dt ext pq3)'

In every case I get the same results:

Make sure all records are correctly loaded. ... ok
Make sure can't import records twice. ... FAIL
Indepth check that SeqFeatures are transmitted through the db. ... ok
Load SeqRecord objects into a BioSQL database. ... ok
Get a list of all items in the database. ... ok
Test retrieval of items using various ids. ... ok
Check can add DBSeq objects together. ... ok
Check can turn a DBSeq object into a Seq or MutableSeq. ... ok
Make sure Seqs from BioSQL implement the right interface. ... ok
Check SeqFeatures of a sequence. ... ok
Make sure SeqRecords from BioSQL implement the right interface. ... ok
Check that slices of sequences are retrieved properly. ... ok

======================================================================
FAIL: Make sure can't import records twice.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 374, in test_reload
    self.assert_("duplicate" in str(err).lower())
AssertionError

----------------------------------------------------------------------
Ran 12 tests in 23.815s

FAILED (failures=1)

I have 1 failure in "Make sure can't import records twice. ...", which seems
relevant to the problem...


Then I tried the following with python2.4, python2.5, psycopg and psycopg2.

I attached the pickles of the 3 SeqRecords so you can try it yourself...

###########################################################
from BioSQL import BioSeqDatabase
import cPickle

server = BioSeqDatabase.open_database(driver = "psycopg2", user = 'postgres',
passwd = "hidden", host = "dbservertest", db = 'test_biosql' )

## LOAD SeqRecords from pickle
s1=cPickle.load(open('s1.cpk'))
s2=cPickle.load(open('s2.cpk'))
s3=cPickle.load(open('s3.cpk'))

## LOAD INTO DB 
db=server.new_database('test')
server.commit()
db.load([s1])
db.load([s2])
db.load([s3])
db.adaptor.commit()
###########################################################


I always had the same problem.

So I prepared a buildout environment with the latest Biopython
and with a newer psycopg2 library (for psycopg I already had the latest).

psycopg2.__version__ '2.0.11 (dt dec ext pq3)'

The result from the test was the same.
The result from the upload (based on pickled seqrecords) was the same.

Thanks
Andrea



From bugzilla-daemon at portal.open-bio.org  Thu May 21 06:39:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:39:18 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211039.n4LAdIit020365@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-21 06:39 EST -------
(In reply to comment #6)
> Hi Peter,
> i did 4 tests: [python2.4,python2.5]*[psycopg,psycopg2]
> with 
>  - biopython from "this morning" cvs.
>  - psycopg.__version__  '1.1.21'
>  - psycopg2.__version__ '2.0.7 (dec mx dt ext pq3)'
> 
> in any case i've the same results:
> 
> Make sure all records are correctly loaded. ... ok
> Make sure can't import records twice. ... FAIL
> ...
> ======================================================================
> FAIL: Make sure can't import records twice.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_BioSQL.py", line 374, in test_reload
>     self.assert_("duplicate" in str(err).lower())
> AssertionError

OK - the unit test is doing what I expected, and the duplicate insertion
is failing. It's just that the error message is different from what I expected,
which should be trivial to fix. This means inserting the same GenBank record
twice fails (which is good).

However, the unit test doesn't reproduce your original issue. Hopefully your
pickled SeqRecord objects will help there...

Thanks,

Peter



From bugzilla-daemon at portal.open-bio.org  Thu May 21 07:36:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 07:36:34 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211136.n4LBaYO8024199@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-21 07:36 EST -------
(In reply to comment #7)
> However, the unit test doesn't reproduce your original issue. Hopefully
> your pickled SeqRecord objects will help there...

Based on your example script in comment 6 with the pickled SeqRecord objects,
but using MySQL, I get an IntegrityError as expected:

Traceback (most recent call last):
...
IntegrityError: (1062, "Duplicate entry 'ENST00000334859-2-0' for key 2")

I get the same error with simplified records lacking any annotation or features
(I just saved your three records to a FASTA file and reloaded them). So whatever
is going wrong seems to be PostgreSQL specific (or at least, does not
affect MySQL).

I have updated test_BioSQL.py in CVS to cover more variations (revision 1.33),
and hopefully the error message check should work on PostgreSQL as well. It
would be very helpful if you could test that.

Part of the new tests is a slight variation on your original example.  Could
you try this:

db.load([s1])
server.commit()
db.load([s2])
server.commit()
db.load([s3])
server.commit()

This might tell us if the issue is with PostgreSQL not checking the key
constraints until the commit.

Peter



From chapmanb at 50mail.com  Thu May 21 08:29:27 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 21 May 2009 08:29:27 -0400
Subject: [Biopython-dev] Repeated options in command line interfaces
In-Reply-To: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com>
References: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com>
Message-ID: <20090521122927.GM84112@sobchak.mgh.harvard.edu>

Hi Peter;

> Yes - its another thread about command line wrappers!

It seems like y'all are unearthing every single crazy command line
option choice out there. Great to have this fleshed out.

> One of the Roche 454 off instrument applications is runMapping,
> which in the most general situation allows you to map one or
> more SFF files onto one or more FASTA files, e.g.
> 
> runMapping -o ~/test -ref example1.fasta example2.fasta -read
> data1.sff data2.sff
[...]
> the --seed parameter in Mafft, which is used to specify one or more
> alignment files, e.g.
> 
> mafft ... --seed alignment1 --seed alignment2 --seed alignment3 ...
> 
> Notice that "--seed" is repeated before each value.
> 
> I was thinking it would be nice to treat this as a single
> property (seed) which takes a list of strings as its value:
> 
> from Bio.Align.Applications import MafftCommandline
> cline = MafftCommandline()
> cline.seed = ["alignment1", "alignment2", ...]
[...]
> #These modules don't exist (yet):
> from Bio.Sequencing.Applications import RunMappingCommandline
> cline = RunMappingCommandline()
> cline.ref = ["example1.fasta", "example2.fasta"]
> cline.read = ["data1.sff", "data2.sff"]

This makes good sense to me. It hides the actual nastiness a bit and
makes it clear in the code what is happening -- assigning multiple
parameters to a single option. It sounds like a great way to handle
it.
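
As a rough sketch of that idea (the function name expand_option and the
repeat_flag switch are hypothetical, not part of the existing wrapper code),
a list-valued property could be expanded into arguments like this:

```python
def expand_option(flag, values, repeat_flag=True):
    """Turn one list-valued property into command line arguments.

    repeat_flag=True  -> flag repeated before each value (Mafft --seed style)
    repeat_flag=False -> flag given once, followed by all values
                         (runMapping -ref / -read style)
    """
    if isinstance(values, str):
        values = [values]  # accept a single string as a convenience
    values = list(values)
    if not values:
        return []  # an unset option contributes nothing to the command line
    if repeat_flag:
        args = []
        for value in values:
            args.extend([flag, value])
        return args
    return [flag] + values


seed_args = expand_option("--seed", ["alignment1", "alignment2"])
ref_args = expand_option("-ref", ["example1.fasta", "example2.fasta"],
                         repeat_flag=False)
```

Here seed_args repeats "--seed" before each file name, while ref_args starts
with a single "-ref" followed by both FASTA files, so the calling code only
ever deals with a plain Python list.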

Brad

From bugzilla-daemon at portal.open-bio.org  Thu May 21 11:04:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 11:04:40 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211504.n4LF4ej0015238@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #9 from andrea at biodec.com  2009-05-21 11:04 EST -------
(In reply to comment #8)
> (In reply to comment #7)
> > However, the unit test doesn't reproduce your original issue. Hopefully
> > your pickled SeqRecord objects will help there...
> 
> Based on your example script in comment 6 with the pickled SeqRecord objects,
> but using MySQL, I get an IntegrityError as expected:
> 
> Traceback (most recent call last):
> ...
> IntegrityError: (1062, "Duplicate entry 'ENST00000334859-2-0' for key 2")
> 
> I get the same error with simplified records lacking any annotation or features
> (I just saved your three records to a FASTA file and reloaded them). So what
> ever is going wrong seems to be PostgreSQL specific (or at least, does not
> affect MySQL).

In my view, what is postgres specific is the fact that I don't get any
error at all. If biopython expects an error from postgres in this
situation, there is some problem in postgres (or in my setup).

> 
> I have updated test_BioSQL.py in CVS to cover more variations (revision 1.33),
> and hopefully the error message check should work on PostgreSQL as well. It
> would be very helpful if you could test that.

These are the results of the test; they are the same on python2.4 and python2.5:
Make sure can't import records with same ID (in one go). ... FAIL
Make sure can't import records with same ID (in steps). ... FAIL
Make sure can't import records with same ID (in steps with commit). ... FAIL
Make sure can't import a single record twice (in one go). ... FAIL
Make sure can't import a single record twice (in steps). ... FAIL
Make sure can't import a single record twice (in steps with commit). ... FAIL
Make sure all records are correctly loaded. ... ok
Make sure can't reimport existing records. ... FAIL
Indepth check that SeqFeatures are transmitted through the db. ... ok
Load SeqRecord objects into a BioSQL database. ... ok
Get a list of all items in the database. ... ok
Test retrieval of items using various ids. ... ok
Check can add DBSeq objects together. ... ok
Check can turn a DBSeq object into a Seq or MutableSeq. ... ok
Make sure Seqs from BioSQL implement the right interface. ... ok
Check SeqFeatures of a sequence. ... ok
Make sure SeqRecords from BioSQL implement the right interface. ... ok
Check that slices of sequences are retrieved properly. ... ok

======================================================================
FAIL: Make sure can't import records with same ID (in one go).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 397, in test_duplicate_id_load
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't import records with same ID (in steps).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 410, in test_duplicate_id_load2
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't import records with same ID (in steps with commit).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 424, in test_duplicate_id_load3
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't import a single record twice (in one go).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 361, in test_duplicate_load
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't import a single record twice (in steps).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 373, in test_duplicate_load2
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't import a single record twice (in steps with commit).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 386, in test_duplicate_load3
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't reimport existing records.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 463, in test_reload
    err.__class__.__name__ + "\n" + str(err))
AssertionError: OperationalError
currval of sequence "bioentry_pk_seq" is not yet defined in this session


----------------------------------------------------------------------
Ran 18 tests in 26.938s

FAILED (failures=7)


> 
> Part of the new tests is a slight variation on your original example.  Could
> you try this:
> 
> db.load([s1])
> server.commit()
> db.load([s2])
> server.commit()
> db.load([s3])
> server.commit()
> 
>>> ## LOAD INTO DB
>>> db.load([s1])
1
>>> server.commit()
>>> db.load([s2])
1
>>> server.commit()
>>> db.load([s3])
1
>>> server.commit()
>>>
i don't have any errors!!!

> This might tell us if the issue is with PostgreSQL not checking the key
> constraints until the commit.
> 
It seems so. If I try to do the insertion via SQL I don't get any
errors. I just get a message like:
INSERT 0 0
because postgres doesn't insert anything.

Andrea



From bugzilla-daemon at portal.open-bio.org  Thu May 21 13:05:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 13:05:12 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211705.n4LH5Ca6028981@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-21 13:05 EST -------
Well, some progress :)

(In reply to comment #9)
> These are the results of the test; they are the same on python2.4 and python2.5:
> Make sure can't import records with same ID (in one go). ... FAIL
> Make sure can't import records with same ID (in steps). ... FAIL
> Make sure can't import records with same ID (in steps with commit). ... FAIL
> Make sure can't import a single record twice (in one go). ... FAIL
> Make sure can't import a single record twice (in steps). ... FAIL
> Make sure can't import a single record twice (in steps with commit). ... FAIL
> Make sure all records are correctly loaded. ... ok
> Make sure can't reimport existing records. ... FAIL
> Indepth check that SeqFeatures are transmitted through the db. ... ok
> Load SeqRecord objects into a BioSQL database. ... ok
> Get a list of all items in the database. ... ok
> Test retrieval of items using various ids. ... ok
> Check can add DBSeq objects together. ... ok
> Check can turn a DBSeq object into a Seq or MutableSeq. ... ok
> Make sure Seqs from BioSQL implement the right interface. ... ok
> Check SeqFeatures of a sequence. ... ok
> Make sure SeqRecords from BioSQL implement the right interface. ... ok
> Check that slices of sequences are retrieved properly. ... ok
> 
> ======================================================================
> FAIL: Make sure can't import records with same ID (in one go).
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_BioSQL.py", line 397, in test_duplicate_id_load
>     err.__class__.__name__ + "\n" + str(err))
> AssertionError: Exception
> Should have failed!
> ...

Also the error formatting wasn't quite what I had intended, fixed in CVS.
However, most of the tests are allowing duplicates to be recorded without any
error (on PostgreSQL).  This is bad.

> ======================================================================
> FAIL: Make sure can't reimport existing records.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_BioSQL.py", line 463, in test_reload
>     err.__class__.__name__ + "\n" + str(err))
> AssertionError: OperationalError
> currval of sequence "bioentry_pk_seq" is not yet defined in this session

Interestingly the final test gives us an OperationalError about the bioentry
table's primary key (presumably from our last_id method which would call the
SQL statement "select currval('bioentry_pk_seq')"). This suggests some clues
about what is going wrong.

http://www.postgresql.org/docs/8.3/static/functions-sequence.html
http://www.postgresql.org/docs/8.3/static/sql-createsequence.html

See also:
http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-pg.sql

CREATE SEQUENCE bioentry_pk_seq;
CREATE TABLE bioentry ( 
         bioentry_id INTEGER DEFAULT nextval ( 'bioentry_pk_seq' ) NOT NULL , 
         biodatabase_id INTEGER NOT NULL , 
         taxon_id INTEGER , 
         name VARCHAR ( 40 ) NOT NULL , 
         accession VARCHAR ( 128 ) NOT NULL , 
         identifier VARCHAR ( 40 ) , 
         division VARCHAR ( 6 ) , 
         description TEXT , 
         version INTEGER NOT NULL , 
         PRIMARY KEY ( bioentry_id ) , 
         UNIQUE ( accession , biodatabase_id , version ) , 
-- CONFIG: uncomment one (and only one) of the two lines below. The
-- first puts a uniqueness constraint on the identifier column alone;
-- the other one puts a uniqueness constraint on identifier only
-- within a namespace.
--       UNIQUE ( identifier ) 
         UNIQUE ( identifier , biodatabase_id ) 
) ; 

CREATE INDEX bioentry_name ON bioentry ( name ); 
CREATE INDEX bioentry_db ON bioentry ( biodatabase_id ); 
CREATE INDEX bioentry_tax ON bioentry ( taxon_id );


I'm a little surprised all the other duplicate record tests show different
behaviour. I have updated test_BioSQL.py to perform all these new duplicate
tests on a clean database - which I probably should have done in the first
place (CVS revision 1.35).

[All these tests are passing on MySQL. Trying the example by hand triggers an
IntegrityError.]

Peter



From bugzilla-daemon at portal.open-bio.org  Thu May 21 18:22:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 18:22:18 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905212222.n4LMMIls028194@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #11 from andrea at biodec.com  2009-05-21 18:22 EST -------
So the problem is related to the different behaviour of postgres loaded
with the biosql schema, compared with mysql.

Sorry, I thought the problem was due to BioSQL because I didn't know
what the "expected database behaviour" was.

Since we expect an error during insertion of a "duplicate" or "near duplicate"
record, we only have to focus on the postgres biosql schema, and why/where it
differs from the mysql one.

I haven't had time to look at the differences between the various
"duplicate record tests". I will do so.

[I've tried postgres 8.4... and it's exactly the same]



From cy at cymon.org  Thu May 21 18:52:39 2009
From: cy at cymon.org (Cymon Cox)
Date: Thu, 21 May 2009 23:52:39 +0100
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <200905212222.n4LMMIls028194@portal.open-bio.org>
References: <bug-2833-42@http.bugzilla.open-bio.org/>
	<200905212222.n4LMMIls028194@portal.open-bio.org>
Message-ID: <7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com>

2009/5/21 <bugzilla-daemon at portal.open-bio.org>

> http://bugzilla.open-bio.org/show_bug.cgi?id=2833
>
>
>
>
>
> ------- Comment #11 from andrea at biodec.com  2009-05-21 18:22 EST -------
> So the problem is related to the different behaviour adopted by postgres
> loaded with the biosql schema, with respect to mysql.
>
> Sorry because i thought the problem was due to BioSQL because i didn't know
> which was the "expected database behaviour".
>
> Since we expect an error during insertion of a "duplicate" or "near
> duplicate" record... we have only to focus on the postgres biosql schema,
> and why/where it differs from the mysql one.
>
> I didn't have time to have a look to the difference between the various
> "duplicate record tests". I will do.
>
> [i've tried postgres 8.4... and it's exactly the same]


Hi Andrea,

The problem appears to be related to the BioSQL schema/PostgreSQL.

As you indicated, adding a duplicate entry to bioentry returns an "INSERT 0
0" and doesn't throw an IntegrityError, which is what the code is looking for
and presumably what MySQL throws.

The reason it doesn't throw an error is because of one (or both) of the RULES
in the schema:

rule_bioentry_i1
and/or
rule_bioentry_i2

If you delete these two rules, load the schema and try to do a duplicate
entry:

mytest=# insert into bioentry(bioentry_id, biodatabase_id, name, accession,
version) values (2, 1, 'blah1', 'test4', 1);
INSERT 0 1
mytest=# select * from bioentry;
 bioentry_id | biodatabase_id | taxon_id | name  | accession | identifier |
division | description | version
-------------+----------------+----------+-------+-----------+------------+----------+-------------+---------
           2 |              1 |          | blah1 | test4     |
|          |             |       1
(1 row)

mytest=# insert into bioentry(bioentry_id, biodatabase_id, name, accession,
version) values (2, 1, 'blah1', 'test4', 1);
ERROR:  duplicate key value violates unique constraint "bioentry_pkey"

we get an error rather than an "INSERT 0 0".

I'm going to assume that psycopg2 would pick up this error and throw an
IntegrityError, but I haven't taken it any further to check.
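
The two behaviours can be mimicked without a PostgreSQL server. The sketch
below uses Python's sqlite3 module purely as a stand-in, with a cut-down
bioentry table: a plain INSERT of a duplicate raises IntegrityError, while
INSERT OR IGNORE (which here only plays the role of the schema's rules; it is
an analogy, not how the rules are implemented) silently affects zero rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry ("
    " bioentry_id INTEGER PRIMARY KEY,"
    " biodatabase_id INTEGER NOT NULL,"
    " accession TEXT NOT NULL,"
    " version INTEGER NOT NULL,"
    " UNIQUE (accession, biodatabase_id, version))"
)
columns = "(biodatabase_id, accession, version) VALUES (?, ?, ?)"
row = (1, "test4", 1)
conn.execute("INSERT INTO bioentry " + columns, row)

# Without any "do nothing" rule: the duplicate is rejected loudly.
try:
    conn.execute("INSERT INTO bioentry " + columns, row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# With "do nothing on conflict" semantics: no error, zero rows affected,
# which is the analogue of PostgreSQL reporting "INSERT 0 0".
cur = conn.execute("INSERT OR IGNORE INTO bioentry " + columns, row)
rows_affected = cur.rowcount
row_count = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
```

In the second case the only hint that nothing was stored is the row count of
the statement, so code that waits for an exception never notices the duplicate.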

Cheers, C.

From hlapp at gmx.net  Thu May 21 22:05:17 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 21 May 2009 22:05:17 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com>
References: <bug-2833-42@http.bugzilla.open-bio.org/>
	<200905212222.n4LMMIls028194@portal.open-bio.org>
	<7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com>
Message-ID: <8C0BF1E3-15DF-4F89-AB57-7AE09B86BCCE@gmx.net>


On May 21, 2009, at 6:52 PM, Cymon Cox wrote:

> [...]
>
> Hi Andrea,
>
> The problem appears to be related to the BioSQL schema/PostGreSQL.
>
> As you indicated, adding a duplicate entry to bioentry returns an
> "INSERT 0 0" and doesn't throw an IntegrityError, which is what the code
> is looking for and presumably what MySQL throws.
>
> The reason it doesn't throw an error is because of one (or both) of
> the RULES in the schema:

Indeed, I'd almost forgotten. The rules are there mostly as a remnant  
from earlier versions of PostgreSQL to support transactional loading  
the way bioperl-db (the object-relational mapping for BioPerl) is  
optimized. You probably don't need them anywhere else.

	-hilmar

<gory-details>
Bioperl-db is optimized such that entities that very likely don't  
exist yet in the database are attempted for insert right away. If the  
insert fails due to a unique key violation, the record is looked up  
(and then expected to be found). In Oracle and MySQL you can do this  
and the transaction remains healthy; i.e., you can commit the  
transaction later and all statements except those that failed will be  
committed. In PostgreSQL any failed statement dooms the entire  
transaction, and the only way out is a rollback. In this case, if you  
want the loading of one sequence record as one transaction, failing to  
insert a single feature record will doom the entire sequence load and  
you would need to start over with the sequence. To fix this, I wrote  
the rules, which in essence do do the lookups for PostgreSQL that the  
bioperl-db code would otherwise avoid, and on insert do nothing if the  
record is found, which results in zero rows affected when you would  
expect one (which is what bioperl-db cues off of and then triggers a  
lookup).
The right way to do this nowadays is to use nested transactions,
which PostgreSQL has supported since v8.0.x, but I haven't gotten around to
implementing support for that in Bioperl-db.
</gory-details>
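
For readers who want to see the "insert first, look up on failure" pattern
with nested transactions, here is a minimal sketch. It uses Python's sqlite3
module as a stand-in so that it runs without a server; the function name
insert_or_lookup and the cut-down bioentry table are hypothetical. PostgreSQL
provides the same SAVEPOINT, ROLLBACK TO SAVEPOINT and RELEASE SAVEPOINT
statements, but this is not Bioperl-db or Biopython code:

```python
import sqlite3


def insert_or_lookup(conn, accession, version):
    """Attempt the insert; on a unique-key violation undo only that
    statement via a savepoint and look up the existing row instead.

    Returns (bioentry_id, was_inserted).
    """
    cur = conn.cursor()
    cur.execute("SAVEPOINT try_insert")
    try:
        cur.execute(
            "INSERT INTO bioentry (accession, version) VALUES (?, ?)",
            (accession, version),
        )
        new_id = cur.lastrowid
        cur.execute("RELEASE SAVEPOINT try_insert")
        return new_id, True
    except sqlite3.IntegrityError:
        # Roll back just the failed insert; the outer transaction survives.
        cur.execute("ROLLBACK TO SAVEPOINT try_insert")
        cur.execute("RELEASE SAVEPOINT try_insert")
        cur.execute(
            "SELECT bioentry_id FROM bioentry"
            " WHERE accession = ? AND version = ?",
            (accession, version),
        )
        return cur.fetchone()[0], False


# isolation_level=None lets us issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE bioentry ("
    " bioentry_id INTEGER PRIMARY KEY,"
    " accession TEXT NOT NULL,"
    " version INTEGER NOT NULL,"
    " UNIQUE (accession, version))"
)
conn.execute("BEGIN")
first = insert_or_lookup(conn, "ENST00000334859", 0)   # new row
second = insert_or_lookup(conn, "ENST00000334859", 0)  # duplicate, looked up
third = insert_or_lookup(conn, "ENST00000334859", 1)   # transaction still usable
conn.execute("COMMIT")
stored = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
```

The design point is that the duplicate still raises an error, so callers
that rely on the exception keep working, while the surrounding transaction
can carry on and commit the statements that succeeded.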

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From bugzilla-daemon at portal.open-bio.org  Thu May 21 23:56:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 23:56:13 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905220356.n4M3uDfM021127@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #12 from cymon.cox at gmail.com  2009-05-21 23:56 EST -------
After deleting the RULES in the BioSQL schema, all the new unittests pass.

(All the RULES can be deleted as they are all there to circumvent the problem
in Bioperl-db described by Hilmar Lapp on the biopython-dev list:

http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html

See also the comment in the schema.)

C.



From bugzilla-daemon at portal.open-bio.org  Fri May 22 04:41:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 04:41:39 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905220841.n4M8fd3w015716@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #13 from andrea at biodec.com  2009-05-22 04:41 EST -------
(In reply to comment #12)
> After deleting the RULES in the BioSQL schema, all the new unittests pass.
> 
> (All the RULES can be deleted as they are all there to circumvent the problem
> in Bioperl-db described by Hilmar Lapp on the biopython-dev list:
> 
> http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html
> 
> See also the comment in the schema.)
> 
> C.

I've deleted the two rules, 
rule_bioentry_i1
rule_bioentry_i2

and then I ran the tests:
Make sure can't import records with same ID (in one go). ... ok
Make sure can't import records with same ID (in steps). ... ok
Make sure can't import records with same ID (in steps with commit). ... ok
Make sure can't import a single record twice (in one go). ... ok
Make sure can't import a single record twice (in steps). ... ok
Make sure can't import a single record twice (in steps with commit). ... ok
Make sure all records are correctly loaded. ... ok
Make sure can't reimport existing records. ... ok
Indepth check that SeqFeatures are transmitted through the db. ... ok
Load SeqRecord objects into a BioSQL database. ... ok
Get a list of all items in the database. ... ok
Test retrieval of items using various ids. ... ok
Check can add DBSeq objects together. ... ok
Check can turn a DBSeq object into a Seq or MutableSeq. ... ok
Make sure Seqs from BioSQL implement the right interface. ... ok
Check SeqFeatures of a sequence. ... ok
Make sure SeqRecords from BioSQL implement the right interface. ... ok
Check that slices of sequences are retrieved properly. ... ok

----------------------------------------------------------------------
Ran 18 tests in 58.371s

OK

with python2.4, python2.5, psycopg, psycopg2.
Everything seems to be OK. I don't know what other effects could be
triggered by this deletion, but I think it should be applied as soon as
possible to the BioSQL Schema/PostgreSQL (also updating the Test BioSQL
schema/PostgreSQL).


After removing the rules I've run my own tests:
.....
>>> ## LOAD INTO DB
>>> db.load([s1])
1
>>> db.load([s2])
1
>>> db.load([s3])
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "../BioSQL/BioSeqDatabase.py", line 442, in load
  File "../BioSQL/Loader.py", line 50, in load_seqrecord
  File "../BioSQL/Loader.py", line 550, in _load_bioentry_table
  File "../BioSQL/BioSeqDatabase.py", line 301, in execute
IntegrityError: duplicate key value violates unique constraint
"bioentry_accession_key"

And I got the error, which is the expected, normal behaviour.
So now I only have to trap the exception or pre-check for duplicates.
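
A minimal sketch of the "trap the exception" option, loading records one at a
time so that a duplicate only costs that one record. FakeDatabase and
DuplicateError are hypothetical stand-ins (not the BioSQL API) so the example
runs on its own; with a real database the except clause would name the
driver's IntegrityError. The rollback call is there because, as Hilmar
explained, a failed statement dooms the current transaction on PostgreSQL.

```python
class DuplicateError(Exception):
    """Hypothetical stand-in for the database driver's IntegrityError."""


class FakeDatabase:
    """Hypothetical stand-in for a BioSQL sub-database; not the real API."""

    def __init__(self):
        self.accessions = set()
        self.rollbacks = 0

    def load(self, records):
        # Mimic a unique constraint on the accession.
        for accession in records:
            if accession in self.accessions:
                raise DuplicateError(accession)
            self.accessions.add(accession)
        return len(records)

    def rollback(self):
        self.rollbacks += 1


def load_skipping_duplicates(db, records):
    """Load records one by one; collect duplicates instead of aborting."""
    loaded, skipped = 0, []
    for record in records:
        try:
            loaded += db.load([record])
        except DuplicateError:
            db.rollback()  # clear the failed transaction before continuing
            skipped.append(record)
    return loaded, skipped


db = FakeDatabase()
result = load_skipping_duplicates(db, ["s1", "s2", "s1"])
```

The pre-check alternative would query for the accession before calling load;
trapping the exception avoids the extra query when duplicates are rare.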

Many Thanks
Andrea



From bugzilla-daemon at portal.open-bio.org  Fri May 22 08:06:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 08:06:36 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905221206.n4MC6aWo000368@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-22 08:06 EST -------
(In reply to comment #13)
> (In reply to comment #12)
> > After deleting the RULES in the BioSQL schema, all the new unittests pass.
> > 
> > (All the RULES can be deleted as they are all there to circumvent the
> > problem in Bioperl-db described by Hilmar Lapp on the biopython-dev list:
> > 
> > http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html
> > 
> > See also the comment in the schema.)
> > 
> > C.

Well spotted Cymon - I'd missed that.

> I've deleted the two rules, 
> rule_bioentry_i1
> rule_bioentry_i2
> 
> ...
> with python2.4, python2.5, psycopg, psycopg2.
> Everything seems to be ok.
> ...
> After removing the rules i've run my own tests:
> .....
> >>> ## LOAD INTO DB
> >>> db.load([s1])
> 1
> >>> db.load([s2])
> 1
> >>> db.load([s3])
> Traceback (most recent call last):
>   File "<console>", line 1, in ?
>   File "../BioSQL/BioSeqDatabase.py", line 442, in load
>   File "../BioSQL/Loader.py", line 50, in load_seqrecord
>   File "../BioSQL/Loader.py", line 550, in _load_bioentry_table
>   File "../BioSQL/BioSeqDatabase.py", line 301, in execute
> IntegrityError: duplicate key value violates unique constraint
> "bioentry_accession_key"
> 
> And I got the error, which is the expected, normal behaviour.
> So now I only have to trap the exception or pre-check for duplicates.

Great.

It will be down to BioSQL to change the schema (in conjunction with BioPerl),
but Hilmar seems to be looking into this:
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html

I suppose in the short term we could change our local copy of the schema used
in the Biopython unit tests...

Peter



From biopython at maubp.freeserve.co.uk  Fri May 22 08:27:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 13:27:06 +0100
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
Message-ID: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>

Hi all,

This is a continuation of a thread / bug report from Biopython (Bug 2833)
where attempting to import duplicate entries into BioSQL did not raise an
error on PostgreSQL (but does on MySQL). Cymon traced this to the
RULES present in the schema to help bioperl-db.

On Fri, May 22, 2009 at 3:05 AM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 21, 2009, at 6:52 PM, Cymon Cox wrote:
>
>> [...]
>>
>> Hi Andrea,
>>
>> The problem appears to be related to the BioSQL schema/PostGreSQL.
>>
>> As you indicated, adding a duplicate entry to bioentry returns a "INSERT 0
>> 0" and doesn't throw an IntegrityError which is what the code is looking
>> for and presumably what MySQL throws.
>>
>> The reason it doesn't throw an error is because of one (or both) of the
>> RULES in the schema:
>
> Indeed, I'd almost forgotten. The rules are there mostly as a remnant from
> earlier versions of PostgreSQL to support transactional loading the way
> bioperl-db (the object-relational mapping for BioPerl) is optimized. You
> probably don't need them anywhere else.
>
>         -hilmar
>
> <gory-details>
> Bioperl-db is optimized such that entities that very likely don't exist yet
> in the database are attempted for insert right away. If the insert fails due
> to a unique key violation, the record is looked up (and then expected to be
> found). In Oracle and MySQL you can do this and the transaction remains
> healthy; i.e., you can commit the transaction later and all statements
> except those that failed will be committed. In PostgreSQL any failed
> statement dooms the entire transaction, and the only way out is a rollback.
> In this case, if you want the loading of one sequence record as one
> transaction, failing to insert a single feature record will doom the entire
> sequence load and you would need to start over with the sequence. To fix
> this, I wrote the rules, which in essence do do the lookups for PostgreSQL
> that the bioperl-db code would otherwise avoid, and on insert do nothing if
> the record is found, which results in zero rows affected when you would
> expect one (which is what bioperl-db cues off of and then triggers a
> lookup).
> The right way to do this meanwhile is to use nested transactions, which
> PostgreSQL supports since v8.0.x, but I haven't gotten around to implement
> support for that in Bioperl-db.
> </gory-details>

Hilmar,

It seems for Biopython to work properly with BioSQL on PostgreSQL
these bioentry rules should be removed from the schema (as the
comments in the schema do suggest). Obviously doing this would
break any installation also using the current version of bioperl-db.

Do the RULES affect BioJava or BioRuby using BioSQL on
PostgreSQL?

Are you happy to remove these RULES in BioSQL v1.0.x (after
making the outlined transactional changes in bioperl-db)?

Thanks,

Peter


From hlapp at gmx.net  Fri May 22 11:03:11 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 May 2009 11:03:11 -0400
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
Message-ID: <CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>


On May 22, 2009, at 8:27 AM, Peter wrote:

> Are you happy to remove these RULES in BioSQL v1.0.x (after
> making the outlined transactional changes in bioperl-db)?

In principle yes. It would also mean dropping support for PostgreSQL  
v7.x, but I would hope that that's a non-issue.

But if anyone here is still using and relying on PostgreSQL v7.x (or  
earlier?) do let us know, please.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri May 22 11:57:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 16:57:38 +0100
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
Message-ID: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>

On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 22, 2009, at 8:27 AM, Peter wrote:
>
>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>> making the outlined transactional changes in bioperl-db)?
>
> In principle yes. It would also mean dropping support for PostgreSQL v7.x,
> but I would hope that that's a non-issue.
>
> But if anyone here is still using and relying on PostgreSQL v7.x (or
> earlier?) do let us know, please.

Great.

In the meantime could you add a big warning about this issue to the
INSTALL notes for PostgreSQL (i.e. recommend removing the RULES
section if not using bioperl-db)?
http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL

Peter

From biopython at maubp.freeserve.co.uk  Fri May 22 12:06:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 17:06:21 +0100
Subject: [Biopython-dev] Peter at a conference next week
Message-ID: <320fb6e00905220906l2446afbfk9804599db74a4d66@mail.gmail.com>

Hi all,

Just to let you know I will be at a conference next week, so don't
expect (Biopython) email replies as promptly as usual. I may even
leave my laptop at home ;)

Peter

From hlapp at gmx.net  Fri May 22 14:20:58 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 May 2009 14:20:58 -0400
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
	<320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
Message-ID: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>

Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar

On May 22, 2009, at 11:57 AM, Peter wrote:

> On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>>
>> On May 22, 2009, at 8:27 AM, Peter wrote:
>>
>>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>>> making the outlined transactional changes in bioperl-db)?
>>
>> In principle yes. It would also mean dropping support for  
>> PostgreSQL v7.x,
>> but I would hope that that's a non-issue.
>>
>> But if anyone here is still using and relying on PostgreSQL v7.x (or
>> earlier?) do let us know, please.
>
> Great.
>
> In the meantime could you add a big warning about this issue to the
> INSTALL notes for PostgreSQL (i.e. recommend removing the RULES
> section if not using bioperl-db)?
> http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL
>
> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From bugzilla-daemon at portal.open-bio.org  Fri May 22 14:37:21 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 14:37:21 -0400
Subject: [Biopython-dev] [Bug 2837] New: Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
Message-ID: <bug-2837-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837

           Summary: Reading Roche 454 SFF sequence read files in Bio.SeqIO
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


Roche 454 sequencing returns the read data in SFF files, a documented binary
format, capturing the sequence letters and qualities together with trimming
information. It would be nice to support reading (and in the longer term also
writing) these files directly with Bio.SeqIO.

See this thread for background:
http://lists.open-bio.org/pipermail/biopython/2009-April/005083.html



From bugzilla-daemon at portal.open-bio.org  Fri May 22 14:39:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 14:39:26 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200905221839.n4MIdQU5008555@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-22 14:39 EST -------
Created an attachment (id=1303)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1303&action=view)
Bio/SeqIO/RocheSffIO.py

This is a rough SeqIO parser constructing SeqRecord objects using a parser
contributed by Jose Blanca. Additional work would be required for paired end
reads - and even more work to be able to write out these files.

Potentially Jose's parser could be exposed as a public module under
Bio.Sequencing, but here it is just two private classes.



From biopython at maubp.freeserve.co.uk  Fri May 22 14:40:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 19:40:45 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
Message-ID: <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>

On Fri, Apr 17, 2009 at 12:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
>> Hi Peter:
>> Here you have some code to read the sff files.
>
> Thanks - I'm not sure when I'll get to look at this, maybe next week.
>
>> For the time being it creates a dict for the sequences. I'm not sure about
>> how to integrate the generated data in BioPython. The sequence and
>> qualities should go to a SeqRecord, but there is also the information
>> about the clipping.
>
> For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to
> be able to read and write SFF files, and to do that we'll have to record all
> the essential annotation (i.e. clipping) somehow.

I've had a look at your code this evening, and written a rough SeqIO
module using it, available here on enhancement Bug 2837,
http://bugzilla.open-bio.org/show_bug.cgi?id=2837

> Can you write SFF files?
>
>> For my work I use a kind of SeqRecord with a mask property and the
>> mask is a Location that shows which part of the sequence is ok. I don't
>> know if that's a valid model for BioPython.
>
> A mask could be done as a list of booleans, and we can treat it as
> another per-letter-annotation in the SeqRecord. I'm not sure if this
> is helpful or not.
>
> The Roche tools let you choose to extract trimmed reads as FASTA
> and QUAL, or untrimmed. Perhaps for reading SFF files with
> Bio.SeqIO we should get the user to choose between these
> options (e.g. format names "roche-sff" and "roche-sff-notrim")?

This would work...

> Roche's FASTA files use upper case for the trimmed region, and
> lower case for the start/end which would get trimmed off. This is
> simple and we could do this for Biopython too - meaning you'd get
> the same data if you read the SFF file directly, or used Roche's
> FASTA+QUAL files with SeqIO. Note that when reading an SFF
> file directly, we should probably record the real trim data as well.

In my current code, I decided to use the same quality trimming
representation that Roche use if converting the SFF file into FASTA
format (the leading and trailing trim regions are in lower case). We
may want to record the trim positions in the SeqRecord's annotation
as well.
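
To make the representation concrete, here is a small sketch of the case
convention (upper case for the kept region, lower case for the leading and
trailing trim regions). It is not the code attached to Bug 2837; the function
name mask_trimmed is made up for the example, and it assumes clip positions
are 1-based and inclusive, with zero meaning "no clip value set".

```python
def mask_trimmed(seq, clip_left, clip_right):
    """Return seq with the kept region in upper case and the trimmed
    ends in lower case.

    clip_left and clip_right are assumed to be 1-based inclusive
    positions; a value of 0 is treated as "not set".
    """
    n = len(seq)
    left = max(clip_left, 1)                      # 0 -> start of read
    right = min(clip_right if clip_right else n, n)  # 0 -> end of read
    return (
        seq[: left - 1].lower()
        + seq[left - 1 : right].upper()
        + seq[right:].lower()
    )


masked = mask_trimmed("TCAGGATTACAGTT", 5, 11)
untouched = mask_trimmed("ACGT", 0, 0)
```

Because the case change is reversible, the trim positions could in principle
be recovered from the sequence alone, but storing them in the annotation
avoids relying on that.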

>> There's also a couple of more tricks with the clipping.
>> In theory there's clip_qual and clip_adapter, but in the files
>> we've seen clip_adapter is always zero and clip_quality is used
>> instead for both quality and adapter. I think we could generate
>> one clipping combining both. Let me know what do you think.
>> Also take into account that in some cases the generated clipping
>> from the 454 software are just wrong.
>
> I'll need to learn more about the details before coming to any
> conclusions about how to deal with this information in Biopython.

Right now I have not looked at the left/right adaptor clipping information;
as you found, in the example file I have looked at, these fields are zero.
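
If the adaptor fields do turn up non-zero in other files, one way to fold both
clippings into a single pair of positions, as Jose suggested, might look like
the sketch below. The function name combined_clip is hypothetical, and the
rule used (take the innermost of the two clips, treating zero as unset) is my
reading of the SFF format description rather than something checked against
real data.

```python
def combined_clip(n_bases, qual_left, qual_right, adapter_left, adapter_right):
    """Merge quality and adaptor clips into one (left, right) pair.

    Positions are assumed to be 1-based and inclusive, and 0 is assumed
    to mean "no clip"; the innermost clip on each side wins.
    """
    left = max(1, qual_left, adapter_left)
    right = min(qual_right or n_bases, adapter_right or n_bases)
    return left, right


only_quality = combined_clip(100, 5, 90, 0, 0)
both_set = combined_clip(100, 5, 90, 8, 80)
nothing_set = combined_clip(100, 0, 0, 0, 0)
```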

Note I will be away for the next week, so am unlikely to respond to
any emails on this.

Peter


From bugzilla-daemon at portal.open-bio.org  Fri May 22 15:23:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 15:23:44 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200905221923.n4MJNiAe013574@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


spenthil at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |spenthil at gmail.com



From bugzilla-daemon at portal.open-bio.org  Fri May 22 17:16:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 17:16:07 -0400
Subject: [Biopython-dev] [Bug 2838] New: If a SeqRecord containing Genbank
	information is read from BioSQL,
	it cannot be written to another BioSQL database
Message-ID: <bug-2838-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2838

           Summary: If a SeqRecord containing Genbank information is read
                    from BioSQL, it cannot be written to another BioSQL
                    database
           Product: Biopython
           Version: 1.49
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: david.wyllie at ndm.ox.ac.uk


I've been trying to annotate some microbial sequences; some are from genbank.
So the proposed series of events was:
1) get sequences from genbank
2) store in BioSQL database called One
3) recover them from BioSQL
4) annotate the recovered SeqRecords [this works, but isn't necessary for this
problem to be reproduced - here, I'm making no changes at all to the SeqRecord]
5) store the annotated SeqRecords in a different BioSQL database called Two.

The problem is that Step 5 fails when the original record was recovered from
Genbank.

The traceback (below) indicates a problem with the BioSQL loader in 
_load_bioentry_date

Here is the screen output, including traceback.
The program (attached) first loads a record from Genbank,
writes it to One, recovers it from One; at this point it has changed, in
particular in the way date fields are represented.

 the entrez load has a /date feature which is not a list
 /date=26-MAY-2005
 while the reloaded version has two date fields
 /dates=['26-MAY-2005']
 /date=['26-MAY-2005']  

Whether this is relevant I'm not sure. 

The subsequent write of the recovered version to Two fails.
As a control, I've checked that the original version can be written to Two
successfully.
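
A possible workaround, untested against BioSQL itself and assuming the
list-valued date really is what upsets the loader, would be to turn
single-item lists back into plain strings before the second load. The sketch
works on an ordinary dictionary standing in for the record's annotations; the
function name flatten_single_item_lists is made up for the example.

```python
def flatten_single_item_lists(annotations, keys=("date",)):
    """Return a copy of annotations in which the named keys hold a plain
    value instead of a one-element list. Other entries are left alone."""
    cleaned = dict(annotations)
    for key in keys:
        value = cleaned.get(key)
        if isinstance(value, list) and len(value) == 1:
            cleaned[key] = value[0]
    return cleaned


# Annotation values copied from the record recovered from database One.
reloaded = {"date": ["26-MAY-2005"], "dates": ["26-MAY-2005"], "gi": "28804743"}
fixed = flatten_single_item_lists(reloaded)
```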

I'm a novice with Python and Biopython so please accept my apologies if there
is something obvious and very stupid responsible for this.

---------------------------------------------------------------------------
dwyllie at dwyllie:~/programs/Project/src$ python dbtestcase.py
OK, going to recover record 28804743  from genbank....
Record loaded looks like this:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference instance at 0x2190b90>,
<Bio.SeqFeature.Reference instance at 0x219a5a8>, <Bio.SeqFeature.Reference
instance at 0x219a5f0>, <Bio.SeqFeature.Reference instance at 0x219a6c8>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Load from Entrez completed, records= 1
Here is the loaded record:
========================================================================
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference instance at 0x2190b90>,
<Bio.SeqFeature.Reference instance at 0x219a5a8>, <Bio.SeqFeature.Reference
instance at 0x219a5f0>, <Bio.SeqFeature.Reference instance at 0x219a6c8>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Now loading these records into a BioSQL database One.
/var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning:
the sets module is deprecated
  from sets import ImmutableSet
Creating a new database  One
========================================================================
Load from database One completed, records= 1
========================================================================
Here is the record recovered from database One:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/dates=['26-MAY-2005']
/ncbi_taxid=3225
/date=['26-MAY-2005']
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Bryopsida',
'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon', 'Ceratodon purpureus']
/source=['chloroplast Ceratodon purpureus']
/references=[<Bio.SeqFeature.Reference instance at 0x235d9e0>,
<Bio.SeqFeature.Reference instance at 0x235db90>, <Bio.SeqFeature.Reference
instance at 0x235dcf8>, <Bio.SeqFeature.Reference instance at 0x235de60>]
/gi=28804743
/data_file_division=PLN
/keywords=['']
/organism=Ceratodon purpureus
/sequence_version=['1']
/accessions=['AB098727']
DBSeq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
DNAAlphabet())
========================================================================
Creating a new database  Two
Traceback (most recent call last):
  File "dbtestcase.py", line 206, in <module>
    from dbtestcase import AuthDetails
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 225, in
<module>
    DemonstrateProblem(problemgi,ad)
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 199, in
DemonstrateProblem
    db2.load(listtoload)
  File "/var/lib/python-support/python2.6/BioSQL/BioSeqDatabase.py", line 430,
in load
    db_loader.load_seqrecord(cur_record)
  File "/var/lib/python-support/python2.6/BioSQL/Loader.py", line 50, in
load_seqrecord
    self._load_bioentry_date(record, bioentry_id)
  File "/var/lib/python-support/python2.6/BioSQL/Loader.py", line 577, in
_load_bioentry_date
    self.adaptor.execute(sql, (bioentry_id, date_id, date))
  File "/var/lib/python-support/python2.6/BioSQL/BioSeqDatabase.py", line 289,
in execute
    self.cursor.execute(sql, args or ())
  File "/var/lib/python-support/python2.6/MySQLdb/cursors.py", line 166, in
execute
    self.errorhandler(self, exc, value)
  File "/var/lib/python-support/python2.6/MySQLdb/connections.py", line 35, in
defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL
syntax; check the manual that corresponds to your MySQL server version for the
right syntax to use near '), 1)' at line 1")
dwyllie at dwyllie:~/programs/Project/src$



From bugzilla-daemon at portal.open-bio.org  Fri May 22 17:19:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 17:19:03 -0400
Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank
	information is read from BioSQL,
	it cannot be written to another BioSQL database
In-Reply-To: <bug-2838-42@http.bugzilla.open-bio.org/>
Message-ID: <200905222119.n4MLJ3d3026350@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2838


------- Comment #1 from david.wyllie at ndm.ox.ac.uk  2009-05-22 17:19 EST -------
Created an attachment (id=1304)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1304&action=view)
A python script which reproduces the error.



From bugzilla-daemon at portal.open-bio.org  Fri May 22 18:46:04 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 18:46:04 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905222246.n4MMk4QO000548@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |2839



From biopython at maubp.freeserve.co.uk  Fri May 22 18:46:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 23:46:54 +0100
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
	<320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
	<410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>
Message-ID: <320fb6e00905221546i26edc7a2u2a02fb0d01c374ea@mail.gmail.com>

On 5/22/09, Hilmar Lapp <hlapp at gmx.net> wrote:
> Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar

I've filed Bug 2839, hopefully this is what you had in mind:
http://bugzilla.open-bio.org/show_bug.cgi?id=2839

Peter

From chapmanb at 50mail.com  Fri May 22 18:54:32 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 22 May 2009 18:54:32 -0400
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
Message-ID: <20090522225432.GU84112@sobchak.mgh.harvard.edu>

Peter and Jose;
I haven't used SFF files myself as we don't have a 454 machine, but
I do know of a couple of implementations of SFF to FASTQ/FASTA.
Flower is a Haskell implementation:

http://blog.malde.org/index.php/flower/

And PyroBayes is a 454 base caller:

http://bioinformatics.bc.edu/marthlab/PyroBayes

Depending on what you all end up doing, these might be useful as
comparison points, or for wrapping with Application command lines.

Brad

> On Fri, Apr 17, 2009 at 12:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> > On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> >> Hi Peter:
> >> Here you have some code to read the sff files.
> >
> > Thanks - I'm not sure when I'll get to look at this, maybe next week.
> >
> >> For the time being it creates a dict for the sequences. I'm not sure about
> >> how to integrate the generated data in BioPython. The sequence and
> >> qualities should go to a SeqRecord, but there is also the information
> >> about the clipping.
> >
> > For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to
> > be able to read and write SFF files, and to do that we'll have to record all
> > the essential annotation (i.e. clipping) somehow.
> 
> I've had a look at your code this evening, and written a rough SeqIO
> module using it, available here on enhancement Bug 2837,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2837
> 
> > Can you write SFF files?
> >
> >> For my work I use a kind of SeqRecord with a mask property and the
> >> mask is a Location that shows which part of the sequence is ok. I don't
> >> know if that's a valid model for BioPython.
> >
> > A mask could be done as a list of booleans, and we can treat it as
> > another per-letter-annotation in the SeqRecord. I'm not sure if this
> > is helpful or not.
> >
> > The Roche tools let you choose to extract trimmed reads as FASTA
> > and QUAL, or untrimmed. Perhaps for reading SFF files with
> > Bio.SeqIO we should get the user to choose between these
> > options (e.g. format names "roche-sff" and "roche-sff-notrim")?
> 
> This would work...
> 
> > Roche's FASTA files use upper case for the trimmed region, and
> > lower case for the start/end which would get trimmed off. This is
> > simple and we could do this for Biopython too - meaning you'd get
> > the same data if you read the SFF file directly, or used Roche's
> > FASTA+QUAL files with SeqIO. Note that when reading an SFF
> > file directly, we should probably record the real trim data as well.
> 
> In my current code, I decided to use the same quality trimming
> representation that Roche use if converting the SFF file into FASTA
> format (the leading and trailing trim regions are in lower case). We
> may want to record the trim positions in the SeqRecord's annotation
> as well.
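
A minimal sketch of the lower-case trim convention and the boolean mask
idea discussed above. It is plain Python rather than Biopython code, and
the function names and the 0-based, half-open clip positions are
assumptions made for illustration only:

```python
def mark_trimmed(seq, keep_start, keep_end):
    """Upper-case the region to keep and lower-case the clipped ends.

    keep_start and keep_end are hypothetical 0-based, half-open bounds
    (Python slice style), not raw values from an SFF read header.
    """
    head = seq[:keep_start].lower()
    kept = seq[keep_start:keep_end].upper()
    tail = seq[keep_end:].lower()
    return head + kept + tail


def trim_mask(seq, keep_start, keep_end):
    """Return one boolean per letter: True where the base is kept."""
    return [keep_start <= i < keep_end for i in range(len(seq))]


read = "TCAGGGTACCATGCAACGT"
marked = mark_trimmed(read, 4, 15)
mask = trim_mask(read, 4, 15)
```

Either form carries the same information; the mask would fit a
per-letter-annotation, while the mixed case string matches what the
Roche FASTA output looks like.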
> 
> >> There are also a couple more tricks with the clipping.
> >> In theory there's clip_qual and clip_adapter, but in the files
> >> we've seen clip_adapter is always zero and clip_quality is used
> >> instead for both quality and adapter. I think we could generate
> >> one clipping combining both. Let me know what you think.
> >> Also take into account that in some cases the clipping generated
> >> by the 454 software is just wrong.
> >
> > I'll need to learn more about the details before coming to any
> > conclusions about how to deal with this information in Biopython.
> 
> Right now I have not looked at the left/right adaptor clipping information;
> as you found, these fields are zero in the example file I have looked at.
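
One way the quality and adapter clips could be combined into a single
clipping, as suggested above. This is only a sketch: it assumes 1-based
inclusive positions and that a value of zero means "not set", which is
how I read the published SFF description. The function name is made up.

```python
def combined_clip(number_of_bases, clip_qual_left, clip_qual_right,
                  clip_adapter_left, clip_adapter_right):
    """Merge quality and adapter clips into one 1-based inclusive range.

    A zero is treated as "no clip on this side" (an assumption), so a
    missing left clip becomes 1 and a missing right clip becomes the
    read length.
    """
    first = max(1, clip_qual_left, clip_adapter_left)
    last = min(clip_qual_right or number_of_bases,
               clip_adapter_right or number_of_bases)
    return first, last


# Adapter fields all zero, as in the files mentioned in this thread.
quality_only = combined_clip(100, 5, 90, 0, 0)
# Nothing set at all: the whole read is kept.
no_clipping = combined_clip(100, 0, 0, 0, 0)
```

This always takes the more conservative of the two clips on each side,
so it does nothing about the cases where the 454 software gets the
clipping wrong.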
> 
> Note I will be away for the next week, so am unlikely to respond to
> any emails on this.
> 
> Peter
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From bugzilla-daemon at portal.open-bio.org  Fri May 22 18:58:24 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 18:58:24 -0400
Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank
	information is read from BioSQL,
	it cannot be written to another BioSQL database
In-Reply-To: <bug-2838-42@http.bugzilla.open-bio.org/>
Message-ID: <200905222258.n4MMwOXA001311@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2838


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-22 18:58 EST -------
(In reply to comment #0)
> I've been trying to annotate some microbial sequences; some are from genbank.
> So the proposed series of events was:
> 1) get sequences from genbank
> 2) store in BioSQL database called One
> 3) recover them from BioSql
> 4) annotate the recovered SeqRecords [this works, but isn't
>    necessary for this problem to be reproduced - here, I'm
>    making no changes at all to the SeqRecord]
> 5) store the annotated SeqRecords in a different BioSQL database called Two.
> 
> The problem is that Step 5 fails when the original record was recovered from
> Genbank.
> 
> The traceback (below) indicates a problem with the BioSQL loader in 
> _load_bioentry_date
> ...
> I'm a novice with Python and Biopython so please accept my apologies if
> there is something obvious and very stupid responsible for this.

What you are trying to do sounds very reasonable (although I have never
actually needed to or tried to do this myself). You were right about the date
thing, the loader code only expected a string, not a list. Fixed in CVS
revision 1.40 of BioSQL/Loader.py, and I have also added a unit test for this
use case in Tests/test_BioSQL.py revision 1.36.
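
As a rough illustration of the string versus list issue (a standalone
sketch, not the actual loader code; the helper name is hypothetical):

```python
from time import gmtime, strftime


def effective_date(annotations):
    """Return a single GenBank style date string such as 14-SEP-2000.

    The "date" annotation may be a plain string or a list of strings;
    if it is missing, today's date is used instead.
    """
    date = annotations.get("date", strftime("%d-%b-%Y", gmtime()).upper())
    if isinstance(date, list):
        # Only the first entry is used in this sketch.
        date = date[0]
    return date


from_string = effective_date({"date": "14-SEP-2000"})
from_list = effective_date({"date": ["26-MAY-2005"]})
```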

Note there is a known minor discrepancy with dates (see Bug 2681) when
comparing the original SeqRecord to the DBSeqRecord after loading/retrieving
from BioSQL.

If you could confirm this solves your problem, I think we can close this bug.
Thank you!

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk  Fri May 22 19:09:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 23 May 2009 00:09:56 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <20090522225432.GU84112@sobchak.mgh.harvard.edu>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com>

On 5/22/09, Brad Chapman <chapmanb at 50mail.com> wrote:
> Peter and Jose;
>  I haven't used SFF files myself as we don't have a 454 machine,

We don't have one in house either, and have instead out-sourced to a
couple of sequencing centres in the UK with 454 machines.

>  but do know of a couple of implementations of SFF TO
>  Fastq/Fasta.
>  Flower is a Haskell implementation:
>
>  http://blog.malde.org/index.php/flower/
>
>  And PyroBayes is a 454 base caller:
>
>  http://bioinformatics.bc.edu/marthlab/PyroBayes
>
>  Depending on what you all end up doing, these might be useful as
>  comparison points, or for wrapping with Application command lines.

I would say Roche's own tools are the best reference, but these only
output FASTA and QUAL, not FASTQ files (at the moment at least). So
yes, being able to compare a Biopython SFF to FASTQ conversion with
that by Flower (or anything else) would be handy.
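
For comparison purposes, turning one FASTA plus QUAL entry into a FASTQ
record is simple enough to sketch in plain Python. This assumes the
Sanger style FASTQ encoding (PHRED score plus 33 as an ASCII character);
the function name is made up for this example:

```python
def fastq_record(name, seq, quals):
    """Format one read as a four line Sanger style FASTQ record.

    quals is a list of integer PHRED scores, one per base, as found in
    a QUAL file.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    # Sanger FASTQ stores each score as chr(score + 33).
    encoded = "".join(chr(q + 33) for q in quals)
    return "@%s\n%s\n+\n%s\n" % (name, seq, encoded)


record = fastq_record("read1", "ACGT", [40, 30, 20, 10])
```

The output of two converters could then be compared record by record.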

Peter

From spenthil at gmail.com  Fri May 22 19:52:30 2009
From: spenthil at gmail.com (Senthil Palanisami)
Date: Fri, 22 May 2009 16:52:30 -0700
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> 
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> 
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> 
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
	<320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com>
Message-ID: <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com>

I have been working with SFF files for the past month, and can say it's
definitely frustrating working with custom binary formats.

Take a look at sff_extract which is written in python. It converts sff files
into fasta and xml or caf files:
http://bioinf.comav.upv.es/sff_extract/index.html

You can find detailed specs of the format @
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global


--
Senthil Palanisami
http://spenthil.com


On Fri, May 22, 2009 at 4:09 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On 5/22/09, Brad Chapman <chapmanb at 50mail.com> wrote:
> > Peter and Jose;
> >  I haven't used SFF files myself as we don't have a 454 machine,
>
> We don't have one in house either, and have instead out-sourced to a
> couple of sequencing centres in the UK with 454 machines.
>
> >  but do know of a couple of implementations of SFF TO
> >  Fastq/Fasta.
> >  Flower is a Haskell implementation:
> >
> >  http://blog.malde.org/index.php/flower/
> >
> >  And PyroBayes is a 454 base caller:
> >
> >  http://bioinformatics.bc.edu/marthlab/PyroBayes
> >
> >  Depending on what you all end up doing, these might be useful as
> >  comparison points, or for wrapping with Application command lines.
>
> I would say Roche's own tools are the best reference, but these only
> output FASTA and QUAL, not FASTQ files (at the moment at least). So
> yes, being able to compare a Biopython SFF to FASTQ conversion with
> that by Flower (or anything else) would be handy.
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From biopython at maubp.freeserve.co.uk  Fri May 22 20:10:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 23 May 2009 01:10:57 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
	<320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com>
	<21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com>
Message-ID: <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>

On 5/23/09, Senthil Palanisami <spenthil at gmail.com> wrote:
> I have been working with SFF files for the past month, and can say it's
>  definitely frustrating working with custom binary formats.

At least in this case it is publicly documented. Have you needed to
write out (or edit) an SFF file yet? Have you used any paired end
reads in SFF format?

>  Take a look at sff_extract which is written in python. It converts sff files
>  into fasta and xml or caf files:
>  http://bioinf.comav.upv.es/sff_extract/index.html

That is what this code is based on - Jose Blanca is one of the authors
of  sff_extract.

>  You can find detailed specs of the format @
>  http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global

I think you must have missed this thread last month ;)
http://lists.open-bio.org/pipermail/biopython/2009-April/005084.html

Peter

From bugzilla-daemon at portal.open-bio.org  Fri May 22 21:16:54 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 21:16:54 -0400
Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank
	information is read from BioSQL,
	it cannot be written to another BioSQL database
In-Reply-To: <bug-2838-42@http.bugzilla.open-bio.org/>
Message-ID: <200905230116.n4N1GsRl010917@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2838


------- Comment #3 from david.wyllie at ndm.ox.ac.uk  2009-05-22 21:16 EST -------
Thank you!

Unfortunately I'm not sure it's fixed, or maybe there is another problem:

I have uninstalled the BioPython package using Synaptic package manager
(previously I was using 1.49), and downloaded a CVS checkout.

Thanks for your message
http://osdir.com/ml/python.bio.general/2008-07/msg00035.html
I can confirm that the default ubuntu 9.0 install lacks the python-dev package,
with the necessary Python.h headers. 

After python-dev is installed, 
build is OK, 
Tests pass
running test
test_Ace ... ok
test_AlignIO ... ok
test_BioSQL ... /var/lib/python-support/python2.6/MySQLdb/__init__.py:34:
DeprecationWarning: the sets module is deprecated
  from sets import ImmutableSet
/home/dwyllie/biopython/build/lib.linux-x86_64-2.6/BioSQL/BioSeqDatabase.py:144:
Warning: 'TYPE=storage_engine' is deprecated; use 'ENGINE=storage_engine'
instead
  self.adaptor.cursor.execute(sql_line)
ok
test_BioSQL_SeqIO ... ok
test_CAPS ... ok
test_Clustalw ... ok
..

and install is OK too.  This is all new to me but it seems to work OK.

I have checked the source code and I think your patch is correctly in
place:

    def _load_bioentry_date(self, record, bioentry_id):
        """Add the effective date of the entry into the database.

        record - a SeqRecord object with an annotated date
        bioentry_id - corresponding database identifier
        """
        # dates are GenBank style, like:
        # 14-SEP-2000
        date = record.annotations.get("date",
                                      strftime("%d-%b-%Y", gmtime()).upper())
        if isinstance(date, list) : date = date[0]
        annotation_tags_id = self._get_ontology_id("Annotation Tags")
        date_id = self._get_term_id("date_changed", annotation_tags_id)
        sql = r"INSERT INTO bioentry_qualifier_value" \
              r" (bioentry_id, term_id, value, rank)" \
              r" VALUES (%s, %s, %s, 1)" 
        self.adaptor.execute(sql, (bioentry_id, date_id, date))


Now when I re-run dbtestcase.py (attached previously) I get a different error
message.

dwyllie at dwyllie:~/programs/CheckleyProject/src$ python dbtestcase.py
OK, going to recover record 28804743  from genbank....
Record loaded looks like this:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference object at 0x26e7a10>,
<Bio.SeqFeature.Reference object at 0x26e7a90>, <Bio.SeqFeature.Reference
object at 0x26e7b50>, <Bio.SeqFeature.Reference object at 0x26e7bd0>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Load from Entrez completed, records= 1
Here is the loaded record:
========================================================================
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference object at 0x26e7a10>,
<Bio.SeqFeature.Reference object at 0x26e7a90>, <Bio.SeqFeature.Reference
object at 0x26e7b50>, <Bio.SeqFeature.Reference object at 0x26e7bd0>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Now loading these records into a BioSQL database One.
/var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning:
the sets module is deprecated
  from sets import ImmutableSet
Creating a new database  One
========================================================================
Load from database One completed, records= 1
========================================================================
Here is the record recovered from database One:
Traceback (most recent call last):
  File "dbtestcase.py", line 165, in <module>
    from dbtestcase import AuthDetails
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 182, in
<module>
    DemonstrateProblem(problemgi,ad)
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 138, in
DemonstrateProblem
    print recordrecovered
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 489, in
__str__
    if self.letter_annotations :
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 165, in
<lambda>
    fget=lambda self : self._per_letter_annotations,
AttributeError: 'DBSeqRecord' object has no attribute '_per_letter_annotations'
dwyllie at dwyllie:~/programs/CheckleyProject/src$ 


Have I failed to install something?
Unfortunately, I wasn't running off CVS before your change.

Best wishes
d


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From spenthil at gmail.com  Fri May 22 21:48:24 2009
From: spenthil at gmail.com (Senthil Palanisami)
Date: Fri, 22 May 2009 18:48:24 -0700
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> 
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> 
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> 
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
	<320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> 
	<21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> 
	<320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>
Message-ID: <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>

Sorry, I only recently joined this list - should have gone through the
archives first.

I have done some minimal SFF tweaking, but only by first converting them to
CA format.

No paired end reads yet, but I do know my PI wants me to start looking at
some in the next month or two.

--
Senthil Palanisami
http://spenthil.com


On Fri, May 22, 2009 at 5:10 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On 5/23/09, Senthil Palanisami <spenthil at gmail.com> wrote:
> > I have been working with SFF files for the past month, and can say it's
> >  definitely frustrating working with custom binary formats.
>
> At least in this case it is publicly documented. Have you needed to
> write out (or edit) an SFF file yet? Have you used any paired end
> reads in SFF format?
>
> >  Take a look at sff_extract which is written in python. It converts sff files
> >  into fasta and xml or caf files:
> >  http://bioinf.comav.upv.es/sff_extract/index.html
>
> That is what this code is based on - Jose Blanca is one of the authors
> of  sff_extract.
>
> >  You can find detailed specs of the format @
> >  http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global
>
> I think you must have missed this thread last month ;)
> http://lists.open-bio.org/pipermail/biopython/2009-April/005084.html
>
> Peter
>

From biopython at maubp.freeserve.co.uk  Sat May 23 07:28:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 23 May 2009 12:28:36 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
	<320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com>
	<21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com>
	<320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>
	<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
Message-ID: <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>

On Sat, May 23, 2009 at 2:48 AM, Senthil Palanisami <spenthil at gmail.com> wrote:
> Sorry, I only recently joined this list - should have gone through the
> archives first.

Don't worry - and if I sounded grumpy, sorry - I was up late last night.

> I have done some minimal SFF tweaking, but only by first converting them
> to CA format.

What do you mean by CA format? I don't recall seeing that abbreviation
before.

> No paired end reads yet, but I do know my PI wants me to start looking
> at some in the next month or two.

I haven't had any paired end 454 reads to work with personally, but I'm
sure there are some examples available online somewhere.

Peter

From bugzilla-daemon at portal.open-bio.org  Sat May 23 07:49:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 23 May 2009 07:49:18 -0400
Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank
	information is read from BioSQL,
	it cannot be written to another BioSQL database
In-Reply-To: <bug-2838-42@http.bugzilla.open-bio.org/>
Message-ID: <200905231149.n4NBnIEQ023192@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2838


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-23 07:49 EST -------
(In reply to comment #3)
> Thank you!
> 
> Unfortunately I'm not sure it's fixed, or maybe there is another problem:
> ...
> Now when I re-run dbtestcase.py (attached previously) I get a different error
> message.
> ...
> Traceback (most recent call last):
> ...
>   File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 489, in
> __str__
>     if self.letter_annotations :
>   File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 165, in
> <lambda>
>     fget=lambda self : self._per_letter_annotations,
> AttributeError: 'DBSeqRecord' object has no attribute '_per_letter_annotations'
> dwyllie at dwyllie:~/programs/CheckleyProject/src$ 
> 
> 
> Have I failed to install something?

No - everything looks OK, and the deprecation warnings are known about and not
in Biopython anyway.

> Unfortunately, I wasn't running off CVS before your change.

The original problem is fixed. However, you've found a new bug in the __str__
method for the DBSeqRecord related to the fact there is no
per-letter-annotation (this would have been introduced in Biopython 1.50 when I
added the letter_annotations dictionary to the SeqRecord class). I'm a little
surprised that our unit tests didn't catch this - but it's fixed now:

Tests/test_BioSQL.py CVS revision 1.37
BioSQL/BioSeq.py CVS revision 1.36

Note BioSQL doesn't yet support recording anything more complicated than
strings, although we've started talking about using XML or JSON for this. As a
result, Biopython does not attempt to record any per-letter-annotation in the
BioSQL database. With the fix the DBSeqRecord now has an empty
per-letter-annotation dictionary. Before it didn't, hence the AttributeError.
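
A minimal reproduction of this kind of failure, using made up stand-in
classes rather than the real SeqRecord and DBSeqRecord. It assumes the
subclass sets up its own attributes without running the parent
initialiser, which is a guess at the mechanism for illustration:

```python
class Record(object):
    """Stand-in for a record whose property reads a private attribute."""

    def __init__(self, seq):
        self.seq = seq
        self._per_letter_annotations = {}

    # Same style of property as in the traceback quoted above.
    letter_annotations = property(
        fget=lambda self: self._per_letter_annotations)


class LazyRecord(Record):
    """Stand-in for a database backed subclass (hypothetical)."""

    def __init__(self, seq):
        # Record.__init__ is not called, so the private attribute that
        # the inherited property relies on is never created.
        self.seq = seq


class FixedLazyRecord(LazyRecord):
    """Same subclass, but with an empty per-letter-annotation dict."""

    def __init__(self, seq):
        LazyRecord.__init__(self, seq)
        self._per_letter_annotations = {}


try:
    LazyRecord("ACGT").letter_annotations
    failed = False
except AttributeError:
    failed = True

fixed_value = FixedLazyRecord("ACGT").letter_annotations
```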

Hopefully you won't find any more issues, but if you do, please file another
bug - I'm marking this one as fixed.

Thanks for your report and time David,

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From spenthil at gmail.com  Sat May 23 12:11:22 2009
From: spenthil at gmail.com (Senthil Palanisami)
Date: Sat, 23 May 2009 09:11:22 -0700
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<200904171246.46568.jblanca@btc.upv.es> 
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> 
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> 
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
	<320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> 
	<21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> 
	<320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> 
	<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> 
	<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
Message-ID: <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>

You didn't sound particularly grumpy, I am just aware of the annoyances
related to people too lazy to do a quick search through a mailing list
before spamming.

I pulled 'CA' straight out of a wgs assembler program:
http://apps.sourceforge.net/mediawiki/wgs-assembler/index.php?title=Formatting_Inputs#sffToCA
I think 'frg' is the real file format name.

--
Senthil Palanisami
http://spenthil.com


On Sat, May 23, 2009 at 4:28 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Sat, May 23, 2009 at 2:48 AM, Senthil Palanisami <spenthil at gmail.com> wrote:
> > Sorry, I only recently joined this list - should have gone through the
> > archives first.
>
> Don't worry - and if I sounded grumpy, sorry - I was up late last night.
>
> > I have done some minimal SFF tweaking, but only by first converting them
> > to CA format.
>
> What do you mean by CA format? I don't recall seeing that abbreviation
> before.
>
> > No paired end reads yet, but I do know my PI wants me to start looking
> > at some in the next month or two.
>
> I haven't had any paired end 454 reads to work with personally, but I'm
> sure there are some examples available online somewhere.
>
> Peter
>

From mjldehoon at yahoo.com  Sun May 24 00:10:28 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 23 May 2009 21:10:28 -0700 (PDT)
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
Message-ID: <867081.50034.qm@web62404.mail.re1.yahoo.com>


I suggest that for the short term, we store the DE lines as one string in the same way as Bioperl 1.5 and 1.6, until we decide on a more advanced way to treat these lines. Currently Bio.SeqIO and Bio.SwissProt use different ways to handle the DE lines, and neither of them agrees with Bioperl.

--Michiel.


--- On Mon, 5/18/09, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL
> To: "Hilmar Lapp" <hlapp at gmx.net>
> Cc: "Chris Fields" <cjfields at illinois.edu>, "BioPerl List" <bioperl-l at lists.open-bio.org>, "biosql-l" <biosql-l at lists.open-bio.org>, biopython-dev at biopython.org
> Date: Monday, May 18, 2009, 9:38 AM
> On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
> >
> > On May 17, 2009, at 8:40 AM, Peter wrote:
> >>
> >> [...] Here you have mapped RecName and AltName fields in the DE lines to
> >> Name and Synonyms (shouldn't that be Synonym singular?).
> >
> > The example is for the GN lines in SwissProt, not the DE lines.
> 
> Ah, that probably explains some of my confusion.
> 
> >> In this example, searching the database using one of the SwissProt
> >> AltNames (synonyms), or filtering on the Flags sounds like a
> >> reasonable request - but this would be very difficult if the data is
> >> stored inside XML strings.
> >
> > Actually no. Modern full-text indexers (inside or outside the database) can
> > index XML text columns right away and very well. In fact, for the last
> > project that I built a full-text search for (on top of a BioSQL database) I
> > did that by writing custom XML documents to a separate table for each
> > record I wanted indexed. Oracle's full text indexer did the rest. I also built a
> > separate identifier/name/accession index that pulled all the gene names,
> > symbols, accession numbers, identifiers etc into a single table for
> > indexing.
> 
> OK, when I said searching "would be very difficult if the data is
> stored inside XML strings", maybe it wasn't so difficult for you - but
> that still sounds complicated!
> 
> Sticking with the GN lines and the synonym, if this was stored as a
> simple tag/value as usual in BioSQL, I would write my SQL statement to
> search the annotation table where the term id was that associated with
> a GN synonym, and the annotation value was "HABP1". Simple.
> 
> Using the XML approach, are you suggesting you could do a
> full text
> search on the annotation value field, looking for any rows
> where the
> field contains "<Synonyms>HABP1</Synonyms>",
> where the term id matches
> the GN lines' XML string? This sounds simplistic and
> probably rather
> slow - presumably why you resorted to the more complicated
> indexing
> scheme described above?
> 
> > What I mean is, a fully normalized relational
> representation, especially if
> > nested, is often not the most efficient data structure
> for efficient
> > searching and filtering.
> 
> OK.? But do we really need to worry about complex
> nested structures
> for the SwissProt annotation (or in general)?
> 
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
> 


From biopython at maubp.freeserve.co.uk  Sun May 24 06:42:14 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 24 May 2009 11:42:14 +0100
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <867081.50034.qm@web62404.mail.re1.yahoo.com>
References: <867081.50034.qm@web62404.mail.re1.yahoo.com>
Message-ID: <320fb6e00905240342t7d59f783t8203cce581256f88@mail.gmail.com>

On Sun, May 24, 2009 at 5:10 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> I suggest that for the short term, we store the DE lines as one
> string in the same way as Bioperl 1.5 and 1.6, until we decide
> on a more advanced way to treat these lines.

Agreed.

> Currently Bio.SeqIO and Bio.SwissProt use different ways to
> handle the DE lines, and neither of them agrees with Bioperl.

Well, Bio.SeqIO agrees with BioPerl modulo the white space -
but we might as well agree with the current BioPerl behaviour
until something is settled for storing more complex objects
than strings in BioSQL.

As I mentioned earlier, I'll be away for this week, so feel free
to press ahead with this.

Peter

From bugzilla-daemon at portal.open-bio.org  Mon May 25 14:21:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 14:21:26 -0400
Subject: [Biopython-dev] [Bug 2840] New: When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord fails in _load_reference
Message-ID: <bug-2840-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840

           Summary: When a record has been loaded from BioSQL, trying to
                    save it to another database fails with loader
                    db_loader.load_seqrecord fails in _load_reference
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: david.wyllie at ndm.ox.ac.uk


Hi

I have been trying to load SeqRecords from BioSQL, annotate them, and then
write them to a different BioSQL database.  Reloading the record to the second
database fails.  This isn't to do with annotation - none is performed.

This issue is different from #2838, which has been addressed (thank you).

The sequence of events is
1) eFetch a SeqRecord from Genbank (succeeds)
2) write to BioSQL (succeeds)
3) recover from BioSQL (succeeds)
4) write to BioSQL (fails, although no modifications have been made).

The current problem seems related to references:
Loader.load_seqrecord._load_reference.
Error says:

_load_reference
    start = 1 + int(str(reference.location[0].start))
ValueError: invalid literal for int() with base 10: 'None'

Testing has been done on Ubuntu 9 x64 with Python 2.6 (debian package),
python-dev (debian package), load from CVS as of 24.5.09, and a testcase
program, dbtestcase.py, attached to the now fixed bug #2838.

To run dbtestcase.py, the MySQL details will have to be altered on the line
beginning
ad=AuthDetails(...
but otherwise I think it should run.

Traceback and program output from dbtestcase.py follow.
dwyllie at dwyllie:~/programs/CheckleyProject/src$ python dbtestcase.py
OK, going to recover record 28804743  from genbank....
Record loaded looks like this:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference object at 0x2524a10>,
<Bio.SeqFeature.Reference object at 0x2524a90>, <Bio.SeqFeature.Reference
object at 0x2524b50>, <Bio.SeqFeature.Reference object at 0x2524bd0>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Load from Entrez completed, records= 1
Here is the loaded record:
========================================================================
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference object at 0x2524a10>,
<Bio.SeqFeature.Reference object at 0x2524a90>, <Bio.SeqFeature.Reference
object at 0x2524b50>, <Bio.SeqFeature.Reference object at 0x2524bd0>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Now loading these records into a BioSQL database One.
/var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning:
the sets module is deprecated
  from sets import ImmutableSet
Creating a new database  One
========================================================================
Load from database One completed, records= 1
========================================================================
Here is the record recovered from database One:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/dates=['26-MAY-2005']
/ncbi_taxid=3225
/date=['26-MAY-2005']
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Bryopsida',
'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon', 'Ceratodon purpureus']
/source=['chloroplast Ceratodon purpureus']
/references=[<Bio.SeqFeature.Reference object at 0x269e710>,
<Bio.SeqFeature.Reference object at 0x269e810>, <Bio.SeqFeature.Reference
object at 0x269e910>, <Bio.SeqFeature.Reference object at 0x269ea10>]
/gi=28804743
/data_file_division=PLN
/keywords=['']
/organism=Ceratodon purpureus
/sequence_version=['1']
/accessions=['AB098727']
DBSeq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
DNAAlphabet())
========================================================================
Creating a new database  Two
Traceback (most recent call last):
  File "dbtestcase.py", line 165, in <module>
    from dbtestcase import AuthDetails
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 182, in
<module>
    DemonstrateProblem(problemgi,ad)
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 158, in
DemonstrateProblem
    db2.load(listtoload)
  File "/usr/local/lib/python2.6/dist-packages/BioSQL/BioSeqDatabase.py", line
442, in load
    db_loader.load_seqrecord(cur_record)
  File "/usr/local/lib/python2.6/dist-packages/BioSQL/Loader.py", line 57, in
load_seqrecord
    self._load_reference(reference, rank, bioentry_id)
  File "/usr/local/lib/python2.6/dist-packages/BioSQL/Loader.py", line 733, in
_load_reference
    start = 1 + int(str(reference.location[0].start))
ValueError: invalid literal for int() with base 10: 'None'
dwyllie at dwyllie:~/programs/CheckleyProject/src$


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Mon May 25 14:23:52 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 14:23:52 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905251823.n4PINq60005295@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


david.wyllie at ndm.ox.ac.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|When a record has been      |When a record has been
                   |loaded from BioSQL, trying  |loaded from BioSQL, trying
                   |to save it to another       |to save it to another
                   |database fails with loader  |database fails with loader
                   |db_loader.load_seqrecord    |db_loader.load_seqrecord in
                   |fails in _load_reference    |_load_reference



From bugzilla-daemon at portal.open-bio.org  Mon May 25 18:23:20 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 18:23:20 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905252223.n4PMNKL7023601@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


------- Comment #1 from david.wyllie at ndm.ox.ac.uk  2009-05-25 18:23 EST -------
I have modified the dbtestcase.py script to show the contents of the reference
of the record downloaded from genbank, and from the record recovered from
BioSQL.

Here is a print out of the last two references before saving to BioSQL:

authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M.
title: Molecular evidence of an rpoA gene in the basal moss chloroplast
genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses
journal: Hikobia 14, 171-175 (2004)
medline id: 
pubmed id: 
comment: 

location: [0:789]
authors: Sugita,M.
title: Direct Submission
journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for
Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan
(E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080),
Fax:81-52-789-3080)
medline id: 
pubmed id: 
comment: 

--- note: no location in the first one; only a location in the last reference
(why? - should references have a location?  I suppose they might, if they
referred to a part of a chromosome?)

Now, after saving to BioSQL and recovering, all the records have a location,
but in some cases, it is [None:None]; here are the same two records.

location: [None:None]
authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M.
title: Molecular evidence of an rpoA gene in the basal moss chloroplast
genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses
journal: Hikobia 14, 171-175 (2004)
medline id: 
pubmed id: 
comment: 

location: [0:789]
authors: Sugita,M.
title: Direct Submission
journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for
Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan
(E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080),
Fax:81-52-789-3080)
medline id: 
pubmed id: 
comment: 


After this, the db.load method calls _load_reference.  

I think the problem is that the last statement doesn't cope with None values.
If one edits _load_reference to put that last statement inside a test for the
null condition:
        if (start is not None and end is not None):
            sql = "INSERT INTO bioentry_reference (bioentry_id, reference_id," \
                  " start_pos, end_pos, rank)" \
                  " VALUES (%s, %s, %s, %s, %s)"
            self.adaptor.execute(sql, (bioentry_id, reference_id,
                                       start, end, rank + 1))

Then the problem is solved, but I'm not sure how this fits in the bigger scheme
of things.
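
For illustration, here is a minimal sketch of why the expression in the
traceback fails for a [None:None] location, and how a None-aware conversion
would behave. The helper name to_one_based is hypothetical and is not part of
BioSQL/Loader.py; it only mimics the "1 + int(str(...))" expression shown in
the error above.

```python
def to_one_based(position):
    """Convert a 0-based start position to a 1-based value for the database.

    Hypothetical helper: returns None (SQL NULL) when no position is set,
    instead of passing the string 'None' to int().
    """
    if position is None:
        return None
    return 1 + int(str(position))


# The expression from the traceback fails when the start is None:
try:
    1 + int(str(None))
except ValueError as err:
    message = str(err)

print(to_one_based(0))     # a real location starting at position 0
print(to_one_based(None))  # a reference without a location
```

Whether a missing location is better stored as NULL or skipped entirely is a
design question for the loader; the sketch only shows the conversion step.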

d



From bugzilla-daemon at portal.open-bio.org  Mon May 25 18:26:21 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 18:26:21 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905252226.n4PMQK9o023893@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


------- Comment #2 from david.wyllie at ndm.ox.ac.uk  2009-05-25 18:26 EST -------
Created an attachment (id=1305)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1305&action=view)
A program which tests for the problem.  Alter the ad=AuthDetails line to
include MySQl passwords for your system; using root and no password in the
script as is.



From bugzilla-daemon at portal.open-bio.org  Mon May 25 20:14:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 20:14:40 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905260014.n4Q0EeBh030704@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


------- Comment #3 from cymon.cox at gmail.com  2009-05-25 20:14 EST -------
(In reply to comment #1)
> I have modified the dbtestcase.py script to show the contents of the reference
> of the record downloaded from genbank, and from the record recovered from
> BioSQL.
> 
> Here is a print out of the last two references before saving to BioSQL:
> 
> authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M.
> title: Molecular evidence of an rpoA gene in the basal moss chloroplast
> genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses
> journal: Hikobia 14, 171-175 (2004)
> medline id: 
> pubmed id: 
> comment: 
> 
> location: [0:789]
> authors: Sugita,M.
> title: Direct Submission
> journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for
> Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan
> (E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080),
> Fax:81-52-789-3080)
> medline id: 
> pubmed id: 
> comment: 
> 
> --- note: no location in the first one; only a location in the last reference
> (why? - should references have a location?  I suppose they might, if they
> referred to a part of a chromosome?)
> 
> Now, after saving to BioSQL and recovering, all the records have a location,
> but in some cases, it is [None:None]; here are the same two records.
> 
> location: [None:None]
> authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M.
> title: Molecular evidence of an rpoA gene in the basal moss chloroplast
> genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses
> journal: Hikobia 14, 171-175 (2004)
> medline id: 
> pubmed id: 
> comment: 
> 
> location: [0:789]
> authors: Sugita,M.
> title: Direct Submission
> journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for
> Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan
> (E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080),
> Fax:81-52-789-3080)
> medline id: 
> pubmed id: 
> comment: 
> 
> 
> After this, the db.load method calls _load_reference.  
> 
> I think the problem is because the last line doesn't cope with none values.
> If one edits 
> _load_reference to put the last reference inside a test for the null condition
> 
>      if (start is not None and end is not None):        
>             sql = "INSERT INTO bioentry_reference (bioentry_id, reference_id,"
> \
>                   " start_pos, end_pos, rank)" \
>                   " VALUES (%s, %s, %s, %s, %s)"
>             self.adaptor.execute(sql, (bioentry_id, reference_id,
>                                        start, end, rank + 1))
> 
> Then the problem is solved, but I'm not sure how this fits in the bigger scheme
> of things.
> 
> d
> 

The BioSQL loader uses None for "start" and "end" if a reference doesn't have a
location. When the reference is retrieved the location remains set to
["None","None"]

Try this alteration to BioSeq.py, it should solve your problem:
cymon at gyra:~/git/github-master/BioSQL$ git diff BioSeq.py
diff --git a/BioSQL/BioSeq.py b/BioSQL/BioSeq.py
index cc47cf4..8d1e02a 100644
--- a/BioSQL/BioSeq.py
+++ b/BioSQL/BioSeq.py
@@ -351,8 +351,11 @@ def _retrieve_reference(adaptor, primary_id):
     references = []
     for start, end, location, title, authors, dbname, accession in refs:
         reference = SeqFeature.Reference()
-        if start: start -= 1
-        reference.location = [SeqFeature.FeatureLocation(start, end)]
+        if start:
+            start -= 1
+            reference.location = [SeqFeature.FeatureLocation(start, end)]
+        else:
+            reference.location = []
         #Don't replace the default "" with None.
         if authors : reference.authors = authors
         if title : reference.title = title


Here's a patch for the unit test to compare the locations of injected and
retrieved records:
diff --git a/Tests/test_BioSQL_SeqIO.py b/Tests/test_BioSQL_SeqIO.py
index 2d8caf8..9479e02 100644
--- a/Tests/test_BioSQL_SeqIO.py
+++ b/Tests/test_BioSQL_SeqIO.py
@@ -360,6 +360,19 @@ def compare_records(old, new) :
             assert len(old.annotations[key]) == len(new.annotations[key])
             for old_r, new_r in zip(old.annotations[key], new.annotations[key]) :
                 compare_references(old_r, new_r)
+            for old_ref, new_ref in zip(old.annotations[key],
+                    new.annotations[key]):
+                if old_ref.location == []:
+                    assert new_ref.location == [], "old_reference.location %s !=" \
+                        "new_reference location %s" % (old_ref.location,
+                        new_ref.location)
+                else:
+                    assert old_ref.location[0].start == new_ref.location[0].start, \
+                    "old ref.location[0].start %s != new ref.location[0].start %s" % \
+                    (old_ref.location[0].start, new_ref.location[0].start)
+                    assert old_ref.location[0].end == new_ref.location[0].end, \
+                    "old ref.location[0].end %s != new ref.location[0].end %s" % \
+                    (old_ref.location[0].end, new_ref.location[0].end)
         elif key == "comment":
             if isinstance(old.annotations[key], list):
                 old_comment = [comm.replace("\n", " ") for comm in \

Cheers, Cymon



From bugzilla-daemon at portal.open-bio.org  Tue May 26 10:17:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 26 May 2009 10:17:48 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905261417.n4QEHmf9007821@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


------- Comment #4 from cymon.cox at gmail.com  2009-05-26 10:17 EST -------
(In reply to comment #3)
> (In reply to comment #1)

The functions in the old Tests/BioSQL_Seq.py have moved to seq_tests_common.py,
so I've updated seq_tests_common.py:

diff --git a/Tests/seq_tests_common.py b/Tests/seq_tests_common.py
index d3b7fb4..392a96c 100644
--- a/Tests/seq_tests_common.py
+++ b/Tests/seq_tests_common.py
@@ -40,10 +40,17 @@ def compare_references(old_r, new_r) :
     #allow us to store a consortium.
     assert new_r.consrtm == ""

-    #TODO - reference location?
-    #The parser seems to give a location object (i.e. which
-    #nucleotides from the file is the reference for), while the
-    #we seem to use the database to hold the journal details (!)
+    # Reference location
+    if old_r.location == []:
+        assert new_r.location == [], "old_r.location %s != " \
+            "new_r.location %s" % (old_r.location, new_r.location)
+    else:
+        assert old_r.location[0].start == new_r.location[0].start, \
+        "old_r.location[0].start %s != new_r.location[0].start %s" % \
+        (old_r.location[0].start, new_r.location[0].start)
+        assert old_r.location[0].end == new_r.location[0].end, \
+        "old_r.location[0].end %s != new_r.location[0].end %s" % \
+        (old_r.location[0].end, new_r.location[0].end)
     return True

Pushed to http://github.com/cymon/biopython-github-master/tree/bug2840
C.



From bugzilla-daemon at portal.open-bio.org  Tue May 26 13:32:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 26 May 2009 13:32:34 -0400
Subject: [Biopython-dev] [Bug 2841] New: SeqFeature constructor ignores
	qualifiers and sub_features arguments
Message-ID: <bug-2841-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2841

           Summary: SeqFeature constructor ignores qualifiers and
                    sub_features arguments
           Product: Biopython
           Version: 1.50
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: n.j.loman at bham.ac.uk


The constructor to Bio.SeqFeature.SeqFeature ignores qualifiers and
sub_features, although the prototype to the constructor allows these keyword
arguments to be specified.

I see in the code there is a reason for it to be ignored:
        # XXX right now sub_features and qualifiers cannot be set
        # from the initializer because this causes all kinds
        # of recursive import problems. I can't understand why this is
        # at all :-<
        self.qualifiers = {}
        self.sub_features = []

However, would it not be better to get rid of the keyword arguments from the
constructor prototype to stop people getting confused? I keep stumbling over
this problem myself and forgetting about it.
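
As a point of comparison, one common Python idiom for honouring such keyword
arguments is to default them to None and create fresh containers in the
initializer. The class below is a hypothetical stand-in, not
Bio.SeqFeature.SeqFeature, and it says nothing about the recursive import
problem mentioned in the code comment above.

```python
class SketchFeature(object):
    """Hypothetical feature class that keeps the arguments it is given."""

    def __init__(self, location=None, type="", qualifiers=None,
                 sub_features=None):
        self.location = location
        self.type = type
        # None defaults avoid sharing one mutable object between instances.
        if qualifiers is None:
            qualifiers = {}
        if sub_features is None:
            sub_features = []
        self.qualifiers = qualifiers
        self.sub_features = sub_features


feature = SketchFeature(type="CDS", qualifiers={"gene": ["rps11"]})
print(feature.qualifiers)
print(SketchFeature().sub_features)
```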



From bugzilla-daemon at portal.open-bio.org  Wed May 27 03:57:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 27 May 2009 03:57:05 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905270757.n4R7v5iv004300@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


------- Comment #5 from david.wyllie at ndm.ox.ac.uk  2009-05-27 03:57 EST -------
Thank you very much!

I haven't tested the unit tests but the patch in #3 resolves the problem.

With best wishes



From mjldehoon at yahoo.com  Sat May 30 05:37:35 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 30 May 2009 02:37:35 -0700 (PDT)
Subject: [Biopython-dev] More SwissProt inconsistencies
Message-ID: <880385.97797.qm@web62401.mail.re1.yahoo.com>


Looking some more at how Bio.SeqIO and Bio.SwissProt store the information in a SwissProt file, I found the following two inconsistencies:

1) A multi-line author list such as the following:
RA   Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,
RA   Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,
RA   Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,
RA   Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,
RA   Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,
RA   Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,
RA   Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,
RA   Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,
RA   Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,
RA   Barrell B.G., Hall N.;
is stored without newlines by Bio.SeqIO:
>>> seq_record.annotations['references'][0].authors
"Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,Barrell B.G., Hall N.;"
but with newlines by Bio.SwissProt:
>>> swiss_record.references[0].authors
"Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,\nKerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,\nCoulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,\nGardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,\nLarke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,\nNene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,\nRawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,\nSquares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,\nLangsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,\nBarrell B.G., Hall N.;"

To me, the Bio.SeqIO approach seems more reasonable. I think we should add a space, though, wherever there is a newline in the file.
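
A minimal sketch of that joining rule, assuming the usual SwissProt layout of a two-letter line code followed by three spaces and then the text. The function name join_continuation_lines is made up for this example and is not taken from Bio.SeqIO or Bio.SwissProt.

```python
def join_continuation_lines(lines, code):
    """Join the text of all lines carrying the given two-letter line code.

    The first five characters (line code plus padding) are dropped and the
    remaining pieces are joined with a single space, so no words run together
    and no newlines are kept.
    """
    pieces = [line[5:].strip() for line in lines if line.startswith(code)]
    return " ".join(pieces)


ra_lines = [
    "RA   Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,",
    "RA   Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,",
]
authors = join_continuation_lines(ra_lines, "RA")
print(authors)
```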

The same happens for multiline RL lines such as

RL   (In) Baker M.J., Crush J.R., Humphreys L.R. (eds.);
RL   Proceedings of the XVII international grassland congress,
RL   pp.2:1033-1034, Dunmore Press, Palmerston North (1993).

and for multiline RT lines such as

RT   "Genome of the host-cell transforming parasite Theileria annulata
RT   compared with T. parva.";

This is stored by Bio.SeqIO as

'"Genome of the host-cell transforming parasite Theileria annulatacompared with T. parva.";'

and by Bio.SwissProt as

'"Genome of the host-cell transforming parasite Theileria annulata\ncompared with T. parva.";'

whereas I think that both should be stored as

'"Genome of the host-cell transforming parasite Theileria annulata compared with T. parva.";'


2) Comments in a reference such as the following:
RC   STRAIN=cv. VF36; TISSUE=Anther;
are stored as a single string by Bio.SeqIO:
>>> seq_record.annotations['references'][i].comment
'STRAIN=cv. VF36; TISSUE=Anther;'
but as a list of (key, value) pairs by Bio.SwissProt:
[('STRAIN', 'cv. VF36'), ('TISSUE', 'Anther')]
While I think both are reasonable, Bio.SeqIO drops the space between two (key, value) pairs if they are on separate lines:
RC   STRAIN=C57BL/6J;
RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;
is stored as
>>> seq_record.annotations['references'][i].comment
'STRAIN=C57BL/6J;TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;'
I think we should add a space here, or just store these as (key, value) pairs as Bio.SwissProt is doing.
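
For the second option, here is a rough sketch of turning RC lines into (key, value) pairs. It assumes every token has the form KEY=value and ends with a semicolon, and that values contain no semicolons themselves; parse_rc_lines is a hypothetical name, not the Bio.SwissProt implementation.

```python
def parse_rc_lines(lines):
    """Split RC lines into a list of (key, value) pairs.

    The line code and padding (first five characters) are dropped, the
    lines are joined with a space, and the text is split on ';'.
    """
    text = " ".join(line[5:].strip() for line in lines)
    pairs = []
    for token in text.split(";"):
        token = token.strip()
        if not token:
            continue  # skip the empty piece after the final ';'
        key, value = token.split("=", 1)
        pairs.append((key, value))
    return pairs


rc_lines = [
    "RC   STRAIN=C57BL/6J;",
    "RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;",
]
pairs = parse_rc_lines(rc_lines)
print(pairs)
```

Stored this way, the question of whether a space belongs between two pairs does not arise.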

Any objections or comments?

--Michiel


From chapmanb at 50mail.com  Fri May  1 12:11:25 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 1 May 2009 08:11:25 -0400
Subject: [Biopython-dev] MUMmer
In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
Message-ID: <20090501121125.GD50777@sobchak.mgh.harvard.edu>

Marcin;

> I guess I should start with a nice 'hi' to everybody, now that I am
> sending my first message to this group. So: Hi, Everybody! 

Welcome. We are happy to have you.

> Now, that we have the formality out of the way, I will get to the point.
> Recently, I have written some Python code for parsing and processing the
> output of MUMmer tool (http://mummer.sourceforge.net/). More
> specifically, the code I have manages invocations and handles outputs of
> the nucmer pipeline (alignment of multiple closely related nucleotide
> sequences) and of mummer itself (short exact matches). Obviously, the
> results are ultimately rendered as pairs of biopython's Seq objects. 

This is great -- we don't have support for MUMmer alignments so this
is very welcome.

> I use this stuff only myself, in work on bacterial genomes, but I would
> be more than willing to contribute it to the project. It may be rough
> around the edges at the moment, but I think I could easily give it the
> necessary polish if there is interest in having it included. 

As Bartek mentioned, the first step is to organize the code you have
and start it as a branch on GitHub. Being able to see the code will
help us make specific suggestions. Generally, based on what you've
written it sounds like this will fit into the alignment interfaces.
Peter and Cymon have been working on organizing this. Support for
command lines and running programs lives in:

http://github.com/biopython/biopython/tree/master/Bio/Align/Applications

Parsing output and returning alignment objects is organized in the
AlignIO module:

http://github.com/biopython/biopython/tree/master/Bio/AlignIO
http://www.biopython.org/wiki/AlignIO

Tests are an important part of the submission process and many
examples are found here:

http://github.com/biopython/biopython/tree/master/Tests

test_Clustalw.py is an example of a print and compare style test,
and test_Mafft_tool.py is a unittest style test. We are more
concerned with good testing coverage than how exactly the tests get
written.

We can definitely help with more specific feedback but hopefully
this gives you a general idea to get started.

Looking forward to seeing the code,
Brad


From chapmanb at 50mail.com  Fri May  1 12:28:06 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 1 May 2009 08:28:06 -0400
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com>
References: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com>
Message-ID: <20090501122806.GE50777@sobchak.mgh.harvard.edu>

Eric;
Thanks for summarizing the issues. I know Peter is taking a few well
deserved days off but I suspect he will have some thoughts when he
returns. We'd love to hear the experience of others who have used
different python XML parsers.

My lean is towards ElementTree for reasons of code clarity. SAX
parsers require a lot of boilerplate style code. They also can be
tricky with nested elements; I always find myself using a lot of "if
in_tag; else if in_tag" style code. ElementTree eliminates a lot of
these issues which should result in easier to maintain code.
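
For illustration only, here is a minimal sketch of the contrast. The XML
snippet and its element names are made up for the example (they are not
taken from PhyloXML), and the handler class name is hypothetical. Both
halves collect the text of every "name" element.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Made-up input, for illustration only.
XML = "<clades><clade><name>A</name></clade><clade><name>B</name></clade></clades>"


class NameHandler(xml.sax.ContentHandler):
    """SAX style: track by hand whether we are inside a <name> tag."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_name = False
        self.names = []

    def startElement(self, tag, attrs):
        if tag == "name":
            self.in_name = True
            self.names.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_name:
            self.names[-1] += content

    def endElement(self, tag):
        if tag == "name":
            self.in_name = False


handler = NameHandler()
xml.sax.parseString(XML.encode("utf-8"), handler)
sax_names = handler.names

# ElementTree style: the nesting is handled by the tree itself.
et_names = [elem.text for elem in ET.fromstring(XML).iter("name")]

print(sax_names, et_names)
```

The state flag is the "if in_tag" pattern mentioned above; with more
nested element types the handler needs one such flag (or a stack) per
level, while the ElementTree version stays a one-line query.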

Brad

> I'm writing a parser for the PhyloXML format for Google Summer of Code this
> year, and as the name would imply, it requires parsing some large XML files.
> The existing modules in Biopython for parsing XML formats seem to use
> xml.sax in the standard library. In Python 2.5, a faster and more Pythonic
> parser was added to the standard lib: ElementTree (xml.etree), in
> pure-Python and C-enhanced flavors. How do you feel about each of these
> libraries as the basis for a new Biopython module?
> 
> Here are some interesting benchmarks:
> http://effbot.org/zone/celementtree.htm#benchmarks
> 
> The ElementTree library is also available as a standalone package,
> compatible back to Python 2.1, and the lxml package also offers an
> independent implementation. So maintaining compatibility with Python 2.4
> would require the availability of one of these third-party packages, and my
> code would try each of these imports in order:
> 
> from xml.etree import cElementTree as ElementTree
> from xml.etree import ElementTree
> # Separate lxml package
> from lxml.etree import ElementTree
> # Standalone elementtree package
> import cElementTree as ElementTree
> from elementtree import ElementTree
> 
> Then one day, when Python 2.4 is no longer supported, only the first two
> lines would be needed. (The second line is for sites that disable C
> extensions, like Google App Engine, or alternate Python implementations like
> Jython.)
> 
> Another option is xml.parsers.expat, but just Googling around, it appears
> that the Python zeitgeist is strongly in favor of xml.etree for new code.
> 
> Thoughts?
> 
> Thanks,
> Eric
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From marcin.swiatek at mail.mcgill.ca  Fri May  1 18:17:14 2009
From: marcin.swiatek at mail.mcgill.ca (Marcin Swiatek)
Date: Fri, 1 May 2009 14:17:14 -0400
Subject: [Biopython-dev] MUMmer
In-Reply-To: <8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com>
References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
	<8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com>
Message-ID: <176A06E658ED0745965C072C5F2C116A037F084C@EXCHANGE2VS2.campus.mcgill.ca>

Bartek, Brad,

Thank you for the suggestions. I will set myself up as proposed and see
what I can do to align my code with local customs and traditions. If
questions arise, I will post again. 

As for the use of alignment object, I have actually chosen to represent
'candidate' matches by my own simplistic class. Nucmer, the way I use
it, generates lots of spurious matches, which I always need to somehow
filter. Thus, it seemed perfectly reasonable at the time to create the
proper representation of alignment later on, in a separate function
call. Following your suggestion I will probably change it to return an
alignment object, rather than a pair of sequences. But details are best
discussed once the code is available, so I think we will return to this
matter later. 

Regards,

Marcin


-----Original Message-----
From: barwil at gmail.com [mailto:barwil at gmail.com] On Behalf Of Bartek
Wilczynski
Sent: Thursday, April 30, 2009 12:51 PM
To: Marcin Swiatek
Cc: biopython-dev at biopython.org
Subject: Re: [Biopython-dev] MUMmer

Hi Marcin,

On Thu, Apr 30, 2009 at 5:23 PM, Marcin Swiatek
<marcin.swiatek at mail.mcgill.ca> wrote:
> Hello,
>
>
>
> I use this stuff only myself, in work on bacterial genomes, but I would
> be more than willing to contribute it to the project. It may be rough
> around the edges at the moment, but I think I could easily give it the
> necessary polish if there is interest in having it included.
>
Contributions are always welcome

>
>
> Should that be the case, could one of the project leads point me in the
> right direction, please? How should I go about the submission?
>
>
I don't think I qualify as a lead, but nonetheless I think I can help
here.

I think that the best way to submit your code currently is to create a
branch (fork) of
biopython on github and submit your changes there and then notify
people on biopython-dev
that there is new code to review. You can also submit an enhancement
bug to bugzilla.

There are a couple of wiki pages which might be of interest  to you:
- http://biopython.org/wiki/Contributing
- http://biopython.org/wiki/GitUsage

If you have any questions or problems during the process, ask on the
list.

As for the code, I'm not sure, but maybe instead of returning a pair
of sequences, an alignment object might be a better choice?

You might also want to check out some recent code on application wrappers:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html

cheers
  Bartek


From bugzilla-daemon at portal.open-bio.org  Fri May  1 18:16:57 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 1 May 2009 14:16:57 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905011816.n41IGvXO012709@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


------- Comment #8 from eric.talevich at gmail.com  2009-05-01 14:16 EST -------
(In reply to comment #7)
> (In reply to comment #2)
> > Python 2.6 includes a context manager that makes all these problems
> > *completely* go away, by catching all of the warnings raised within a
> > context and optionally storing them as a list of warning objects that
> > can be inspected.
> 
> That sounds much better :)
> 
> > Would you be interested in having a unit test that does a more thorough
> > check of the warnings system, but only runs on Py2.6? I'm guessing no,
> > but hey, worth a shot.
> 
> Yes - other than using the old print-and-compare test, this seems worth doing
> in order to actually test the warnings we expect are being issued.  It could be
> a whole new file, test_PDB_warnings.py which required Python 2.6+, but as its
> just one or two tests, maybe just use conditional method(s) within the
> test_PDB_unit.py file.
> 
> Peter
> 

I have something that works on both Py2.5 and Py2.6 now:
http://github.com/etal/biopython/tree/pdbtidy

I added a new file called _PDB_extra.py which test_PDB_unit.py imports if an
attribute called 'catch_warnings' is available in the current warnings module.
If so, the method test_warnings is added to the class, otherwise nothing
happens. So Py2.6 runs 9 tests in test_PDB_unit.py, while Py2.5 only runs 8.

This seemed easier than creating a whole separate unittest suite for one tricky
test, but I defer to you on the organization and naming. I think I'll need to
do a similar separation of tests for PhyloXML, so I'd like to have a consistent
pattern to follow here.

Also, apparently tests are run in alphabetical order, and Exposure was jumping
ahead of PDBExceptionTest. I renamed PDBExceptionTest to ExceptionTest to
restore the natural order of things and stop setting off the warnings
prematurely. Maybe test suites with multiple TestCase classes should be
arranged alphabetically in the code to avoid confusion in the future.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Mon May  4 10:57:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 06:57:33 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041057.n44AvXil006684@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1288 is|0                           |1
           obsolete|                            |


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 06:57 EST -------
Created an attachment (id=1289)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1289&action=view)
Patch to add keyword arguments and properties to command line wrappers

Brad likes the idea, and as the Bio.Application module owner that's good :)
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005963.html

This patch makes a very slight difference to reduce the changes needed to old
code (i.e. in the __init__ method use self.parameters = [...] as before) with
the bonus that the base class and subclasses have the same __init__ signature
(argument list).

This patch also now covers Bio.Align.Applications, Bio.Motif.Applications and
Bio.AlignAce.Applications as well as Bio.Emboss.Applications (i.e. all affected
files).

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From p.j.a.cock at googlemail.com  Mon May  4 12:02:59 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 May 2009 13:02:59 +0100
Subject: [Biopython-dev] MUMmer
In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
Message-ID: <320fb6e00905040502y4785a0f9t4475ab0868a791c@mail.gmail.com>

On Thu, Apr 30, 2009 at 4:23 PM, Marcin Swiatek
<marcin.swiatek at mail.mcgill.ca> wrote:
> Hello,
>
> I guess I should start with a nice 'hi' to everybody, now that I am
> sending my first message to this group. So: Hi, Everybody!

Hi!

> Now, that we have the formality out of the way, I will get to the point.
> Recently, I have written some Python code for parsing and processing the
> output of MUMmer tool (http://mummer.sourceforge.net/). More
> specifically, the code I have manages invocations and handles outputs of
> the nucmer pipeline (alignment of multiple closely related nucleotide
> sequences) and of mummer itself (short exact matches). Obviously, the
> results are ultimately rendered as pairs of biopython's Seq objects.
>
> I use this stuff only myself, in work on bacterial genomes, but I would
> be more than willing to contribute it to the project. It may be rough
> around the edges at the moment, but I think I could easily give it the
> necessary polish if there is interest in having it included.

Great!  I assume you're OK with our licence, and there are no problems
from your employer/University with a contribution like this?

> Should that be the case, could one of the project leads point me in the
> right direction, please? How should I go about the submission?

In terms of showing us the code, how do you feel about trying out
github (see Bartek's email)?  Alternatively file an enhancement bug
on our bugzilla and upload your current python file (or a zip file if this
is split up into several modules).

From your description above it sounds like you have two main lumps
of code: a pairwise alignment parser, and some command line tool
wrappers.

Brad and Bartek have already mentioned returning Alignment objects,
that would let us integrate MUMmer as an input format for Bio.AlignIO,
http://biopython.org/wiki/AlignIO
It may be helpful to have a look at how we parse FASTA output into
pairwise alignments, and also the EMBOSS "pairs" files from needle
and water.

Although (as Brad mentioned), this is currently undergoing a little flux,
for the command line wrappers I'd like this to use our Bio.Application
framework to represent the command line object, giving a string the
user can then invoke as they prefer.  Having the MUMmer wrapper
under Bio.Align.Applications seems sensible at this point.

If you have been lurking on the dev mailing list for a while, these
topics may be familiar already.  If not, have a look over the last
month or so in the archives here:
http://lists.open-bio.org/pipermail/biopython-dev/

Thanks,

Peter


From p.j.a.cock at googlemail.com  Mon May  4 12:15:04 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 May 2009 13:15:04 +0100
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <20090501122806.GE50777@sobchak.mgh.harvard.edu>
References: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com>
	<20090501122806.GE50777@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>

On Fri, May 1, 2009 at 1:28 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Eric;
> Thanks for summarizing the issues. I know Peter is taking a few well
> deserved days off but I suspect he will have some thoughts when he
> returns. We'd love to hear the experience of others who have used
> different python XML parsers.

I would be interested to hear Michiel's views on this, as he knows
more about the specifics of the existing XML parsers in Biopython
(e.g. Bio.Entrez).

> My lean is towards ElementTree for reasons of code clarity. SAX
> parsers require a lot of boilerplate style code. They also can be
> tricky with nested elements; I always find myself using a lot of "if
> in_tag; else if in_tag" style code. ElementTree eliminates a lot of
> these issues which should result in easier to maintain code.

We have been trying to avoid external library dependencies where
possible (moving away from Martel for parsing has really helped here).
Given ElementTree and cElementTree are included with Python 2.5+,
this is only an issue for Biopython running on Python 2.4.  Both
ElementTree and cElementTree are available as separate downloads
(with Windows installers).  I think under their licence we could even
bundle it with Biopython if need be.

So, while it is a shame ElementTree isn't part of Python 2.4, if it is
the best technical solution, that shouldn't stop us from using it.  Note
we should ONLY use those core features which are included with
Python 2.5+ itself.
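
As a rough sketch, the import order Eric listed could be tried in
sequence like this. It assumes a modern Python with importlib (which
Python 2.4 itself does not have), and the helper name load_element_tree
is made up for the example.

```python
import importlib

# Candidate modules in the order proposed earlier in this thread;
# the first one that imports successfully is used.
CANDIDATES = [
    "xml.etree.cElementTree",   # C-accelerated copy in the standard library
    "xml.etree.ElementTree",    # pure-Python copy in the standard library
    "lxml.etree",               # separate lxml package
    "cElementTree",             # standalone C package
    "elementtree.ElementTree",  # standalone pure-Python package
]


def load_element_tree():
    """Return the first importable ElementTree-compatible module."""
    for name in CANDIDATES:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("No ElementTree implementation found")


ElementTree = load_element_tree()
root = ElementTree.fromstring("<phylogeny><clade name='A'/></phylogeny>")
print(root.tag)
```

Code written against the result should stick to the calls all of these
modules share (fromstring, parse, iterparse and so on), in line with the
point above about using only the core features.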

Peter

P.S. I wonder if our BLAST XML parser would get a big speed boost
if we switched it to ElementTree instead of xml.sax?


From bugzilla-daemon at portal.open-bio.org  Mon May  4 13:47:25 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 09:47:25 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041347.n44DlPQD018238@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1289 is|0                           |1
           obsolete|                            |


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 09:47 EST -------
Created an attachment (id=1290)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1290&action=view)
Patch to add keyword arguments, properties and __repr__ to command line
wrappers

Extended to include __repr__ support (using the new keyword arguments support).

Note that the Muscle wrapper will need an alternative python valid identifier
for the -in argument, e.g. "input", because we can't use just "in" as a
property or keyword argument.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Mon May  4 14:07:57 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 10:07:57 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041407.n44E7vI9020041@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1290 is|0                           |1
           obsolete|                            |


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 10:07 EST -------
Created an attachment (id=1291)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1291&action=view)
Patch to add keyword arguments, properties and __repr__ to command line
wrappers

As in previous patch but with support for clearing parameters by "deleting" the
property, and some basic doctests in Bio.Application.
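
For readers who have not opened the attachment, a minimal sketch of the
general idea, not the patch itself: the classes CommandLine and
WaterSketch and the helper parameter() are hypothetical stand-ins, and
only the gapopen/gapextend names come from the EMBOSS water example.

```python
class CommandLine(object):
    """Hypothetical stand-in for a command line wrapper base class."""

    def __init__(self, cmd, **kwargs):
        self.cmd = cmd
        self._values = {}
        for name, value in kwargs.items():
            # Accept only keywords that match a declared property.
            if not isinstance(getattr(type(self), name, None), property):
                raise ValueError("Unknown parameter %r" % name)
            setattr(self, name, value)

    def __repr__(self):
        args = ["cmd=%r" % self.cmd]
        args += ["%s=%r" % item for item in self._values.items()]
        return "%s(%s)" % (self.__class__.__name__, ", ".join(args))


def parameter(name):
    """Build a property whose value lives in the instance's _values dict."""
    def getter(self):
        return self._values.get(name)

    def setter(self, value):
        self._values[name] = value

    def deleter(self):
        # "Deleting" the property clears the parameter.
        self._values.pop(name, None)

    return property(getter, setter, deleter)


class WaterSketch(CommandLine):
    gapopen = parameter("gapopen")
    gapextend = parameter("gapextend")


cline = WaterSketch("water", gapopen=10, gapextend=0.5)
cline.gapopen = 20
del cline.gapopen
print(repr(cline))
```

After the del, reading cline.gapopen gives None and the parameter drops
out of the repr, while gapextend is untouched.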

Still need to co-ordinate with Cymon to give the Muscle wrapper a valid python
identifier as an alias for the -in argument, e.g. "input", because we can't use
just "in" as a property or keyword argument.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython at maubp.freeserve.co.uk  Mon May  4 14:48:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 4 May 2009 15:48:53 +0100
Subject: [Biopython-dev] Properties in Bio.Application interface?
In-Reply-To: <20090430120532.GA50777@sobchak.mgh.harvard.edu>
References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com>
	<320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com>
	<320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com>
	<20090430120532.GA50777@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905040748w7a0b940aub82220b9c78e7dc3@mail.gmail.com>

On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> I love what you are doing here. The keywords and properties make
> it much more Pythonic; the old way reeks of Java-style get/sets. My
> vote is to put them both in.

Cool - I was hoping people would agree it is more pythonic.

I have some follow up thoughts, or points for discussion ...

Peter


From biopython at maubp.freeserve.co.uk  Mon May  4 14:53:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 4 May 2009 15:53:37 +0100
Subject: [Biopython-dev]  Properties names in command line wrappers
Message-ID: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>

On Mon, May 4, 2009 at 3:48 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> I love what you are doing here. The keywords and properties make
>> it much more Pythonic; the old way reeks of Java-style get/sets. My
>> vote is to put them both in.
>
> Cool - I was hoping people would agree it is more pythonic.
>
> I have some follow up thoughts, or points for discussion ...
>

I updated the patch on Bug 2822 to cover all the Bio.Application
command line wrapper subclasses, and included __repr__ support.
However, that has raised a real example of a parameter where the
current "human readable" name is not a valid python identifier ("in",
for "-in" in Muscle).  I think the pragmatic solution is to add a
sensible alternative which we can use for the property and keyword
argument name (e.g. "input" in this case) while in general keeping
these names as close as possible to the actual parameter name as used
at the command line.

On the other hand, some might argue for giving all the options
meaningful names.  The (hardly used) existing blastall wrapper in
Bio/Blast/Applications.py gives the "-a" argument a human readable
name of "nprocessors", and "-A" gets "window_size". With the old
set_parameter call either alias could be used.  However, with a python
property we need to pick one as a preferred name - and I'm not 100%
sure being helpful and using "nprocessors" (e.g. cline.nprocessors=4)
is actually better than using the actual argument name (e.g. cline.a =
4).

My instinct is that these are low level wrappers, which don't try to
second guess the user.  To take full advantage of any command line
tool you will need to read the tool's documentation to know what the
arguments are - and having Biopython making up its own aliases just
makes things more complicated.  Therefore I think the property names
in the command line wrapper objects should be as close as possible to
the actual command line arguments.  In this case, for blastall use "a"
for number of processors and "A" for window size.

However, I see the existing "helper functions" in
Bio/Blast/NCBIStandalone.py as a higher level wrapper, which tries to
insulate the user from the precise details of the command line string,
and here using an argument name "nprocessors" makes more sense
(although again, it differs from the actual command line making cross
referencing to the NCBI documentation more difficult).

What are your thoughts Brad?

Peter


From biopython at maubp.freeserve.co.uk  Mon May  4 15:03:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 4 May 2009 16:03:17 +0100
Subject: [Biopython-dev]  Switches in the Bio.Application interface
Message-ID: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com>

On Mon, May 4, 2009 at 3:48 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Apr 30, 2009 at 1:05 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> I love what you are doing here. The keywords and properties make
>> it much more Pythonic; the old way reeks of Java-style get/sets. My
>> vote is to put them both in.
>
> Cool - I was hoping people would agree it is more pythonic.
>
> I have some follow up thoughts, or points for discussion ...
>
> Peter
>

It seems sensible to me to allow "deleting" a property to clear it.
There is an example in the proposed Bio/Application/__init__.py
docstring of how this would work:

>>> from Bio.Emboss.Applications import WaterCommandline
>>> cline = WaterCommandline(gapopen=10, gapextend=0.5)
>>> cline
WaterCommandline(cmd='water', gapopen=10, gapextend=0.5)

You can also manipulate the parameters via their properties, e.g.

>>> cline.gapopen
10
>>> cline.gapopen = 20
>>> cline
WaterCommandline(cmd='water', gapopen=20, gapextend=0.5)

You can clear a parameter you have already added by 'deleting' the
corresponding property:

>>> del cline.gapopen
>>> cline.gapopen
>>> cline
WaterCommandline(cmd='water', gapextend=0.5)

That does seem to work and covers most situations, however there is a
special case of command line "switches" (arguments which don't take an
argument, like -kimura in ClustalW, or -l in ls).  There are a lot of
these cases in Cymon's new alignment wrappers.  These worked OK when
used with set_parameter("kimura"), the value is omitted and defaults
to None.  Using the current patch, to set this via the keyword
argument or property, it must explicitly be set to None, which is
ugly:

>>> from Bio.Align.Applications import ClustalwCommandline
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=None, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura

For these "switch" arguments, perhaps the value should be interpreted
as a boolean (should the switch be added or not?).  This would be a
change to the current API, but I don't think any of the existing
wrappers actually have this kind of parameter, so there shouldn't be a
backwards compatibility issue here.  Instead I want to do this:

>>> from Bio.Align.Applications import ClustalwCommandline
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=True, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura
>>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=False, infile="demo.fasta")
clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5

An example use case is to allow parameter searches e.g.

from Bio.Align.Applications import ClustalwCommandline
for gap_open in [0, 1, 2, 10] :
    for gap_extend in [0, 0.25, 0.5] :
        for use_kimura in [True, False] :
            #Won't work yet!:
            cline = ClustalwCommandline(gapopen=gap_open,
gapext=gap_extend, kimura=use_kimura, infile="demo.fasta")
            print cline

Or, modifying and reusing a single command line wrapper object:

from Bio.Align.Applications import ClustalwCommandline
#Set standard options:
cline = ClustalwCommandline(infile="demo.fasta")
#Do parameter sweep:
for gap_open in [0, 1, 2, 10] :
    cline.gapopen = gap_open
    for gap_extend in [0, 0.25, 0.5] :
        cline.gapext = gap_extend
        for use_kimura in [True, False] :
            cline.kimura = use_kimura #Won't work yet!
            print cline
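
Until the wrappers support this, the intended behaviour can be sketched
with a plain function. This is only an illustration of the boolean
switch idea: clustalw_command is a made-up helper, not part of
Bio.Align.Applications, and it simply builds the string.

```python
import itertools


def clustalw_command(infile, gapopen=None, gapext=None, kimura=False):
    """Build a ClustalW-style command string (illustrative helper only)."""
    parts = ["clustalw", "-infile=%s" % infile]
    if gapopen is not None:
        parts.append("-gapopen=%s" % gapopen)
    if gapext is not None:
        parts.append("-gapext=%s" % gapext)
    if kimura:
        # A switch takes no value: it is present when true, omitted otherwise.
        parts.append("-kimura")
    return " ".join(parts)


# The same parameter sweep as above, flattened with itertools.product.
commands = [
    clustalw_command("demo.fasta", gapopen=g, gapext=e, kimura=k)
    for g, e, k in itertools.product([0, 1, 2, 10], [0, 0.25, 0.5], [True, False])
]
print(len(commands))
print(commands[0])
```

Treating the switch as a boolean means the loop variable can be passed
straight through, with no special case for "leave the argument out".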


Peter


From bugzilla-daemon at portal.open-bio.org  Mon May  4 15:29:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 11:29:33 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041529.n44FTXr9025530@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


------- Comment #7 from cymon.cox at gmail.com  2009-05-04 11:29 EST -------
(In reply to comment #6)
> Created an attachment (id=1291)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1291&action=view) [details]
> Patch to add keyword arguments, properties and __repr__ to command line
> wrappers
> 
> As in previous patch but with support for clearing parameters by "deleting" the
> property, and some basic doctests in Bio.Application.
> 
> Still need to co-ordinate with Cymon to give the Muscle wrapper a valid python
> identifier as an alias for the -in argument, e.g. "input", because we can't use
> just "in" as a property or keyword argument.

"input" for -in and maybe also "input1" "input2" as alternatives for -in1 -in2,
might be the way to go, and document it.
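
A small sketch of how such aliases could be checked, assuming a simple
lookup table. The ALIASES dict and property_name function are
hypothetical, not part of the patch; str.isidentifier needs Python 3.

```python
import keyword

# Hypothetical alias table using the names suggested in this comment.
ALIASES = {"in": "input", "in1": "input1", "in2": "input2"}


def property_name(flag):
    """Map a command line flag such as "-in" to a usable Python name."""
    name = flag.lstrip("-")
    name = ALIASES.get(name, name)
    if keyword.iskeyword(name) or not name.isidentifier():
        raise ValueError("%r needs an alias to be usable as a property" % flag)
    return name


print(property_name("-in"))
print(property_name("-gapopen"))
```

Flags that are already valid identifiers pass through unchanged, so the
property names stay as close as possible to the real arguments.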

C. 


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From mjldehoon at yahoo.com  Mon May  4 15:25:17 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 4 May 2009 08:25:17 -0700 (PDT)
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>
Message-ID: <3493.66471.qm@web62406.mail.re1.yahoo.com>


--- On Mon, 5/4/09, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > My lean is towards ElementTree for reasons of code
> clarity. SAX
> > parsers require a lot of boilerplate style code. They
> also can be
> > tricky with nested elements; I always find myself
> using a lot of "if
> > in_tag; else if in_tag" style code. ElementTree
> eliminates a lot of
> > these issues which should result in easier to maintain
> code.

This is partially true. SAX parsers can be complicated, but with some dedication reasonably clear code is also possible. The SAX parser in Bio.Entrez is not all that bad, and it can handle all kinds of different XML pages as long as a DTD is available. The prime motivation for ElementTree is that it's mutable; I don't know if that is really needed in this case. Another thing to consider is what to do with the result returned by ElementTree. Whereas it will contain all the information in the XML file, it may not represent it in a user-friendly way. You may want to take the output from ElementTree and store it in a more biopython-like object. Also keep in mind memory usage: ElementTree will keep the complete XML file in memory, whereas the SAX parser gives you more flexibility here (see below).

That said, I don't have any fundamental objections against using ElementTree.

> 
> We have been trying to avoid external library dependencies
> where
> possible (moving away from Martel for parsing has really
> helped here).
> Given ElementTree and cElementTree are included with Python
> 2.5+,
> this is only an issue for Biopython running on Python 2.4. 

I think it's OK to require Python 2.5 or later for Biopython.

> P.S. I wonder if our BLAST XML parser would get a big speed
> boost if we switched it to ElementTree instead of xml.sax?

I doubt it, since the SAX parser is pretty straightforward -- the hard part is to go through the DTD and find out how to interpret each element in the XML (this is not time-consuming though). The key point though is memory usage. With the SAX parser, you can parse the XML file in chunks, and use an iterator to return individual Blast records -- you don't need to keep the full XML file in memory. The Blast parser NCBIXML.parse does exactly that. With ElementTree, as far as I understand you read in the full XML file and keep it in memory.
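
For comparison, ElementTree also provides an iterparse function, which
may be worth checking against the memory concern. The sketch below uses
a made-up XML layout ("record" and "name" are not BLAST element names)
and a hypothetical iter_records helper; it yields one value per record
and then discards that record's contents.

```python
import io
import xml.etree.ElementTree as ET

# Made-up input, for illustration only.
XML = (b"<records>"
       b"<record><name>query1</name></record>"
       b"<record><name>query2</name></record>"
       b"</records>")


def iter_records(handle, tag):
    """Yield the name of each record as soon as its end tag is seen."""
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == tag:
            yield elem.findtext("name")
            # Drop the children and text of the finished record.
            elem.clear()


names = list(iter_records(io.BytesIO(XML), "record"))
print(names)
```

Note that the cleared elements are still referenced (empty) from the
root, so whether this is frugal enough for very large files would need
measuring.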

--Michiel.


From cy at cymon.org  Mon May  4 15:34:52 2009
From: cy at cymon.org (Cymon Cox)
Date: Mon, 4 May 2009 16:34:52 +0100
Subject: [Biopython-dev] Switches in the Bio.Application interface
In-Reply-To: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com>
References: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com>
Message-ID: <7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com>

2009/5/4 Peter <biopython at maubp.freeserve.co.uk>

> On Mon, May 4, 2009 at 3:48 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
>
> That does seem to work and covers most situation, however there is a
> special case of command line "switches" (arguments which don't take an
> argument, like -kimura in ClustalW, or -l in ls).  There are a lot of
> these cases in Cymon's new alignment wrappers.  These worked OK when
> used with set_parameter("kimura"), the value is omitted and defaults
> to None.  Using the current patch, to set this via the keyword
> argument or property, it must explicitly be set to None, which is
> ugly:
>
> >>> from Bio.Align.Applications import ClustalwCommandline
> >>> print ClustalwCommandline(gapopen=2, gapext=0.5, infile="demo.fasta")
> clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5
> >>> print ClustalwCommandline(gapopen=2, gapext=0.5, kimura=None,
> infile="demo.fasta")
> clustalw -infile=demo.fasta -gapopen=2 -gapext=0.5 -kimura


Ugly, and very confusing.


> For these "switch" arguments, perhaps the value should be interpreted
> as a boolean (should the switch be added or not?).


This is what I did in my Muscle helper functions - so makes sense to me...
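
As a rough illustration of the boolean idea, here is a standalone sketch. It is
not the Bio.Application code; the build_command helper and its switches
argument are hypothetical names used only for this example.

```python
def build_command(program, switches=(), **kwargs):
    """Build a command line string.

    Names listed in `switches` are treated as booleans: the switch is
    added when the value is true and left out otherwise.
    """
    parts = [program]
    for name, value in kwargs.items():
        if name in switches:
            # A switch takes no value, so only its presence matters.
            if value:
                parts.append("-%s" % name)
        else:
            parts.append("-%s=%s" % (name, value))
    return " ".join(parts)


cmd = build_command("clustalw", switches=("kimura",),
                    infile="demo.fasta", gapopen=2, gapext=0.5, kimura=True)
print(cmd)
```

With this reading, kimura=True adds -kimura and kimura=False omits it, so
nobody has to pass None to turn a switch on.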

C.
-- 
____________________________________________________________________

Cymon J. Cox

Centro de Ciencias do Mar
Faculdade de Ciencias do Mar e Ambiente (FCMA)
Universidade do Algarve
Campus de Gambelas
8005-139 Faro
Portugal

Phone: +0351 289800909 ext 7909
Fax: +0351 289800051
Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com
HomePage : http://biology.duke.edu/bryology/cymon.html
-8.63/-6.77


From p.j.a.cock at googlemail.com  Mon May  4 15:45:12 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 May 2009 16:45:12 +0100
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <3493.66471.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>
	<3493.66471.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com>

Brad wrote:
>>> My lean[ing] is towards ElementTree for reasons of code
>>> clarity. SAX parsers require a lot of boilerplate style code.
>>> They also can be tricky with nested elements; I always
>>> find myself using a lot of "if in_tag; else if in_tag" style
>>> code. ElementTree eliminates a lot of these issues
>>> which should result in easier to maintain code.

Michiel wrote:
> This is partially true. SAX parsers can be complicated, but
> with some dedication reasonably clear code is also possible.
> The SAX parser in Bio.Entrez is not all that bad, and it can
> handle all kinds of different XML pages as long as a DTD
> is available. The prime motivation for ElementTree is that
> it's mutable; I don't know if that is really needed in this case.

Eric will have to answer that regarding PhyloXML, but if the
aim is to turn it into one of our existing tree objects, then
having the XML structure mutable is irrelevant.

> Another thing to consider is what to do with the result
> returned by ElementTree. Whereas it will contain all the
> information in the XML file, it may not represent it in a
> user-friendly way. You may want to take the output from
> ElementTree and store it in a more biopython-like object.
> Also keep in mind memory usage: ElementTree will keep
> the complete XML file in memory, whereas the SAX
> parser gives you more flexibility here (see below).

Something for Eric to consider.

Michiel wrote:
> That said, I don't have any fundamental objections
> against using ElementTree.

Peter wrote:
>> We have been trying to avoid external library dependencies
>> where possible (moving away from Martel for parsing has
>> really helped here). Given ElementTree and cElementTree
>> are included with Python 2.5+, this is only an issue for
>> Biopython running on Python 2.4.
>
> I think it's OK to require Python 2.5 or later for Biopython.

At this stage I disagree, Python 2.4 would still be widely
used on production servers running stable distributions.
Also we'd have to give a couple of releases notice about
dropping Python 2.4 support.  In any case, if we want to
use ElementTree with Python 2.4 this is possible.

Peter wrote:
>> P.S. I wonder if our BLAST XML parser would get a big speed
>> boost if we switched it to ElementTree instead of xml.sax?
>
> I doubt it, since the SAX parser is pretty straightforward --
> the hard part is to go through the DTD and find out how to
> interpret each element in the XML (this is not
> time-consuming though). The key point though is memory
> usage. With the SAX parser, you can parse the XML file in
> chunks, and use an iterator to return individual Blast records
> -- you don't need to keep the full XML file in memory. The
> Blast parser NCBIXML.parse does exactly that. With
> ElementTree, as far as I understand you read in the full
> XML file and keep it in memory.

Keeping a full BLAST XML file in memory would be a bad idea,
and would spoil the memory savings of the iterator approach
to parsing it.  So ElementTree isn't suitable for everything ;)
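
For illustration, the chunk-and-iterate approach with xml.sax can be sketched
roughly as below. This is a simplified standalone example, not the NCBIXML
code; the record/name element names and the iter_records function are made up.

```python
import io
import xml.sax


class RecordHandler(xml.sax.ContentHandler):
    """Collect the text of each <name> element (hypothetical tag name)."""

    def __init__(self):
        super().__init__()
        self.finished = []   # names completed but not yet handed out
        self._text = None    # text buffer while inside a <name> element

    def startElement(self, tag, attrs):
        if tag == "name":
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.finished.append("".join(self._text))
            self._text = None


def iter_records(handle, chunk_size=64):
    """Yield names one at a time while feeding the parser small chunks."""
    parser = xml.sax.make_parser()
    handler = RecordHandler()
    parser.setContentHandler(handler)
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand back whatever this chunk completed, then forget it.
        while handler.finished:
            yield handler.finished.pop(0)
    parser.close()
    # The parser may hold some input back until close(), so drain again.
    while handler.finished:
        yield handler.finished.pop(0)


xml_text = "<results>" + "".join(
    "<record><name>hit%d</name></record>" % i for i in range(5)
) + "</results>"
names = list(iter_records(io.StringIO(xml_text)))
print(names)
```

Only the current chunk and any not-yet-yielded records are held at one time,
which is where the memory saving comes from.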

Peter


From biopython at maubp.freeserve.co.uk  Mon May  4 15:47:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 4 May 2009 16:47:58 +0100
Subject: [Biopython-dev] Switches in the Bio.Application interface
In-Reply-To: <7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com>
References: <320fb6e00905040803v78029f26k48b8e6908e40cc5@mail.gmail.com>
	<7265d4f0905040834p419f49c5ka33ceef6f7dcab19@mail.gmail.com>
Message-ID: <320fb6e00905040847s32bc9e4fr3f7fb045b2d3429b@mail.gmail.com>

On Mon, May 4, 2009 at 4:34 PM, Cymon Cox <cy at cymon.org> wrote:
>
>> For these "switch" arguments, perhaps the value should be interpreted
>> as a boolean (should the switch be added or not?).
>
> This is what I did in my Muscle helper functions - so makes sense to me...
>

Good :)

Peter


From bugzilla-daemon at portal.open-bio.org  Mon May  4 16:29:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 12:29:10 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041629.n44GTAeq030521@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1291 is|0                           |1
           obsolete|                            |


------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 12:29 EST -------
(From update of attachment 1291)
Checked into CVS:

Checking in Tests/test_Prank_tool.py;
/home/repository/biopython/biopython/Tests/test_Prank_tool.py,v  <-- 
test_Prank_tool.py
new revision: 1.5; previous revision: 1.4
done
Checking in Tests/test_Muscle_tool.py;
/home/repository/biopython/biopython/Tests/test_Muscle_tool.py,v  <-- 
test_Muscle_tool.py
new revision: 1.7; previous revision: 1.6
done
Checking in Tests/test_Emboss.py;
/home/repository/biopython/biopython/Tests/test_Emboss.py,v  <-- 
test_Emboss.py
new revision: 1.20; previous revision: 1.19
done
Checking in Tests/test_Clustalw_tool.py;
/home/repository/biopython/biopython/Tests/test_Clustalw_tool.py,v  <-- 
test_Clustalw_tool.py
new revision: 1.13; previous revision: 1.12
done
Checking in Bio/Application/__init__.py;
/home/repository/biopython/biopython/Bio/Application/__init__.py,v  <-- 
__init__.py
new revision: 1.15; previous revision: 1.14
done
Checking in Bio/Emboss/Applications.py;
/home/repository/biopython/biopython/Bio/Emboss/Applications.py,v  <-- 
Applications.py
new revision: 1.23; previous revision: 1.22
done
Checking in Bio/AlignAce/Applications.py;
/home/repository/biopython/biopython/Bio/AlignAce/Applications.py,v  <-- 
Applications.py
new revision: 1.5; previous revision: 1.4
done
Checking in Bio/Motif/Applications/_AlignAce.py;
/home/repository/biopython/biopython/Bio/Motif/Applications/_AlignAce.py,v  <--
 _AlignAce.py
new revision: 1.3; previous revision: 1.2
done
Checking in Bio/Align/Applications/_Clustalw.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Clustalw.py,v  <--
 _Clustalw.py
new revision: 1.5; previous revision: 1.4
done
Checking in Bio/Align/Applications/_Mafft.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Mafft.py,v  <-- 
_Mafft.py
new revision: 1.4; previous revision: 1.3
done
Checking in Bio/Align/Applications/_Muscle.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Muscle.py,v  <-- 
_Muscle.py
new revision: 1.6; previous revision: 1.5
done
Checking in Bio/Align/Applications/_Prank.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Prank.py,v  <-- 
_Prank.py
new revision: 1.4; previous revision: 1.3
done

(In reply to comment #7)
> (In reply to comment #6)
> > Still need to co-ordinate with Cymon to give the Muscle wrapper a valid
> > python identifier as an alias for the -in argument, e.g. "input", because
> > we can't use just "in" as a property or keyword argument.
> 
> "input" for -in and maybe also "input1" "input2" as alternatives for -in1
> -in2, might be the way to go, and document it.

I've used "input" as the preferred alias for "-in".

Leaving this bug open to cover dealing with "switch" arguments like -kimura in
clustalw, where it makes sense to treat the value as a boolean (see dev mailing
list).

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Mon May  4 17:48:28 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 13:48:28 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041748.n44HmSaN003712@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #23 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 13:48 EST -------
In Prank, should realbranches take no arguments?  i.e. use the new _Switch
class?




From bugzilla-daemon at portal.open-bio.org  Mon May  4 17:49:20 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 4 May 2009 13:49:20 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
	properties and kwargs
In-Reply-To: <bug-2822-42@http.bugzilla.open-bio.org/>
Message-ID: <200905041749.n44HnK8j003766@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822


------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-04 13:49 EST -------
(In reply to comment #8)
> Leaving this bug open to cover dealing with "switch" arguments like -kimura in
> clustalw, where it makes sense to treat the value as a boolean (see dev mailing
> list).

Done in CVS, I think.  Next, more tests and documentation...

Checking in Bio/Application/__init__.py;
/home/repository/biopython/biopython/Bio/Application/__init__.py,v  <-- 
__init__.py
new revision: 1.16; previous revision: 1.15
done
Checking in Bio/Align/Applications/_Clustalw.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Clustalw.py,v  <--
 _Clustalw.py
new revision: 1.6; previous revision: 1.5
done
Checking in Bio/Align/Applications/_Mafft.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Mafft.py,v  <-- 
_Mafft.py
new revision: 1.5; previous revision: 1.4
done
Checking in Bio/Align/Applications/_Muscle.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Muscle.py,v  <-- 
_Muscle.py
new revision: 1.7; previous revision: 1.6
done
Checking in Bio/Align/Applications/_Prank.py;
/home/repository/biopython/biopython/Bio/Align/Applications/_Prank.py,v  <-- 
_Prank.py
new revision: 1.5; previous revision: 1.4
done
Checking in Tests/test_Clustalw_tool.py;
/home/repository/biopython/biopython/Tests/test_Clustalw_tool.py,v  <-- 
test_Clustalw_tool.py
new revision: 1.14; previous revision: 1.13
done
Checking in Tests/test_Muscle_tool.py;
/home/repository/biopython/biopython/Tests/test_Muscle_tool.py,v  <-- 
test_Muscle_tool.py
new revision: 1.8; previous revision: 1.7
done
Checking in Tests/test_Prank_tool.py;
/home/repository/biopython/biopython/Tests/test_Prank_tool.py,v  <-- 
test_Prank_tool.py
new revision: 1.6; previous revision: 1.5
done




From bugzilla-daemon at portal.open-bio.org  Tue May  5 12:04:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 5 May 2009 08:04:09 -0400
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To: <bug-2294-42@http.bugzilla.open-bio.org/>
Message-ID: <200905051204.n45C4987022142@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294


------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-05 08:04 EST -------
Created an attachment (id=1292)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1292&action=view)
Patch to Bio/SeqIO/InsdcIO.py to write GenBank features

This patch adds basic support for writing features in GenBank files.

There is still plenty to do:
* Full testing, both manual and with extended unit test coverage
* Wrapping long feature locations
* Writing references
* Extending to cover writing EMBL files

Note that this requires the latest Bio.GenBank code from CVS, as during this
work I found and fixed two small issues with the location parsing.




From chapmanb at 50mail.com  Tue May  5 12:36:57 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 5 May 2009 08:36:57 -0400
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
Message-ID: <20090505123656.GB15113@sobchak.mgh.harvard.edu>

Hi Peter;
Nice to have you back. Hope you had a relaxing few days away.

> I updated the patch on Bug 2822 to cover all the Bio.Application
> command line wrapper subclasses, and included __repr__ support.
> However, that has raised a real example of a parameter where the
> current "human readable" name is not a valid python identifier ("in",
> for "-in" in Muscle).  I think the pragmatic solution is to add a
> sensible alternative which we can use for the property and keyword
> argument name (e.g. "input" in this case) while in general keeping
> these names as close as possible to the actual parameter name as used
> at the command line.

Agreed. This is the best solution for these few conflicting cases.
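
To make the conflict concrete, here is a small sketch of how a wrapper could
detect such names. The property_name helper is hypothetical and is not part of
Bio.Application; it only uses the standard keyword module.

```python
import keyword


def property_name(switch, alias=None):
    """Return a usable Python name for a command line switch.

    The switch name is kept as-is when it is a valid identifier;
    otherwise the caller must supply an alias (e.g. "input" for "-in").
    """
    name = switch.lstrip("-")
    if keyword.iskeyword(name) or not name.isidentifier():
        if alias is None:
            raise ValueError("%r needs an alias" % switch)
        return alias
    return name


print(property_name("-in", alias="input"))  # "in" is a reserved word
print(property_name("-gapopen"))            # already a valid name
```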

> On the other hand, some might argue for giving all the options
> meaningful names.  The (hardly used) existing blastall wrapper in
> Bio/Blast/Applications.py gives the "-a" argument a human readable
> name of "nprocessors", and "-A" gets "window_size". With the old
> set_parameter call either alias could be used.  However, with a python
> property we need to pick one as a preferred name - and I'm not 100%
> sure being helpful and using "nprocessors" (e.g. cline.nprocessors=4)
> is actually better than using the actual argument name (e.g. cline.a =
> 4).

Could we support both the original argument and optional human
readable arguments? I know the code in Application is a bit
hard coded for the first argument as the real name and the last
argument as the readable name; the cleanest solution would be to
generalize this to have multiple names where it makes sense.

More practically, it always makes sense to have the low level
standard arguments from the program itself. Even if it is
non-intuitive like BLASTs switches, people who already understand
the program can just use their existing knowledge without any
specific knowledge of how Biopython works. Where someone wants to
support more useful names, they can add those in.

You have been digging around in this so probably have a good idea
how hard this is to implement practically. If it's a pain, I'd argue
to just have the original arguments now, and the useful names can go
on a todo list.

Brad


From chapmanb at 50mail.com  Tue May  5 12:50:59 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 5 May 2009 08:50:59 -0400
Subject: [Biopython-dev] XML parsing library for new modules
In-Reply-To: <320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com>
References: <320fb6e00905040515q63312065oec904ed3f5ffab93@mail.gmail.com>
	<3493.66471.qm@web62406.mail.re1.yahoo.com>
	<320fb6e00905040845g67219f36r563f4dfa1b080125@mail.gmail.com>
Message-ID: <20090505125058.GC15113@sobchak.mgh.harvard.edu>

Peter, Michiel and Eric;

> > Another thing to consider is what to do with the result
> > returned by ElementTree. Whereas it will contain all the
> > information in the XML file, it may not represent it in a
> > user-friendly way. You may want to take the output from
> > ElementTree and store it in a more biopython-like object.

Agreed. Most of the fun creative parts of the project, as opposed to
the parsing nuts and bolts, will be in developing the object
representations.

> > Also keep in mind memory usage: ElementTree will keep
> > the complete XML file in memory, whereas the SAX
> > parser gives you more flexibility here (see below).

ElementTree can do incremental parsing, so you can also deal with
large files using it:

http://effbot.org/zone/element-iterparse.htm
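
A minimal sketch of that pattern, using the ElementTree module from the
standard library. The record/name element names and the iter_hit_names
function are invented for this example.

```python
import io
import xml.etree.ElementTree as ET


def iter_hit_names(handle):
    """Yield the <name> text of each <record>, freeing records as we go."""
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "record":
            yield elem.findtext("name")
            # Drop this record's children and text once we are done with it.
            elem.clear()


data = b"<results>" + b"".join(
    b"<record><name>hit%d</name></record>" % i for i in range(3)
) + b"</results>"
hits = list(iter_hit_names(io.BytesIO(data)))
print(hits)
```

Note that the root element still keeps a reference to each (now empty) record
element, so for very large files those may need to be removed from the root
as well.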

Brad


From biopython at maubp.freeserve.co.uk  Tue May  5 13:58:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 May 2009 14:58:04 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <20090505123656.GB15113@sobchak.mgh.harvard.edu>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
	<20090505123656.GB15113@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905050658h2cabf55dhfbb467042135843a@mail.gmail.com>

On Tue, May 5, 2009 at 1:36 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Could we support both the original argument and optional human
> readable arguments? I know the code in Application is a bit
> hard coded for the first argument as the real name and the last
> argument as the readable name; the cleanest solution would be to
> generalize this to have multiple names where it makes sense.

You mean for these BLAST examples, create two properties "a" and
"nprocessors", both controlling the "-a" parameter, and also two
properties "A" and "window_size" both controlling "-A"?  From a code
point of view, this would be moderately straightforward - but I'm not
convinced about this.

> More practically, it always makes sense to have the low level
> standard arguments from the program itself. Even if it is
> non-intuitive like BLASTs switches, people who already understand
> the program can just use their existing knowledge without any
> specific knowledge of how Biopython works.

Yes :)

Personally I initially found it very frustrating when using the
Bio.Blast.NCBIStandalone.blastall wrapper because the NCBI switches
had all been given friendly names, and it wasn't clear without looking
at the source code what mapped to what.  As a minor change, I think
the Bio.Blast.NCBIStandalone.blastall docstring should actually
include the real NCBI switch used by each Biopython keyword.

> Where someone wants to support more useful names, they can
> add those in.

So that we cater to those familiar with the NCBI command line
arguments, but also give a more human alternative?  On the downside,
it means there are two ways to set these parameters.  Also, if we go
down this route for consistency for all command line wrappers we may
want to invent more human readable aliases (if the tool arguments are
too cryptic).  We are also opening up a potential problem if the tool
later adds a new argument whose name clashes with one of our
inventions.  Also would we care about the lack of consistency between
tools (e.g. infile versus input?), and should we try and be consistent
in our new names?

I favour using only a single property for each parameter, with the
name as similar as possible to the actual command line switch (i.e.
property name "a" for "-a", not "nprocessors").  Note each property
would have a docstring which will say what it is for ("Number of
processors to use.").

In the case of the existing blastall wrapper in
Bio.Blast.Applications, I would change names=["-a", "nprocessors"]
to ["-a", "nprocessors", "a"], meaning "a" (last entry) would be the
property name used, "-a" (first entry) would be used for the actual
command line string.  I would keep the "nprocessors" alias for
backwards compatibility only - all three aliases would be available to
the (legacy) method set_parameter.
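
The naming convention can be sketched as follows. The Option and CommandLine
classes here are simplified stand-ins written for this illustration, not the
actual Bio.Application classes.

```python
class Option(object):
    """One command line option with several aliases (hypothetical class).

    names[0] is the string used on the command line, names[-1] is the
    preferred property name, anything in between is a legacy alias.
    """

    def __init__(self, names, description):
        self.names = names
        self.description = description
        self.value = None

    @property
    def switch(self):
        return self.names[0]

    @property
    def property_name(self):
        return self.names[-1]


class CommandLine(object):
    """Minimal command line builder accepting any alias in set_parameter."""

    def __init__(self, program, options):
        self.program = program
        self.options = options

    def set_parameter(self, name, value):
        for option in self.options:
            if name in option.names:
                option.value = value
                return
        raise ValueError("Option name %s was not found." % name)

    def __str__(self):
        parts = [self.program]
        for option in self.options:
            if option.value is not None:
                parts.append("%s %s" % (option.switch, option.value))
        return " ".join(parts)


nproc = Option(["-a", "nprocessors", "a"], "Number of processors to use.")
cline = CommandLine("blastall", [nproc])
cline.set_parameter("nprocessors", 4)  # the legacy alias still works
print(cline)
```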

> You have been digging around in this so probably have a good idea
> how hard this is to implement practically. If it's a pain, I'd argue
> to just have the original arguments now, and the useful names can go
> on a todo list.

It is certainly possible, although probably a bit tedious due to
changing the "boilerplate" code.

Peter


From bugzilla-daemon at portal.open-bio.org  Tue May  5 14:37:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 5 May 2009 10:37:56 -0400
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To: <bug-2294-42@http.bugzilla.open-bio.org/>
Message-ID: <200905051437.n45EbuNA006427@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1292 is|0                           |1
           obsolete|                            |


------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-05 10:37 EST -------
(From update of attachment 1292)
Checked into CVS now.




From p.j.a.cock at googlemail.com  Tue May  5 15:26:20 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 May 2009 16:26:20 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was
	Bio.GFF)
In-Reply-To: <320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
	<320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>
Message-ID: <320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com>

On Tue, Apr 21, 2009 at 2:55 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> I have also been thinking about how I would (re)design the SeqFeature
>> and FeatureLocation objects.  In particular I would want to put the
>> strand as part of the same object as the location, and also any
>> join-locations.  I would still want to cope with fuzzy locations, but
>> make the non-fuzzy approximations more prominent in comparison.  Also,
>> I really don't like the way joins are currently stored as more
>> SeqFeatures in the sub_features list (plus this kind of blocks
>> alternative usage for child/parent nesting that might be nice for GFF
>> files).
>>
>> The prime use case to keep in mind is taking a feature location (even
>> a join), and using this to extract that region of nucleotides from the
>> parent sequence (i.e. a Seq object or a SeqRecord object, as now both
>> can be sliced).

I've written code to do this in test_SeqIO_features.py, which cross
checks the nucleotides pulled out from GenBank files based on the
SeqFeature, against what the NCBI provide in FASTA format.  This seems
to work OK, but has not been tested extensively (e.g. running it on
drosophila or arabidopsis would be good).

It could make sense to expose this functionality directly in
Biopython, maybe as a method of the SeqRecord taking a SeqFeature (or
the index of a feature in that record), returning a Seq object (or
perhaps a SeqRecord using the feature's annotation).

e.g.
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"),"genbank")
>>> record.extract_feature_seq(6)
Seq('GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAG...TAA',
IUPACAmbiguousDNA())
>>> feature = record.features[6]
>>> record.extract_feature_seq(feature)
Seq('GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAG...TAA',
IUPACAmbiguousDNA())

Alternatively, rather than introducing a new method (e.g.
"extract_feature_seq" as in the above example) we could overload the
__getitem__ method of the SeqRecord, i.e. overloading the slice
mechanism so a SeqFeature can alternatively be given, e.g.
record[feature].  Note that passing the index of a feature wouldn't
work as record[6] currently means the seventh letter, rather than the
seventh feature.

Note that just passing a SeqFeature's FeatureLocation is not enough,
as this lacks the strand information, and also any sub-features and
associated location operator (i.e. join).

> I forgot to mention the second major use case I'm concerned about,
> which is recovering the GenBank/EMBL style location string.  I have
> looked at this in the past, by adding methods to the FeatureLocation
> and all the Position objects, but it is complicated by the fact the
> Position objects don't know if they are at the start or end (and for
> the start locations we need to add one to convert from Python
> counting).  This is the main block on having Bio.SeqIO support writing
> GenBank (or EMBL) files with their features included.

See Bug 2294 for writing GenBank files:
http://bugzilla.open-bio.org/show_bug.cgi?id=2294
I've just checked in some code to record the features when writing
GenBank files with Bio.SeqIO.  I solved the feature location issue by
introducing a private function which knows about all the currently
used AbstractPosition objects - the code is actually pretty short.
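
The counting conversion can be sketched like this. It is a simplified
illustration and not the code checked into Bio.SeqIO: it ignores fuzzy
positions entirely, and location_string is a made-up name taking plain
(start, end, strand) tuples in Python counting.

```python
def location_string(parts):
    """Format (start, end, strand) tuples as a GenBank style location.

    start and end use Python counting (zero based, end exclusive), so
    only the start needs one added to give the GenBank coordinate.
    """
    def one(start, end, strand):
        text = "%d..%d" % (start + 1, end)
        if strand == -1:
            text = "complement(%s)" % text
        return text

    pieces = [one(start, end, strand) for start, end, strand in parts]
    if len(pieces) == 1:
        return pieces[0]
    return "join(%s)" % ",".join(pieces)


print(location_string([(0, 9, 1)]))
print(location_string([(0, 9, -1), (20, 30, 1)]))
```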

Peter


From p.j.a.cock at googlemail.com  Tue May  5 16:41:31 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 May 2009 17:41:31 +0100
Subject: [Biopython-dev] Dropping Python 2.3 support in Biopython
Message-ID: <320fb6e00905050941m37725eb5ibba02ca99236212e@mail.gmail.com>

Hello all,

This is a final warning that the next release of Biopython will not
support Python 2.3.

As far as we are aware, no-one has come forward with a need for
continued support for Python 2.3, so we will soon begin removing the
special case code needed to keep Biopython working on Python 2.3.
This will give us a simpler code base, fewer platforms to test on, and
we can also take advantage of various language features only available
in Python 2.4+ (e.g. generator expressions and decorators).

Any last minute requests to postpone this should be made to the main
Biopython mailing list by Friday 8 May.

Thank you,

Peter


From sbassi at clubdelarazon.org  Tue May  5 22:49:11 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 5 May 2009 19:49:11 -0300
Subject: [Biopython-dev] Missing directories with easy_install?
Message-ID: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com>

When I install Biopython 1.5 (and previous versions too) using
easy_install, it seems that docs, test and scripts directories are not
installed (see here for a screenshot, panel at left is easy_install
product while right panel is when I manually uncompress biopython
tarball: http://www.genesdigitales.com/bioinfo/biopy.jpg).
Is this expected or an oversight?


From biopython at maubp.freeserve.co.uk  Tue May  5 22:56:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 May 2009 23:56:00 +0100
Subject: [Biopython-dev] Missing directories with easy_install?
In-Reply-To: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com>
References: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com>
Message-ID: <320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com>

On Tue, May 5, 2009 at 11:49 PM, Sebastian Bassi
<sbassi at clubdelarazon.org> wrote:
> When I install Biopython 1.5 (and previous versions too) using
> easy_install, it seems that docs, test and scripts directories are not
> installed (see here for a screenshot, panel at left is easy_install
> product while right panel is when I manually uncompress biopython
> tarball: http://www.genesdigitales.com/bioinfo/biopy.jpg).
> Is this expected or an oversight?

You'd have to ask Brad for an expert opinion, but I think this is
probably to be expected.  If you install from source, the only folders
copied to site-packages are Bio, BioSQL, and Martel.

See also this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005924.html

Peter

P.S. I assume you meant Biopython 1.50 and not 1.5 ;)


From sbassi at clubdelarazon.org  Tue May  5 23:05:46 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 5 May 2009 20:05:46 -0300
Subject: [Biopython-dev] Missing directories with easy_install?
In-Reply-To: <320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com>
References: <9e2f512b0905051549w688195f6w719572fe42ee2678@mail.gmail.com>
	<320fb6e00905051556g2821e4f0k3db66ec545dcd399@mail.gmail.com>
Message-ID: <9e2f512b0905051605k663035d7td84372847675c7d4@mail.gmail.com>

On Tue, May 5, 2009 at 7:56 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> You'd have to ask Brad for an expert opinion, but I think this is
> probably to be expected.  If you install from source, the only folders
> copied to site-packages are Bio, BioSQL, and Martel.
> See also this thread:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005924.html

OK, so that is.

> P.S. I assume you meant Biopython 1.50 and not 1.5 ;)

yes!.


From biopython at maubp.freeserve.co.uk  Tue May  5 23:33:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 May 2009 00:33:16 +0100
Subject: [Biopython-dev] SeqRecord per-letter-annotation : avoid lists?
Message-ID: <320fb6e00905051633i70604746i332b3bfaf3476876@mail.gmail.com>

Hi all,

I was thinking about the SeqRecord object's letter_annotations,
and that perhaps we should only allow strings and tuples (which are
immutable), but not lists.  Because lists are mutable, the user can
(accidentally) alter the list such that its length doesn't match that
of the associated sequence (which would be bad). Currently we do use
lists in the SeqRecord's letter_annotations, e.g. for qualities. I
don't recall having any particular reason for using a list rather than
a tuple.
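
A tiny standalone demonstration of the concern, using plain Python objects
rather than a SeqRecord (the variable names are just for this example):

```python
sequence = "ACGT"

# A list can be changed in place, so its length can drift away from
# the length of the sequence without any error being raised.
qualities_list = [40, 38, 35, 30]
qualities_list.append(20)
mismatch = len(qualities_list) != len(sequence)

# A tuple has no in-place methods such as append, so the same mistake
# fails immediately.
qualities_tuple = (40, 38, 35, 30)
try:
    qualities_tuple.append(20)
    tuple_changed = True
except AttributeError:
    tuple_changed = False

print(mismatch, tuple_changed)
```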

Any thoughts on this?

Peter


From p.j.a.cock at googlemail.com  Wed May  6 10:32:01 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 6 May 2009 11:32:01 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was
	Bio.GFF)
In-Reply-To: <320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
	<320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>
	<320fb6e00905050826y5ae0b13eiaa6d9e56fd9049e9@mail.gmail.com>
Message-ID: <320fb6e00905060332t2b9d9595pca68b83db8cef28f@mail.gmail.com>

On Tue, May 5, 2009 at 4:26 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Apr 21, 2009 at 2:55 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> The prime use case to keep in mind is taking a feature location (even
>>> a join), and using this to extract that region of nucleotides from the
>>> parent sequence (i.e. a Seq object or a SeqRecord object, as now both
>>> can be sliced).
>
> I've written code to do this in test_SeqIO_features.py, which cross
> checks the nucleotides pulled out from GenBank files based on the
> SeqFeature, against what the NCBI provide in FASTA format.  This seems
> to work OK, but has not been tested extensively (e.g. running it on
> drosophila or arabidopsis would be good).

Yep - found a corner case my code can't yet cope with, from the
Arabidopsis thaliana chloroplasts (NC_000932).  This has some
pathological mixed strand locations, like
join(complement(69611..69724),139856..140650) which is for a
trans-spliced ribosomal protein.
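
As a rough illustration of what a mixed strand join involves, here is a
sketch on plain strings (Python 3).  The function extract_join and its
(start, end, strand) tuples are hypothetical, not the code in
test_SeqIO_features.py; it assumes 1-based inclusive GenBank
coordinates and that each complement() part is reverse-complemented on
its own before the parts are joined in the order listed.

```python
# Hypothetical helper for illustration; not part of Biopython.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_join(parent, parts):
    """Extract a join(...) location from parent.

    parts is a list of (start, end, strand) tuples using 1-based,
    inclusive GenBank coordinates, in the order given in the join.
    A strand of -1 marks a complement(...) part.
    """
    pieces = []
    for start, end, strand in parts:
        piece = parent[start - 1:end]  # convert to Python slicing
        if strand == -1:
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


parent = "AACCGGTTAC"
# Toy equivalent of join(complement(1..4),7..10)
result = extract_join(parent, [(1, 4, -1), (7, 10, +1)])
print(result)
```

For the chloroplast example above, the parts list would be
[(69611, 69724, -1), (139856, 140650, +1)].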

> It could make sense to expose this functionality directly in
> Biopython, ...

Given this code is non-trivial to implement, this seems worth doing.

Peter


From bugzilla-daemon at portal.open-bio.org  Wed May  6 22:50:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 6 May 2009 18:50:08 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905062250.n46Mo8EM023616@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


------- Comment #9 from eric.talevich at gmail.com  2009-05-06 18:50 EST -------
Created an attachment (id=1293)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1293&action=view)
Additional warnings test for Py2.6+

This is the file that test_PDB_unit.py can import to plug in an additional test
for specific warnings.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Wed May  6 22:54:06 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 6 May 2009 18:54:06 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905062254.n46Ms6YP023831@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


------- Comment #10 from eric.talevich at gmail.com  2009-05-06 18:54 EST -------
Created an attachment (id=1294)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1294&action=view)
test_PDB_unit.py, with conditional import

This is a modified test_PDB_unit.py that checks whether the necessary context
manager is available (it will be for Py2.6+), and if so, imports the additional
unit test from _PDB_extra.py into the current class.
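
The conditional check described above might look roughly like the
following sketch.  It assumes the context manager in question is
warnings.catch_warnings (added in Python 2.6); the comment does not
name it.  The function noisy_parse and its warning text are invented
for the example.

```python
import warnings


def noisy_parse():
    """Hypothetical stand-in for a parser call that emits a warning."""
    warnings.warn("illustrative warning from a parser", UserWarning)
    return "parsed"


if hasattr(warnings, "catch_warnings"):
    # Record warnings instead of printing them, so a test can inspect them.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = noisy_parse()
    messages = [str(w.message) for w in caught]
else:
    # Older Python: skip the warnings check entirely.
    messages = None

print(messages)
```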

(Sorry it's a whole file, I was having trouble diffing between git branches.)




From bugzilla-daemon at portal.open-bio.org  Thu May  7 08:51:35 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 04:51:35 -0400
Subject: [Biopython-dev] [Bug 2824] New: Bio.Entrez.epost is using an HTTP
	GET not an HTTP POST
Message-ID: <bug-2824-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2824

           Summary: Bio.Entrez.epost is using an HTTP GET not an HTTP POST
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


Following from a query on our mailing list suggesting Bio.Entrez.epost is
failing with long ID lists, I looked a little more closely at the code and it
is actually using an HTTP GET instead of an HTTP POST (which would avoid the
long URL problem).

See:
http://lists.open-bio.org/pipermail/biopython/2009-May/005149.html

We can still use urllib to do this with its data argument...
http://docs.python.org/library/urllib.html
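
A minimal sketch of the difference, written against Python 3's
urllib.request rather than the Python 2 urllib linked above.  It only
builds the request objects and does not contact NCBI.  The EPost URL
and the helper build_requests are assumptions for illustration, not
the code in Bio/Entrez/__init__.py.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed E-utilities endpoint, used here only to build request objects.
EPOST_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"


def build_requests(db, ids):
    """Return (get_request, post_request) for the same parameters."""
    params = urlencode({"db": db, "id": ",".join(str(i) for i in ids)})
    # GET: the parameters become part of the URL, which can get very long.
    get_request = Request(EPOST_URL + "?" + params)
    # POST: the parameters travel in the request body via the data argument.
    post_request = Request(EPOST_URL, data=params.encode("ascii"))
    return get_request, post_request


get_request, post_request = build_requests("pubmed", range(1, 10000))
print(get_request.get_method(), len(get_request.full_url))
print(post_request.get_method(), len(post_request.full_url))
```

Supplying data is what switches the request to POST, so the URL
itself stays short however many IDs are sent.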




From bugzilla-daemon at portal.open-bio.org  Thu May  7 09:18:58 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 05:18:58 -0400
Subject: [Biopython-dev] [Bug 2824] Bio.Entrez.epost is using an HTTP GET
	not an HTTP POST
In-Reply-To: <bug-2824-42@http.bugzilla.open-bio.org/>
Message-ID: <200905070918.n479IwHQ031195@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2824


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-07 05:18 EST -------
Created an attachment (id=1295)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1295&action=view)
Patch for Bio/Entrez/__init__.py

This patch does two things,
(1) Makes Bio.Entrez.epost do an HTTP POST
(2) Catches the too long URL error 414 messages and raises an IOError

Without the patch:

>>> print Entrez.epost("pubmed", id=",".join(str(i) for i in range(1,10000))).read()
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>414 Request-URI Too Large</title>
</head><body>
<h1>Request-URI Too Large</h1>
<p>The requested URL's length exceeds the capacity
limit for this server.<br />
</p>
</body></html>

>>> print Entrez.efetch("pubmed", id=",".join(str(i) for i in range(1,10000)), retmode="text", rettype="uilist").read()
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>414 Request-URI Too Large</title>
</head><body>
<h1>Request-URI Too Large</h1>
<p>The requested URL's length exceeds the capacity
limit for this server.<br />
</p>
</body></html>

Note both the above trigger the Error 414 message, but it does not get caught.

With the patch:

>>> print Entrez.epost("pubmed", id=",".join(str(i) for i in range(1,10000))).read()
<?xml version="1.0"?>
<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN"
"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
<ePostResult>
        <QueryKey>1</QueryKey>
        <WebEnv>NCID_01_264798363_130.14.18.47_9001_1241687667</WebEnv>
</ePostResult>

>>> print Entrez.efetch("pubmed", id=",".join(str(i) for i in range(1,10000)), retmode="text", rettype="uilist").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Entrez/__init__.py", line 126, in efetch
    return _open(cgi, variables)
  File "Bio/Entrez/__init__.py", line 370, in _open
    raise IOError("Requested URL too long (try using EPost?)")
IOError: Requested URL too long (try using EPost?)

Now epost works with long arguments, and using the other tools with too long a
URL will trigger an IOError.
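
For readers who want the gist of the second change, here is a sketch in
Python 3.  It is not the attached patch: open_checked and fake_opener
are hypothetical names, and it assumes the opener raises
urllib.error.HTTPError for a 414 response, as Python 3's
urllib.request.urlopen does.

```python
from urllib.error import HTTPError


def open_checked(opener, url):
    """Call opener(url), turning an HTTP 414 into a clearer IOError.

    opener is any callable that behaves like urllib.request.urlopen.
    """
    try:
        return opener(url)
    except HTTPError as err:
        if err.code == 414:
            raise IOError("Requested URL too long (try using EPost?)") from err
        raise  # leave every other HTTP error untouched


def fake_opener(url):
    """Stand-in that always fails with 414, so no network is needed."""
    raise HTTPError(url, 414, "Request-URI Too Large", None, None)


try:
    open_checked(fake_opener, "http://example.invalid/very-long-url")
    message = None
except IOError as err:
    message = str(err)

print(message)
```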




From bugzilla-daemon at portal.open-bio.org  Thu May  7 10:20:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 06:20:10 -0400
Subject: [Biopython-dev] [Bug 2824] Bio.Entrez.epost is using an HTTP GET
	not an HTTP POST
In-Reply-To: <bug-2824-42@http.bugzilla.open-bio.org/>
Message-ID: <200905071020.n47AKAGD002826@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2824


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-07 06:20 EST -------
Patch checked in (OK'd with Michiel), marking as fixed.




From bugzilla-daemon at portal.open-bio.org  Thu May  7 13:56:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 09:56:09 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905071356.n47Du9iQ018532@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #24 from cymon.cox at gmail.com  2009-05-07 09:56 EST -------
(In reply to comment #23)
> In Prank, should realbranches take no arguments?  i.e. use the new _Switch
> class?

Yes, verified and done; pushed to applic-int branch.
C.




From bugzilla-daemon at portal.open-bio.org  Thu May  7 14:07:23 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 10:07:23 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905071407.n47E7Nn7019531@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #25 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-07 10:07 EST -------
(In reply to comment #24)
> (In reply to comment #23)
> > In Prank, should realbranches take no arguments?  i.e. use the new _Switch
> > class?
> 
> Yes, verified and done; pushed to applic-int branch.
> C.

Thanks for checking - that's done in CVS now.

I think the final bit of new code is _Dialign.py which still needs to be
updated for the new style __init__ method.  

Then there are your unit tests...




From bugzilla-daemon at portal.open-bio.org  Thu May  7 14:39:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 10:39:40 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905071439.n47Edeaj022126@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #26 from cymon.cox at gmail.com  2009-05-07 10:39 EST -------
(In reply to comment #25)
> (In reply to comment #24)
> > (In reply to comment #23)
> > > In Prank, should realbranches take no arguments?  i.e. use the new _Switch
> > > class?
> > 
> > Yes, verified and done; pushed to applic-int branch.
> > C.
> 
> Thanks for checking - that's done in CVS now.
> 
> I think the final bit of new code is _Dialign.py which still needs to be
> updated for the new style __init__ method.

Done - pushed to applic-int (Note windows path stuff absent from _Dialign)

> Then there are your unit tests...

As they are at present, unittests for Muscle, Mafft, Dialign and Prank all
pass. They could of course be made arbitrarily more complex... they should
probably have at least one test that uses the properties style parameter
setting rather than just set_parameter()
C.




From bugzilla-daemon at portal.open-bio.org  Thu May  7 15:22:35 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 May 2009 11:22:35 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905071522.n47FMZ16025500@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #27 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-07 11:22 EST -------
(In reply to comment #26)
> > I think the final bit of new code is _Dialign.py which still needs to be
> > updated for the new style __init__ method.
> 
> Done - pushed to applic-int (Note windows path stuff absent from _Dialign)
> 

OK, that is in CVS now.

> > Then there are your unit tests...
> 
> As they are at present, unittests for Muscle, Mafft, Dialign and Prank all
> pass. They could of course be made arbitrarily more complex... they should
> probably have at least one test that uses the properties style parameter
> setting rather than just set_parameter()
> C.

I've added test_Dialign_tool.py to CVS, and then switched a few to using
keyword arguments and properties.  As far as I can see from here, the tool
isn't expected to work on Windows (although it might still be possible with
cygwin):
http://bibiserv.techfak.uni-bielefeld.de/download/tools/DIALIGN_221.html

Is that everything?  You'd mentioned a more general test which just builds the
strings, but doesn't actually need to run any of the tools themselves.
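
A string-building check of that kind could look something like the
sketch below.  ToyCommandline, the program name toyalign and the
option syntax are all hypothetical; this is not the Bio.Application
code, just an illustration of setting the same options by keyword
argument, by property and by set_parameter(), then comparing the
resulting command line strings without running any tool.

```python
class ToyCommandline:
    """Hypothetical wrapper for illustration; not a Bio.Application class."""

    def __init__(self, cmd="toyalign", **kwargs):
        self.program = cmd
        self._options = {}  # option name -> value (True means a bare switch)
        for name, value in kwargs.items():
            self.set_parameter(name, value)

    def set_parameter(self, name, value):
        self._options[name] = value

    @property
    def infile(self):
        return self._options.get("infile")

    @infile.setter
    def infile(self, value):
        self.set_parameter("infile", value)

    def __str__(self):
        parts = [self.program]
        for name in sorted(self._options):
            value = self._options[name]
            if value is True:
                parts.append("-" + name)  # switch: takes no argument
            elif value is not False and value is not None:
                parts.append("-%s=%s" % (name, value))
        return " ".join(parts)


# Three ways of setting the same options.
by_keyword = ToyCommandline(infile="in.fasta", realbranches=True)

by_property = ToyCommandline(realbranches=True)
by_property.infile = "in.fasta"

by_method = ToyCommandline()
by_method.set_parameter("infile", "in.fasta")
by_method.set_parameter("realbranches", True)

print(str(by_keyword))
```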




From bugzilla-daemon at portal.open-bio.org  Fri May  8 12:07:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 08:07:03 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905081207.n48C73cT012732@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #28 from cymon.cox at gmail.com  2009-05-08 08:07 EST -------
(In reply to comment #27)
> (In reply to comment #26)
> > > I think the final bit of new code is _Dialign.py which still needs to be
> > > updated for the new style __init__ method.
> > 
> > Done - pushed to applic-int (Note windows path stuff absent from _Dialign)
> > 
> 
> OK, that is in CVS now.
> 
> > > Then there are your unit tests...
> > 
> > As they are at present, unittests for Muscle, Mafft, Dialign and Prank all
> > pass. They could of course be made arbitrarily more complex... they should
> > probably have at least one test that uses the properties style parameter
> > setting rather than just set_parameter()
> > C.
> 
> I've added test_Dialign_tool.py to CVS, and then switched a few to using
> keyword arguments and properties.  As far as I can see from here, the tool
> isn't expected to work on Windows (although it might still be possible with
> cygwin):
> http://bibiserv.techfak.uni-bielefeld.de/download/tools/DIALIGN_221.html
> 
> Is that everything?

That's everything currently written. I still want to add interfaces to ProbCons
and T-Coffee.

> You'd mentioned a more general test which just builds the
> strings, but doesn't actually need to run any of the tools themselves.

Yes, I'll do that.
C.




From bugzilla-daemon at portal.open-bio.org  Fri May  8 12:23:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 08:23:03 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905081223.n48CN3nV013977@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #29 from chapmanb at 50mail.com  2009-05-08 08:23 EST -------
Created an attachment (id=1296)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1296&action=view)
Start of TCoffee command line

Cymon;
Here is the start of a TCoffee command line object. It's not up to date with
the latest changes y'all have been making and doesn't have all the options, but
should save some typing.

Brad




From bugzilla-daemon at portal.open-bio.org  Fri May  8 19:14:27 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 15:14:27 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905081914.n48JERYx012798@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


eric.talevich at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1293 is|0                           |1
           obsolete|                            |
Attachment #1294 is|0                           |1
           obsolete|                            |


------- Comment #11 from eric.talevich at gmail.com  2009-05-08 15:14 EST -------
Created an attachment (id=1297)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1297&action=view)
Py2.6-only unit test of PDB warnings

I pushed a branch called bug2820 to github containing just this commit, if
that's easier:

http://github.com/etal/biopython/tree/bug2820

Any suggestions for naming the new file?




From bugzilla-daemon at portal.open-bio.org  Fri May  8 21:45:53 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 17:45:53 -0400
Subject: [Biopython-dev] [Bug 2817] Meta-bug for cleanup once we drop Python
	2.3 support
In-Reply-To: <bug-2817-42@http.bugzilla.open-bio.org/>
Message-ID: <200905082145.n48Ljr4L023802@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2817


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-08 17:45 EST -------
I've started removing support for Python 2.3 in CVS, including removing all the
sets and subprocess special case code.




From bugzilla-daemon at portal.open-bio.org  Fri May  8 22:14:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 18:14:36 -0400
Subject: [Biopython-dev] [Bug 2825] New: SeqIO does not successfully parse
	Genbank records related to whole genome sequencing deposits,
	as Did not recognise the LOCUS line layout
Message-ID: <bug-2825-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2825

           Summary: SeqIO does not successfully parse Genbank records
                    related to whole genome sequencing deposits, as Did not
                    recognise the LOCUS line layout
           Product: Biopython
           Version: 1.49
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: david.wyllie at ndm.ox.ac.uk


Hi

I'm using the BioPython distribution 1.49 obtained as a Package using the
Ubuntu 9 synaptic package manager.  The below describes the problem:

NCBI has a record type which describes the contents of whole-genome sequencing
projects.  The record doesn't itself contain sequence, in contrast to most
GenBank records.

This URL gives an example:
http://www.ncbi.nlm.nih.gov/nuccore/162285818
Should the SeqIO parser be able to read this?  It cannot.  Here is an example:

# import modules
from Bio import Entrez
from Bio import SeqIO

# read the record from NCBI, print out the contents.
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
masterrecord=handle.readlines()
for line in masterrecord:
        print line
handle.close()

# let's read it again, and try to parse it with SeqIO.
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")

# this line causes the crash
seq_record = SeqIO.read(handle, "genbank")

handle.close()

# fails.  the traceback reads
"""
Traceback (most recent call last):
  File "bugreport.py", line 25, in <module>
    seq_record = SeqIO.read(handle, "genbank")
  File "/var/lib/python-support/python2.6/Bio/SeqIO/__init__.py", line 435, in read
    first = iterator.next()
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 360, in feed
    self._feed_first_line(consumer, self.line)
  File "/var/lib/python-support/python2.6/Bio/GenBank/Scanner.py", line 907, in _feed_first_line
    raise ValueError('Did not recognise the LOCUS line layout:\n' + line)
ValueError: Did not recognise the LOCUS line layout:
LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
"""

# by contrast, reading one of the constituent genbank records, like this one
# http://www.ncbi.nlm.nih.gov/nuccore/162285817
# works correctly;

handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285817")
seq_record = SeqIO.read(handle, "genbank")
handle.close()
print "Successfully loaded record GI=162285817"
print seq_record.description




From bugzilla-daemon at portal.open-bio.org  Fri May  8 22:37:47 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 18:37:47 -0400
Subject: [Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS)
	Genbank records
In-Reply-To: <bug-2825-42@http.bugzilla.open-bio.org/>
Message-ID: <200905082237.n48MbleU027475@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2825


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
            Summary|SeqIO does not successfully |Parsing whole genome
                   |parse Genbank records       |sequencing (WGS) Genbank
                   |related to whole genome     |records
                   |sequencing deposits, as Did |
                   |not recognise the LOCUS line|
                   |layout                      |


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-08 18:37 EST -------
Hi David,

This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment.  For
the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
nucleotides.  Here you have "353 rc" (rc for record count), which as our error
message says, is unexpected.  At the end of the record, there are also WGS
and/or WGS_SCAFLD lines to worry about:

http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html
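
To show where the "rc" unit sits, here is a small sketch in Python 3
that pulls the name, count and unit out of a LOCUS line.  It splits on
whitespace purely for brevity, which is an assumption; it is not how
the Bio.GenBank scanner reads the line, and split_locus_line and
KNOWN_UNITS are made-up names.

```python
# Hypothetical helper for illustration; not the Bio.GenBank scanner.
KNOWN_UNITS = {
    "bp": "nucleotide length",
    "aa": "protein length",
    "rc": "record count (WGS master record)",
}


def split_locus_line(line):
    """Return (name, count, unit) from a GenBank LOCUS line."""
    fields = line.split()
    if len(fields) < 4 or fields[0] != "LOCUS":
        raise ValueError("Not a LOCUS line:\n" + line)
    return fields[1], int(fields[2]), fields[3]


line = ("LOCUS       ABIN01000000             353 rc    DNA     "
        "linear   BCT 10-DEC-2007")
name, count, unit = split_locus_line(line)
print(name, count, unit, KNOWN_UNITS.get(unit, "unknown unit"))
```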

Given these WGS files have no sequence, and no real sequence associated
features either, it strikes me that supporting this in Bio.SeqIO is a stretch
(these records are not really sequences, nor are they about a sequence).

However, Bio.GenBank should perhaps be updated to cope... so I'll leave this
bug open for that as a possible enhancement.  Note I have changed the bug title
from "SeqIO does not successfully parse Genbank records related to whole genome
sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing
whole genome sequencing (WGS) Genbank records", and changed the bug severity to
enhancement.

What information do you want from this file?  In the meantime, I suggest you
fetch the record as XML, which you can parse using Bio.Entrez.read() or your
XML parser of choice.

Peter

P.S. This is a shorter way to dump the file to screen in python:

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
>>> print handle.read()
LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
DEFINITION  Mycobacterium intracellulare ATCC 13950, whole genome shotgun
            sequencing project.
ACCESSION   ABIN00000000
VERSION     ABIN00000000.1  GI:162285818
DBLINK      Project:27955
KEYWORDS    WGS.
SOURCE      Mycobacterium intracellulare ATCC 13950
  ORGANISM  Mycobacterium intracellulare ATCC 13950
            Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
            Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium
            avium complex (MAC).
REFERENCE   1  (bases 1 to 353)
  AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
            Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
  TITLE     Mycobacterium intracellulare Genome Project
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 353)
  AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
            Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-NOV-2007) McGill University and Genome Quebec
            Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec
            H3A 1A4, Canada
COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
            (WGS) project has the project accession ABIN00000000.  This version
            of the project (01) has the accession number ABIN01000000, and
            consists of sequences ABIN01000001-ABIN01000353.
            The whole genome shotgun sequence was generated by the McGill
            University and Genome Quebec Innovation Centre using the GS De Novo
            Assembler from GS-FLX reads.  This strain is available from the
            American Type Culture Collection (www.atcc.org).
FEATURES             Location/Qualifiers
     source          1..353
                     /organism="Mycobacterium intracellulare ATCC 13950"
                     /mol_type="genomic DNA"
                     /strain="ATCC 13950"
                     /serovar="16"
                     /isolation_source="human lymph node"
                     /db_xref="taxon:487521"
                     /note="type strain of Mycobacterium intracellulare ATCC
                     13950
                     associated with disease"
WGS         ABIN01000001-ABIN01000353
//




From bugzilla-daemon at portal.open-bio.org  Fri May  8 23:12:43 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 May 2009 19:12:43 -0400
Subject: [Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS)
	Genbank records
In-Reply-To: <bug-2825-42@http.bugzilla.open-bio.org/>
Message-ID: <200905082312.n48NChKL030485@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2825


------- Comment #2 from david.wyllie at ndm.ox.ac.uk  2009-05-08 19:12 EST -------
Thank you for your help.  
I just wanted to extract the WGS line, which I'm able to do.
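
For anyone else needing the same thing, pulling the WGS line out of the
flat-file text can be sketched as follows (Python 3).  The function
wgs_range is a hypothetical helper, and the sample text is a
shortened copy of the record quoted below; it assumes a single WGS
line of the form "first-last".

```python
def wgs_range(genbank_text):
    """Return (first, last) accessions from the WGS line, or None."""
    for line in genbank_text.splitlines():
        # "WGS " with a trailing space skips WGS_SCAFLD and "KEYWORDS    WGS."
        if line.startswith("WGS "):
            value = line[len("WGS"):].strip()
            first, _, last = value.partition("-")
            return first, (last or first)
    return None


sample = """\
LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
KEYWORDS    WGS.
WGS         ABIN01000001-ABIN01000353
//
"""

accessions = wgs_range(sample)
print(accessions)
```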


(In reply to comment #1)
> Hi David,
> 
> This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment.  For
> the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
> nucleotides.  Here you have "353 rc" (rc for record count), which as our error
> message says, is unexpected.  At the end of the record, there are also WGS
> and/or WGS_SCAFLD lines to worry about:
> 
> http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
> http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html
> 
> Given these WGS files have no sequence, and no real sequence associated
> features either, it strikes me that supporting this in Bio.SeqIO is a stretch
> (these records are not really sequences, nor are they about a sequence).
> 
> However, Bio.GenBank should perhaps be updated to cope... so I'll leave this
> bug open for that as a possible enhancement.  Note I have changed the bug title
> from "SeqIO does not successfully parse Genbank records related to whole genome
> sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing
> whole genome sequencing (WGS) Genbank records", and changed the bug severity to
> enhancement.
> 
> What information do you want from this file?  In the meantime, I suggest you
> fetch the record as XML, which you can parse using Bio.Entrez.read() or your
> XML parser of choice.
> 
> Peter
> 
> P.S. This is a shorter way to dump the file to screen in python:
> 
> >>> from Bio import Entrez
> >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
> >>> print handle.read()
> LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
> DEFINITION  Mycobacterium intracellulare ATCC 13950, whole genome shotgun
>             sequencing project.
> ACCESSION   ABIN00000000
> VERSION     ABIN00000000.1  GI:162285818
> DBLINK      Project:27955
> KEYWORDS    WGS.
> SOURCE      Mycobacterium intracellulare ATCC 13950
>   ORGANISM  Mycobacterium intracellulare ATCC 13950
>             Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
>             Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium
>             avium complex (MAC).
> REFERENCE   1  (bases 1 to 353)
>   AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
>             Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
>   TITLE     Mycobacterium intracellulare Genome Project
>   JOURNAL   Unpublished
> REFERENCE   2  (bases 1 to 353)
>   AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
>             Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
>   TITLE     Direct Submission
>   JOURNAL   Submitted (30-NOV-2007) McGill University and Genome Quebec
>             Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec
>             H3A 1A4, Canada
> COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
>             (WGS) project has the project accession ABIN00000000.  This version
>             of the project (01) has the accession number ABIN01000000, and
>             consists of sequences ABIN01000001-ABIN01000353.
>             The whole genome shotgun sequence was generated by the McGill
>             University and Genome Quebec Innovation Centre using the GS De Novo
>             Assembler from GS-FLX reads.  This strain is available from the
>             American Type Culture Collection (www.atcc.org).
> FEATURES             Location/Qualifiers
>      source          1..353
>                      /organism="Mycobacterium intracellulare ATCC 13950"
>                      /mol_type="genomic DNA"
>                      /strain="ATCC 13950"
>                      /serovar="16"
>                      /isolation_source="human lymph node"
>                      /db_xref="taxon:487521"
>                      /note="type strain of Mycobacterium intracellulare ATCC
>                      13950
>                      associated with disease"
> WGS         ABIN01000001-ABIN01000353
> //
> 

(In reply to comment #1)
> Hi David,
> 
> This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment.  For
> the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
> nucleotides.  Here you have "353 rc" (rc for record count), which as our error
> message says, is unexpected.  At the end of the record, there are also WGS
> and/or WGS_SCAFLD lines to worry about:
> 
> http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
> http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html
> 
> Given these WGS files have no sequence, and no real sequence associated
> features either, it strikes me that supporting this in Bio.SeqIO is a stretch
> (these records are not really sequences, nor are they about a sequence).
> 
> However, Bio.GenBank should perhaps be updated to cope... so I'll leave this
> bug open for that as a possible enhancement.  Note I have changed the bug title
> from "SeqIO does not successfully parse Genbank records related to whole genome
> sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing
> whole genome sequencing (WGS) Genbank records", and changed the bug priority to
> an enhancement.
> 
> What information do you want from this file?  In the meantime, I suggest you
> fetch the record as XML, which you can parse using Bio.Entrez.read() or your
> XML parser of choice.
> 
> Peter
> 
> P.S. This is a shorter way to dump the file to screen in python:
> 
> >>> from Bio import Entrez
> >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
> >>> print handle.read()
> LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
> DEFINITION  Mycobacterium intracellulare ATCC 13950, whole genome shotgun
>             sequencing project.
> ACCESSION   ABIN00000000
> VERSION     ABIN00000000.1  GI:162285818
> DBLINK      Project:27955
> KEYWORDS    WGS.
> SOURCE      Mycobacterium intracellulare ATCC 13950
>   ORGANISM  Mycobacterium intracellulare ATCC 13950
>             Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
>             Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium
>             avium complex (MAC).
> REFERENCE   1  (bases 1 to 353)
>   AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
>             Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
>   TITLE     Mycobacterium intracellulare Genome Project
>   JOURNAL   Unpublished
> REFERENCE   2  (bases 1 to 353)
>   AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
>             Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
>   TITLE     Direct Submission
>   JOURNAL   Submitted (30-NOV-2007) McGill University and Genome Quebec
>             Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec
>             H3A 1A4, Canada
> COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
>             (WGS) project has the project accession ABIN00000000.  This version
>             of the project (01) has the accession number ABIN01000000, and
>             consists of sequences ABIN01000001-ABIN01000353.
>             The whole genome shotgun sequence was generated by the McGill
>             University and Genome Quebec Innovation Centre using the GS De Novo
>             Assembler from GS-FLX reads.  This strain is available from the
>             American Type Culture Collection (www.atcc.org).
> FEATURES             Location/Qualifiers
>      source          1..353
>                      /organism="Mycobacterium intracellulare ATCC 13950"
>                      /mol_type="genomic DNA"
>                      /strain="ATCC 13950"
>                      /serovar="16"
>                      /isolation_source="human lymph node"
>                      /db_xref="taxon:487521"
>                      /note="type strain of Mycobacterium intracellulare ATCC
>                      13950
>                      associated with disease"
> WGS         ABIN01000001-ABIN01000353
> //
> 


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Sat May  9 11:59:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 9 May 2009 07:59:32 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905091159.n49BxWpM015484@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #30 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-09 07:59 EST -------
I've got test_Mafft_tool.py working on one Linux machine using MAFFT v6.626b
(2009/03/16) installed from source.

However, test_Mafft_tool.py fails on another Linux machine using MAFFT v6.240
(2007/04/04) installed using the distribution's package, in this case Ubuntu
Jaunty:
http://packages.ubuntu.com/jaunty/mafft

Note that the next version of Ubuntu currently also uses the same old package:
http://packages.ubuntu.com/karmic/mafft

As does Debian unstable:
http://packages.debian.org/unstable/science/mafft

From trying mafft v6.240 by hand at the command line, it never seems to
actually print anything to the console.  Either the MAFFT API changed (which
doesn't seem to be the case), or the version Ubuntu installed on this machine
is broken.  This could be due to something else like the version of awk or gcc
(guesses based on the MAFFT change log):
http://align.bmr.kyushu-u.ac.jp/mafft/software/

Note that the latest version is now MAFFT 6.704, so we should try that too. If
I am right about the current Ubuntu/Debian package being broken, we should get
in touch with them about updating it... otherwise we can look forward to bug
reports about our wrapper and/or test_Mafft_tool.py failing.




From bugzilla-daemon at portal.open-bio.org  Sat May  9 12:31:55 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 9 May 2009 08:31:55 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905091231.n49CVtUj017919@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-09 08:31 EST -------
(In reply to comment #8)
> I have something that works on both Py2.5 and Py2.6 now:
> http://github.com/etal/biopython/tree/pdbtidy

Would it be easy for you to test your code on Python 2.4?  I can probably do
that but not right now...

I would prefer to avoid the extra file by writing this test as part of
test_PDB_unit.py - but the "with" statement isn't valid syntax on Python 2.4,
although it can be used on Python 2.5 via:
from __future__ import with_statement

Could you re-write this to avoid the with statement?

> Also, apparently tests are run in alphabetical order, ...

Yes, that is expected.

> ... and Exposure was jumping ahead of PDBExceptionTest. I renamed
> PDBExceptionTest to ExceptionTest to restore the natural order of
> things and stop setting off the warnings prematurely. Maybe test
> suites with multiple TestCase classes should be arranged alphabetically
> in the code to avoid confusion in the future.

Ideally the unit tests should work in any order - and this is generally a
reasonable assumption, as they should be independent.  Having some carefully
named unit tests will only hide the ordering problem (which is due to the 
global state information in the warnings module).  At the very least, we should
probably have comments in the code about this (to avoid issues in the future)
and maybe use an eye-catching name like AAAAA which should always come first.
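
One way to remove the ordering dependence, sketched under the assumption
that the shared state in question is the filter list in the warnings module:
have every test save and restore that state itself.  The class and test names
below are made up for illustration, and warnings.catch_warnings only exists
from Python 2.6 onwards, so this is not a fix for Python 2.4.

```python
import unittest
import warnings


class WarningStateTest(unittest.TestCase):
    """Hypothetical test case that isolates the global warnings state."""

    def setUp(self):
        # Save the current filters and start recording warnings in a list.
        self._catcher = warnings.catch_warnings(record=True)
        self.caught = self._catcher.__enter__()
        # Report every warning, even one already issued by an earlier test.
        warnings.simplefilter("always")

    def tearDown(self):
        # Restore the filters exactly as they were before this test ran.
        self._catcher.__exit__(None, None, None)

    def test_a_first_warning(self):
        warnings.warn("problem", UserWarning)
        self.assertEqual(len(self.caught), 1)

    def test_b_same_warning_again(self):
        # Still sees exactly one warning, whichever test ran first.
        warnings.warn("problem", UserWarning)
        self.assertEqual(len(self.caught), 1)
```

With the state reset in setUp/tearDown, neither test relies on being the
first one to trigger the warning, so alphabetical naming tricks are not
needed.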




From biopython at maubp.freeserve.co.uk  Sat May  9 13:06:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 9 May 2009 14:06:15 +0100
Subject: [Biopython-dev] PhyloXML read/parse functions and handles
Message-ID: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com>

Hi Eric,

Are you happy to have feedback on your PhyloXML code in public?  In
this case I wanted to make a fairly general observation about parsing
files using handles, so I have cc'd the dev list.

I just had a look at the stub in Bio/PhyloXML/__init__.py and
Bio/PhyloXML/Parser.py on your github branch,
http://github.com/etal/biopython/tree/phyloxml

The convention we are following in Biopython for parsing functions is
as follows:
read(handle, ...) - returns a single object (e.g. a tree in your case)
parse(handle, ...) - returns an iterator (e.g. returning multiple trees)

[This naming convention is arbitrary, but we should try to stick to it
in all new parsers for consistency.]

In Bio/PhyloXML/Parser.py you have a parse() sub function which
according to the comment appears to return a single tree.  If so, this
should be a read() function instead of a parse() function.

You seem to have a read() stub function in Bio/PhyloXML/__init__.py
which returns a single tree (good), but takes a (zip) filename (not a
handle - bad). Taking just a filename prevents using a whole range of
handle objects as input - e.g. StringIO handles, URL handles, piped
output from a command line tool etc.  This flexibility is why we focus
on dealing with handles for parsers.
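
To make the convention concrete, here is a minimal sketch using a toy
one-record-per-line format in place of real trees.  The function bodies are
illustrative assumptions, not the PhyloXML code; raising ValueError when the
handle does not hold exactly one record is one reasonable choice.

```python
from io import StringIO


def parse(handle):
    """Iterate over records in a handle (toy format: one record per line)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def read(handle):
    """Return the single record in the handle, or raise ValueError."""
    records = parse(handle)
    first = next(records, None)
    if first is None:
        raise ValueError("No records found in handle")
    if next(records, None) is not None:
        raise ValueError("More than one record found in handle")
    return first


# Any file-like object will do: an open file, a StringIO, a pipe, ...
single = read(StringIO("tree-1\n"))
several = list(parse(StringIO("tree-1\ntree-2\n")))
```

Writing read() on top of parse() keeps the two consistent, and because both
only iterate over the handle they never need to know where the data came
from.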

On a related point, you should leave unzipping the file to the user -
this is not specific to dealing with XML tree files.  Plus, in
addition to zip files (i.e. pkzip/winzip format), there are other
compressed file formats to consider, such as tarballs.  They too can be
opened and decompressed on the fly as a handle (e.g. see the gzip python
library).  By taking a handle as the input your parser can then be
used with any of these input sources.
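
As a small illustration of the point, the sketch below writes a gzipped XML
file and then hands the decompressing handle straight to a parser.  The
function read_root_tag is a hypothetical stand-in for a handle-based parser,
and the file content is invented for the demonstration.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def read_root_tag(handle):
    """Hypothetical handle-based parser: return the XML root element name."""
    return ET.parse(handle).getroot().tag


# Create a small gzipped XML file to stand in for a compressed tree file.
path = os.path.join(tempfile.mkdtemp(), "example.xml.gz")
with gzip.open(path, "wb") as out:
    out.write(b"<phyloxml><phylogeny/></phyloxml>")

# The user decompresses on the fly; the parser only ever sees a handle.
with gzip.open(path, "rb") as handle:
    tag = read_root_tag(handle)
```

The parser needs no gzip-specific code at all, which is the argument for
leaving decompression to the caller.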

Peter

P.S. Finally, a more general note about a possible "Bio.TreeIO"
module. For simple Newick trees, a single file can contain one or more
trees (e.g. from bootstrapping).  A tree can be split over multiple
lines (but may be one long line), but multiple trees can be split up
because they should all have a semicolon terminator.  For Nexus files,
I'm not sure off hand if there can be more than one tree.  If you are
going to use the Tree objects from Bio.Nexus, then we could provide a
"Bio.TreeIO" module with read/parse/write methods coping with
"newick", "nexus", "phyloxml" formats, all using the same tree
objects.
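
A rough sketch of the semicolon-based splitting described above, returning
each tree as a plain string.  This is an assumption about how a parse()
function might start, not existing Bio.Nexus code, and it deliberately
ignores quoted labels or comments that could themselves contain a semicolon.

```python
from io import StringIO


def split_newick(handle):
    """Yield one Newick string per tree, using ';' as the terminator."""
    buffer = ""
    for line in handle:
        # Trees may span several lines, so accumulate text across lines.
        buffer += line.strip()
        # A single line may also hold more than one complete tree.
        while ";" in buffer:
            tree, buffer = buffer.split(";", 1)
            if tree:
                yield tree + ";"


handle = StringIO("(A,B,\n(C,D));\n((A,B),C,D);(A,(B,C),D);\n")
trees = list(split_newick(handle))
```

Because it is a generator over a handle, the same function would serve both
a parse() iterator and a read() wrapper that insists on exactly one tree.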


From bugzilla-daemon at portal.open-bio.org  Sat May  9 16:40:27 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 9 May 2009 12:40:27 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905091640.n49GeRvY002521@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #31 from cymon.cox at gmail.com  2009-05-09 12:40 EST -------
(In reply to comment #30)
> I've got test_Mafft_tool.py working on one Linux machine using MAFFT v6.626b
> (2009/03/16) installed from source.

That was my reference installation when writing the command line tool (on
Jaunty/RHE 5.3).

> However, test_Mafft_tool.py fails on another Linux machine using MAFFT v6.240
> (2007/04/04) installed using the distribution's package, in this case Ubuntu
> Jaunty:
> http://packages.ubuntu.com/jaunty/mafft
> 
> Note that the next version of Ubuntu currently also uses the same old package:
> http://packages.ubuntu.com/karmic/mafft
> 
> As does Debian unstable:
> http://packages.debian.org/unstable/science/mafft
> 
> From trying mafft v6.240 by hand at the command line, it never seems to
> actually print anything to the console.  Either the MAFFT API changed (which
> doesn't seem to be the case), or the version Ubuntu installed on this machine
> is broken.  This could be due to something else like the version of awk or gcc
> (guesses based on the MAFFT change log):
> http://align.bmr.kyushu-u.ac.jp/mafft/software/

Hadn't tried the Ubuntu package...

On the upside, the Muscle3.7 package installed from Ubuntu passes our tests,
whereas the source compiles but core-dumps. Similarly, ProbCons1.2 won't
compile but the Ubuntu package looks good (haven't written the tests yet).

> Note that the latest version is now MAFFT 6.704, so we should try that too. If
> I am right about the current Ubuntu/Debian package being broken, we should get
> in touch with them about updating it... otherwise we can look forward to bug
> reports about our wrapper and/or test_Mafft_tool.py failing.

Built from source on Jaunty; it passes our tests.
C.




From eric.talevich at gmail.com  Sun May 10 05:22:46 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 9 May 2009 22:22:46 -0700
Subject: [Biopython-dev] PhyloXML read/parse functions and handles
In-Reply-To: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com>
References: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com>
Message-ID: <3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com>

On Sat, May 9, 2009 at 6:06 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi Eric,
>
> Are you happy to have feedback on your PhyloXML code in public?


Sure am! I was just getting around to drafting up some questions for
biopython-dev, but I'm glad to receive some preemptive advice.

> I just had a look at the stub in Bio/PhyloXML/__init__.py and
> Bio/PhyloXML/Parser.py on your github branch,
> http://github.com/etal/biopython/tree/phyloxml
>
> The convention we are following in Biopython for parsing functions is
> as follows:
> read(handle, ...) - returns a single object (e.g. a tree in your case)
> parse(handle, ...) - returns an iterator (e.g. returning multiple trees)
>
>
I noticed that; I'll change the Bio.PhyloXML.Parser.parse() stub to read()
and have it behave as expected.

The function currently allows either filenames or file handles as the source
because ElementTree.iterparse() also accepts either object as a source. The
read() function could "assert not isinstance(infile, str)", I guess...

The existing Java implementation in Forester/ATV has even more magic,
automatically performing Zip extraction if the given filename ends with
'.zip'. Since this looks like it will be a pretty common use case, at least
for big files, I thought it would be nice to also offer a wrapper function
that takes a filename and does the Right Thing -- that's what
__init__.read() does currently. Is there a precedent for this in Biopython?
The name should probably be something different; in the pdbtidy branch I
used load(), to match the Pickle module, since the wrapper function does
more than just parse or read a file.

So how about:

from Bio import PhyloXML
handle = open('somefile', 'r') # file-like object from any source
tree = PhyloXML.read(handle)

Equivalent to:

from Bio import PhyloXML
tree = PhyloXML.load('somefile') # DTRT for xml, zip, gz, ...?

Or, to be explicit, offer a read_zip or load_zip function. I'd leave well
enough alone, but the incantation to extract a character stream from a
single zipped file is kind of unintuitive, and one of the three example
files on phyloxml.org is already zipped. (I should really ask Christian
Zmasek about this to see if that's a real convention or not.)

> P.S. Finally, a more general note about a possible "Bio.TreeIO"
> module. For simple Newick trees, a single file can contain one or more
> trees (e.g. from bootstrapping).  A tree can be split over multiple
> lines (but may be one long line), but multiple trees can be split up
> because they should all have a semicolon terminator.  For Nexus files,
> I'm not sure off hand if there can be more than one tree.  If you are
> going to use the Tree objects from Bio.Nexus, then we could provide a
> "Bio.TreeIO" module with read/parse/write methods coping with
> "newick", "nexus", "phyloxml" formats, all using the same tree
> objects.
>

OK, I'll give it a try. Brad recommended that I just get a simple PhyloXML
parser working first before attempting integration, but if some of Bio.Nexus
can be reused in that process, great. I'm about to go dark from the end of
this week until 3/31 (getting married, yaknow), but I'll fix all this code
when I get back and have access to git again.

Thanks for your help,
Eric


From biopython at maubp.freeserve.co.uk  Sun May 10 09:22:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 10 May 2009 10:22:21 +0100
Subject: [Biopython-dev] PhyloXML read/parse functions and handles
In-Reply-To: <3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com>
References: <320fb6e00905090606g7235d4a6t22914f3ff9c293be@mail.gmail.com>
	<3f6baf360905092222w305556c0kdbf94e3c336b5958@mail.gmail.com>
Message-ID: <320fb6e00905100222n22b7670dre26f9368726fce68@mail.gmail.com>

On Sun, May 10, 2009 at 6:22 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> The function currently allows either filenames or file handles as the source
> because ElementTree.iterparse() also accepts either object as a source. The
> read() function could "assert not isinstance(infile, str)", I guess...

Interesting - ReportLab also allows filenames or handles.  If this truly is a
widespread or growing trend in Python libraries, maybe we should do this
as well.
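
If we did go that way, the usual pattern looks something like the sketch
below.  The function count_lines is a hypothetical stand-in for a parser; the
important detail is that the function only closes handles it opened itself.

```python
import os
import tempfile
from io import StringIO


def count_lines(source):
    """Hypothetical parser accepting either a filename or an open handle."""
    if isinstance(source, str):
        handle = open(source)
        opened_here = True
    else:
        handle = source
        opened_here = False
    try:
        return sum(1 for _ in handle)
    finally:
        # Leave handles supplied by the caller open for further use.
        if opened_here:
            handle.close()


# Demonstration with both kinds of input.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as out:
    out.write("first\nsecond\nthird\n")

from_filename = count_lines(demo_path)
from_handle = count_lines(StringIO("first\nsecond\n"))
```

The cost is an isinstance check in every parser, which is part of why a
handle-only interface is simpler to keep consistent.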

> The existing Java implementation in Forester/ATV has even more magic,
> automatically performing Zip extraction if the given filename ends with
> '.zip'. Since this looks like it will be a pretty common use case, at least
> for big files, I thought it would be nice to also offer a wrapper function
> that takes a filename and does the Right Thing -- that's what
> __init__.read() does currently. Is there a precedent for this in Biopython?

Note that Bio.Nexus does this already, making it a bit inconsistent with the
rest of Biopython.  I guess no one noticed or commented back when it was
added.

> The name should probably be something different; in the pdbtidy branch I
> used load(), to match the Pickle module, since the wrapper function does
> more than just parse or read a file.
>
> So how about:
>
> from Bio import PhyloXML
> handle = open('somefile', 'r') # file-like object from any source
> tree = PhyloXML.read(handle)
>
> Equivalent to:
>
> from Bio import PhyloXML
> tree = PhyloXML.load('somefile') # DTRT for xml, zip, gz, ...?
>
> Or, to be explicit, offer a read_zip or load_zip function.

I prefer the more explicit read_zip idea; you would also want an optional
argument for the filename within the zip file.  However, I'm not yet
convinced we need this function.
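
For reference, here is roughly what such a helper involves, using only the
standard zipfile module.  The name open_zip_member and its behaviour are
assumptions for illustration; it refuses to guess when the archive holds
more than one file and no member name is given.

```python
import io
import os
import tempfile
import zipfile


def open_zip_member(zip_path, member=None):
    """Return a binary handle for one file inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        if member is None:
            if len(names) != 1:
                raise ValueError("Archive holds %i files; name one explicitly"
                                 % len(names))
            member = names[0]
        # Read the member into memory so the archive can be closed here.
        return io.BytesIO(archive.read(member))


# Demonstration with a one-file archive created on the spot.
demo_zip = os.path.join(tempfile.mkdtemp(), "example.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("tree.xml", "<phyloxml/>")

content = open_zip_member(demo_zip).read()
```

The returned handle can then be passed to a handle-based read() function.
Note this loads the whole member into memory, which may not suit very large
files.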

> I'd leave well enough alone, but the incantation to extract a character
> stream from a single zipped file is kind of unintuitive, and one of the
> three example files on phyloxml.org is already zipped. (I should really
> ask Christian Zmasek about this to see if that's a real convention or
> not.)

Do you want to find out if this really is a phyloxml.org convention first?

If this is their convention, it surprises me they didn't go for .gz files,
which in my experience are more widely used in Bioinformatics (e.g.
at the NCBI and PDB).  These are supported cross platform and hold
one single file (often a tarred file containing multiple files).  A zip file
can hold multiple files, which means you have to make extra
assumptions (e.g. you are using the first file in your code).

>> P.S. Finally, a more general note about a possible "Bio.TreeIO"
>> module. For simple Newick trees, a single file can contain one or more
>> trees (e.g. from bootstrapping).  A tree can be split over multiple
>> lines (but may be one long line), but multiple trees can be split up
>> because they should all have a semicolon terminator.  For Nexus files,
>> I'm not sure off hand if there can be more than one tree.  If you are
>> going to use the Tree objects from Bio.Nexus, then we could provide a
>> "Bio.TreeIO" module with read/parse/write methods coping with
>> "newick", "nexus", "phyloxml" formats, all using the same tree
>> objects.
>>
>
> OK, I'll give it a try. Brad recommended that I just get a simple PhyloXML
> parser working first before attempting integration, but if some of Bio.Nexus
> can be reused in that process, great.

Brad is right - getting a simple PhyloXML parser working is the first step.
It would be sensible to look at the Bio.Nexus tree structure though.

> I'm about to go dark from the end of this week until 3/31 (getting
> married, yaknow), but I'll fix all this code when I get back and have
> access to git again.

Congratulations - it looks like you've got a proper break scheduled as well :)

Peter


From bugzilla-daemon at portal.open-bio.org  Sun May 10 13:50:50 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 10 May 2009 09:50:50 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To: <bug-2820-42@http.bugzilla.open-bio.org/>
Message-ID: <200905101350.n4ADoo7x001186@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820


------- Comment #13 from eric.talevich at gmail.com  2009-05-10 09:50 EST -------
(In reply to comment #12)
> Would it be easy for you to test your code on Python 2.4?  I can probably do
> that but not right now...

Yes, I can do that, but only on Linux. I don't think there's anything
platform-specific here, though.

> I would prefer to avoid the extra file by writing this test as part of
> test_PDB_unit.py - but the "with" statement isn't valid syntax on Python 2.4,
> although it can be used on Python 2.5 via:
> from __future__ import with_statement
> 
> Could you re-write this to avoid the with statement?

I think the with statement is isomorphic to a try-except-finally arrangement,
calling the context manager's __enter__ method in the try block and __exit__ in
the finally block. I'll look at the source code of the warnings module and
maybe just copy a substantial chunk of it into this unit test (assuming it's
pure Python). That might make it possible to support Py2.4, too.
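
A minimal sketch of that translation, with the caveat that __enter__ is
strictly called just before the try block rather than inside it.  The helper
name collect_warnings is made up, and warnings.catch_warnings itself only
exists from Python 2.6, so on Python 2.4 the context manager would still have
to be copied or reimplemented.

```python
import warnings


def collect_warnings(func):
    """Call func() and return the warnings it issued, without using 'with'."""
    manager = warnings.catch_warnings(record=True)
    # Equivalent to: with warnings.catch_warnings(record=True) as caught:
    caught = manager.__enter__()
    try:
        warnings.simplefilter("always")
        func()
    finally:
        # Runs whether or not func() raised, restoring the old filters.
        manager.__exit__(None, None, None)
    return caught


def noisy():
    warnings.warn("something looks wrong", UserWarning)


caught = collect_warnings(noisy)
```

Passing None three times to __exit__ matches the no-exception case; this
simplified form does not let the context manager inspect or suppress an
exception the way a real with statement would.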

> Ideally the unit tests should work in any order - and this is generally a
> reasonable assumption, as they should be independent.  Having some carefully
> named unit tests will only hide the ordering problem (which is due to the 
> global state information in the warnings module).  At the very least, we should
> probably have comments in the code about this (to avoid issues in the future)
> and maybe use an eye-catching name like AAAAA which should always come first.
> 

Agreed. I'll tinker with it some more to see what can be improved here.




From bugzilla-daemon at portal.open-bio.org  Mon May 11 12:40:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 11 May 2009 08:40:49 -0400
Subject: [Biopython-dev] [Bug 2783] Using alternative start codons in
	Bio.Seq translate method/function
In-Reply-To: <bug-2783-42@http.bugzilla.open-bio.org/>
Message-ID: <200905111240.n4BCenqD006754@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2783


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-11 08:40 EST -------
Created an attachment (id=1298)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1298&action=view)
Patch for Bio/Seq.py to support complete CDS translation with non-standard
start codons

I've recently been doing CDS translations for viral/bacterial genes with
alternative start codons - and would like to fix this limitation in Biopython,
rather than having to hack around it.

On Bug 2381, comment #14, I wrote:
> For comparison, the following is copied from the BioPerl documentation about
> their sequence object's translate method.  It would be nice to follow some of
> the same naming conventions for any optional arguments.
> 
> http://www.bioperl.org/Core/Latest/bptutorial.html#iii_3_1_manipulating_sequence_data_with_seq_methods
> 
> If we want to translate full coding regions (CDS) the way major nucleotide
> databanks EMBL, GenBank and DDBJ do it, the translate() method has to perform
> more checks. Specifically, translate() needs to confirm that the sequence has
> appropriate start and terminator codons at the very beginning and the very end
> of the sequence and that there are no terminator codons present within the
> sequence in frame 0. In addition, if the genetic code being used has an
> atypical (non-ATG) start codon, the translate() method needs to convert the
> initial amino acid to methionine. These checks and conversions are triggered
> by setting ``complete'' to 1:
> 
>   $prot_obj = $my_seq_object->translate(-complete => 1);
> 

On Bug 2381, comment #51, Leighton wrote:
> In terms of nomenclature:
> 
> The default behaviour of translate() as Peter proposed: read through in-frame
> and translate with the appropriate codon table - is fine in nearly all
> circumstances.  Most other circumstances are covered by stopping at the first
> in-frame stop codon, which Peter has implemented, and is an option we all seem
> to agree on.
> 
> Biologically-speaking, this behaviour is not always correct for CDS in
> prokaryotes, where alternative start codons may occur a significant minority
> of the time.  These will be mistranslated if no provision is made for them.  I
> think a useful biological sequence object should at least try to mimic actual
> biology, so we should provide an option to handle this.
> 
> We should not assume that a sequence is a CDS unless it is specified by the
> user.  It seems reasonable to me that the term 'cds' should occur in any such
> argument from the user.
> 
> We have at least two options for how to proceed with a CDS: i) we can provide
> a strict CDS-type translation, which requires confirmation that the sequence
> is, in fact, a CDS; ii) we can provide a weak CDS-type translation, which only
> modifies the way the start codon is translated.  In both cases, behaviour is
> specific to CDS, and so having 'cds' in the argument name *somewhere* seems
> obvious, and entirely reasonable.

Leighton's option (ii) is start codon only modification.  This is what I
implemented in the patch on comment 1 (attachment 1259).  We haven't agreed on
a good name for this - which is partly why I went back to revisit the
alternative:

Leighton's option (i) is strict CDS-type translation.  As Leighton suggests,
having "cds" in the argument name here makes sense.  Regarding the BioPerl
argument name for this functionality, "complete", on Bug 2381 comment 19,
Martin wrote:
> The "complete" is a cryptic naming, I wouldn't be fond of it.
>

I think you are both right about the naming.  Would complete_cds=True be
clear?  In fact, I quite like the idea of using cds=True which is short and
also fairly clear.  This patch adds a complete_cds=Boolean argument to the
Bio.Seq translate methods and function, which should act like the BioPerl
equivalent.  It includes doctests showing the new functionality.

I would like to use either of these approaches in Biopython - but not both ;)
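
To spell out what the strict check involves, here is a plain-Python sketch
of the behaviour described in the BioPerl text quoted above.  It is not the
attached patch: the function name, the hard-coded codon table and the start
codon set (taken from NCBI table 11, bacterial) are assumptions for the
example, and ambiguous bases are not handled.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codon -> amino acid mapping in the usual NCBI TCAG ordering.
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES),
                       AMINO_ACIDS))
# Start codons listed for NCBI genetic code table 11.
START_CODONS = frozenset(["TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"])


def translate_cds(seq):
    """Translate a complete CDS, raising ValueError if it is not one."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3:
        raise ValueError("Sequence length is not a multiple of three")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS:
        raise ValueError("First codon %r is not a start codon" % codons[0])
    if CODON_TABLE[codons[-1]] != "*":
        raise ValueError("Final codon %r is not a stop codon" % codons[-1])
    protein = "".join(CODON_TABLE[codon] for codon in codons[1:-1])
    if "*" in protein:
        raise ValueError("Extra in-frame stop codon found")
    # Any valid start codon is translated as methionine.
    return "M" + protein


protein = translate_cds("GTGGCCATTGTAATGGGCCGCTGA")
```

Here GTG would normally give valine, but as the first codon of a complete
CDS it becomes methionine, and the trailing stop codon is dropped from the
result.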




From bugzilla-daemon at portal.open-bio.org  Mon May 11 20:00:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 11 May 2009 16:00:29 -0400
Subject: [Biopython-dev] [Bug 2826] New: when creating a de-novo SeqRecord,
	the dbxrefs are not written by SeqIO.write
Message-ID: <bug-2826-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2826

           Summary: when creating a de-novo SeqRecord, the dbxrefs are not
                    written by SeqIO.write
           Product: Biopython
           Version: 1.49
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: david.wyllie at ndm.ox.ac.uk


Hi

when creating a SeqRecord de novo, the dbxrefs are not written by SeqIO.write.
Is this the intended behaviour?

here is an example:

# example script
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# list to hold output records
outlist=[]

# ofh is the output file handle
ofh = open("/home/dwyllie/temporary.gbk","w")

# example of de novo creation of SeqRecord object from url:
# http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html
rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT",
                    generic_protein),
                id="NP_418483.1", name="b4059",
                description="ssDNA-binding protein",
                dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"])

print rec

outlist.append(rec)
count = SeqIO.write(outlist, ofh, "genbank")
ofh.close()

# end of script

OUTPUT:
ID: NP_418483.1
Name: b4059
Description: ssDNA-binding protein
Database cross-references: ASAP:13298, GI:16131885, GeneID:948570
Number of features: 0
Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT', ProteinAlphabet())

Contents of temporary.gbk:
LOCUS       b4059                     46 bp                     UNK 01-JAN-1980
DEFINITION  ssDNA-binding protein
ACCESSION   NP_418483
VERSION     NP_418483.1
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 MASRGVNKVI LVGNLGQDPE VRYMPNGGAV ANITLATSES WRDKAT
//




From bugzilla-daemon at portal.open-bio.org  Mon May 11 20:29:02 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 11 May 2009 16:29:02 -0400
Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank
	by SeqIO
In-Reply-To: <bug-2826-42@http.bugzilla.open-bio.org/>
Message-ID: <200905112029.n4BKT2x0024871@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2826


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|when creating a de-novo     |SeqRecord dbxrefs not
                   |SeqRecord, the dbxrefs are  |written to GenBank by SeqIO
                   |not written by SeqIO.write  |


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-11 16:29 EST -------
Hi David,

Thank you for another interesting bug report. See here for what the NCBI uses
in a GenPept file for this example protein, NP_418483.1
http://www.ncbi.nlm.nih.gov/protein/16131885

The ASAP and GeneID numbers are not recorded at the sequence level - there is
nowhere in the GenBank file format to put them.  They are however recorded
within a CDS feature on the link above.  So, if you want these recorded, you'd
have to create a SeqFeature with the information (you can't use the SeqRecord's
dbxrefs list).

The GI number would get written, but due to an anomaly in the GenBank parser
this is currently stored in the annotations dictionary under the key "gi", so
this is where the GenBank writer looks for this.  We should probably switch to
recording this in the dbxrefs as "gi:12345" as well/instead, and look for this
GI number there instead/as well.

Currently when parsing GenBank files, the only thing stored in the SeqRecord's
dbxref list is a PROJECT line cross reference (see Bug 2225).  Looking at the
code, we don't currently record that - we should.

Peter




From bugzilla-daemon at portal.open-bio.org  Mon May 11 22:55:21 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 11 May 2009 18:55:21 -0400
Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank
	by SeqIO
In-Reply-To: <bug-2826-42@http.bugzilla.open-bio.org/>
Message-ID: <200905112255.n4BMtLFc004295@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2826


------- Comment #2 from david.wyllie at ndm.ox.ac.uk  2009-05-11 18:55 EST -------
Thank you. I'm new to BioPython.

The goal was to take some whole-genome sequence (which isn't in Genbank) and
attach a taxon to it, in order that it be written to a BioSQL database.

Other records in the BioSQL database derive from NCBI and so have taxon_ids, so
the additional WGS being in a similar format would make things simpler.

Thank you very much for all your assistance.




From cy at cymon.org  Tue May 12 11:07:59 2009
From: cy at cymon.org (Cymon Cox)
Date: Tue, 12 May 2009 12:07:59 +0100
Subject: [Biopython-dev] Clustal alignment format header line
Message-ID: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>

Both Muscle (-clw) and Probcons (-clustalw)  output a programme specific
header line for the clustal format alignment:

"MUSCLE (3.7) multiple sequence alignment


AK1H_ECOLI/1-378      CPDSINAALICRGEKMSIAIMAGVLEAR etc"

"PROBCONS version 1.12 multiple sequence alignment

AK1H_ECOLI/1-378    CPDSINAALICRGEKMSIAIMA

"

Bio.AlignIO will not read these alignments
Bio/AlignIO/ClustalIO.py:94
 if line[:7] != 'CLUSTAL':
       raise ValueError("Did not find CLUSTAL header")

Muscle does have a -clwstrict flag but ProbCons doesn't.

Would it be a good idea to relax the header parsing?

C.
--


From biopython at maubp.freeserve.co.uk  Tue May 12 15:28:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 16:28:35 +0100
Subject: [Biopython-dev] Clustal alignment format header line
In-Reply-To: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>
References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>
Message-ID: <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com>

On Tue, May 12, 2009 at 12:07 PM, Cymon Cox <cy at cymon.org> wrote:
> Both Muscle (-clw) and Probcons (-clustalw)  output a programme specific
> header line for the clustal format alignment:
>
> "MUSCLE (3.7) multiple sequence alignment
>
>
> AK1H_ECOLI/1-378      CPDSINAALICRGEKMSIAIMAGVLEAR etc"
>
> "PROBCONS version 1.12 multiple sequence alignment
>
> AK1H_ECOLI/1-378    CPDSINAALICRGEKMSIAIMA
>
> "
>
> Bio.AlignIO will not read these alignments
> Bio/AlignIO/ClustalIO.py:94
>  if line[:7] != 'CLUSTAL':
>        raise ValueError("Did not find CLUSTAL header")
>
> Muscle does have a -clwstrict flag but ProbCons doesnt.
>
> Would it be a good idea to relax the header parsing?
>
> C.

Maybe.  Up until now the only example of this I had personally come
across was MUSCLE, but they helpfully provide the -clwstrict argument
so the issue wasn't important.

There are also of course the official variants like:

CLUSTAL W (1.81) multiple sequence alignment
CLUSTAL 2.0.9 multiple sequence alignment

How would you code this?  A flexible option would be to take anything
where the first line ends with "multiple sequence alignment", but this
risks letting a lot of non-clustal files through which will then
(hopefully) fail, but probably with a much more cryptic error message.
A white list of safe variants like "MUSCLE" and "PROBCONS" would be
safest.

Also I have a vague memory of some tool using something like "CLUSTAL
... from ToolX" but I don't recall the details.

Peter


From cy at cymon.org  Tue May 12 15:43:47 2009
From: cy at cymon.org (Cymon Cox)
Date: Tue, 12 May 2009 16:43:47 +0100
Subject: [Biopython-dev] Clustal alignment format header line
In-Reply-To: <320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com>
References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com> 
	<320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com>
Message-ID: <7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com>

2009/5/12 Peter <biopython at maubp.freeserve.co.uk>

> On Tue, May 12, 2009 at 12:07 PM, Cymon Cox <cy at cymon.org> wrote:
> > Both Muscle (-clw) and Probcons (-clustalw)  output a programme specific
> > header line for the clustal format alignment:
> >
> > "MUSCLE (3.7) multiple sequence alignment
> >
> >
> > AK1H_ECOLI/1-378      CPDSINAALICRGEKMSIAIMAGVLEAR etc"
> >
> > "PROBCONS version 1.12 multiple sequence alignment
> >
> > AK1H_ECOLI/1-378    CPDSINAALICRGEKMSIAIMA
> >
> > "
> >
> > Bio.AlignIO will not read these alignments
> > Bio/AlignIO/ClustalIO.py:94
> >  if line[:7] != 'CLUSTAL':
> >       raise ValueError("Did not find CLUSTAL header")
> >
> > Muscle does have a -clwstrict flag but ProbCons doesnt.
> >
> > Would it be a good idea to relax the header parsing?
> >
> > C.
>
> Maybe.  Up until now the only example of this I had personally come
> across was MUSCLE, but they helpfully provide the -clwstrict argument
> so the issue wasn't important.
>
> There are also of course the official variants like:
>
> CLUSTAL W (1.81) multiple sequence alignment
> CLUSTAL 2.0.9 multiple sequence alignment
>
> How would you code this?  A flexible option would be to take anything
> where the first line ends with "multiple sequence alignment", but this
> risks letting a lot of non-clustal files though which will then
> (hopefully) fail, but probably with a much more cryptic error message.
> A white list of safe variants like "MUSCLE" and "PROBCONS" would be
> safest.
>
> Also I have a vague memory of some tool using something like "CLUSTAL
> ... from ToolX" but I don't recall the details.


T-COFFEE for one:
"CLUSTAL FORMAT for T-COFFEE Version_6.92 [http://www.tcoffee.org] [MODE:
], CPU=0.00 sec, SCORE=100, Nseq=2, Len=601"

Is it so bad to let it fail on the structure of the data - effectively
ignore the header? Maybe have a general "this doesnt look like clustal
formatted data" error based on the data structure...

C.

--


From biopython at maubp.freeserve.co.uk  Tue May 12 16:05:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 17:05:15 +0100
Subject: [Biopython-dev] Loading SeqRecords into BioSQL with NCBI taxon ID
Message-ID: <320fb6e00905120905l3d0e31b2p2a3d92f61096cbd5@mail.gmail.com>

Over on Bug 2826, David wrote:
http://bugzilla.open-bio.org/show_bug.cgi?id=2826#c2

> Thank you. I'm new to BioPython.
>
> The goal was to take some whole-genome sequence (which isn't in Genbank) and
> attach a taxon to it, in order that it be written to a BioSQL database.

You've talked about trying to parse WGS GenBank files on Bug 2825 but
presumably if this new data isn't in GenBank, it is in another format.

What format is your  whole-genome sequence?  FASTA or something simple?

> Other records in the BioSQL database derive from NCBI and so have taxon_ids,
> so the additional WGS being in a similar format would make things simpler.

I see. Basically you need to import a SeqRecord into BioSQL with an
NCBI taxon ID.  You don't need to write out a GenBank file to do this.

First create the SeqRecord, e.g.

from Bio import SeqIO
record = SeqIO.read(handle, format, alphabet)

There are now two options - because the BioSQL loader will look for
the NCBI taxon ID in two places:

(Option 1) Record the NCBI taxon ID in the SeqRecord's annotation
dictionary under the "ncbi_taxid" key.  This should work (untested):

record.annotations["ncbi_taxid"] = 12345 #or single element list, [12345]

(Option 2) Mimic a SeqRecord from parsing a GenBank file with a source
feature containing the taxon ID. This should work (untested):

#Create the SeqRecord:
record = SeqIO.read(handle, format, alphabet)
#Create the source features:
from Bio.SeqFeature import SeqFeature, FeatureLocation
f = SeqFeature(FeatureLocation(0, len(record)), strand=+1, type="source")
f.qualifiers["db_xref"] = ["taxon:12345"]
record.features = [f] #or insert at start

If you don't really have a sequence, this second approach doesn't make
so much sense.

[Arguably there could be a third option via the dbxref's list]
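
For illustration only, here is a minimal sketch of the two-place lookup described above. It uses plain dictionaries and tuples rather than real SeqRecord and SeqFeature objects, the function name find_ncbi_taxon_id is hypothetical, and the order in which the two places are checked is an assumption rather than something taken from the BioSQL loader itself.

```python
def find_ncbi_taxon_id(annotations, features):
    """Return an NCBI taxon ID as an int, or None if neither place has one.

    annotations: dict standing in for SeqRecord.annotations
    features: list of (type, qualifiers) pairs standing in for SeqFeature objects
    """
    # Option 1: the "ncbi_taxid" annotation (a value or a single element list).
    value = annotations.get("ncbi_taxid")
    if isinstance(value, list):
        value = value[0] if len(value) == 1 else None
    if value is not None:
        return int(value)
    # Option 2: a source feature with a db_xref qualifier like "taxon:12345".
    for feature_type, qualifiers in features:
        if feature_type != "source":
            continue
        for xref in qualifiers.get("db_xref", []):
            if xref.startswith("taxon:"):
                return int(xref[len("taxon:"):])
    return None
```

With this sketch, find_ncbi_taxon_id({"ncbi_taxid": [12345]}, []) and find_ncbi_taxon_id({}, [("source", {"db_xref": ["taxon:12345"]})]) both give the same integer, which is the point of having two equivalent options.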

Then in either case, load the modified SeqRecord into the database.
You may want to pre-load the NCBI taxonomy, see
http://www.biopython.org/wiki/BioSQL

Alternatively, using Biopython 1.49+ you can have this fetched from
Entrez on demand with the fetch_NCBI_taxonomy=True option.  The BioSQL
wiki page needs updating on this topic.

Peter


From bugzilla-daemon at portal.open-bio.org  Tue May 12 16:11:43 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 12 May 2009 12:11:43 -0400
Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank
	by SeqIO
In-Reply-To: <bug-2826-42@http.bugzilla.open-bio.org/>
Message-ID: <200905121611.n4CGBhrY001864@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2826


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-12 12:11 EST -------
(In reply to comment #2)
> Thank you. I'm new to BioPython.
> 
> The goal was to take some whole-genome sequence (which isn't in Genbank) and
> attach a taxon to it, in order that it be written to a BioSQL database.

For this example you don't need to write out a GenBank file at all (which is
what this bug was about).  See my email on the mailing list for details:

http://lists.open-bio.org/pipermail/biopython/2009-May/005154.html
and sent in error to the dev list:
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006028.html

I am leaving this bug open for relevant dbxrefs entries not currently recorded
when writing GenBank files with Bio.SeqIO (GI number which goes on the VERSION
line, and genome projects on the PROJECT / DBLINK line).

Peter




From biopython at maubp.freeserve.co.uk  Tue May 12 16:16:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 17:16:35 +0100
Subject: [Biopython-dev] Clustal alignment format header line
In-Reply-To: <7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com>
References: <7265d4f0905120407j151408bcnf988ba554fcb5140@mail.gmail.com>
	<320fb6e00905120828t139113d8rceb7db874acdd8a0@mail.gmail.com>
	<7265d4f0905120843j172f5303y7029e1f7b5f4187f@mail.gmail.com>
Message-ID: <320fb6e00905120916p3db7c003kf6eef581cbb4c93b@mail.gmail.com>

On Tue, May 12, 2009 at 4:43 PM, Cymon Cox <cy at cymon.org> wrote:
>Peter wrote:
>> Also I have a vague memory of some tool using something like "CLUSTAL
>> ... from ToolX" but I don't recall the details.
>
> T-COFFEE for one:
> "CLUSTAL FORMAT for T-COFFEE Version_6.92 [http://www.tcoffee.org] [MODE:
> ], CPU=0.00 sec, SCORE=100, Nseq=2, Len=601"

Yes - that is almost certainly the example I was thinking of.

> Is it so bad to let it fail on the structure of the data - effectively
> ignore the header? Maybe have a general "this doesnt look like clustal
> formatted data" error based on the data structure...

Some of the current error messages are a little cryptic to an end
user, I guess they could have "Are you sure this is a Clustal format
file?" appended to them.

I'd be happy with a whitelist of variant headers, i.e. must start with
"CLUSTAL", "MUSCLE" or "PROBCONS" (assuming these tools don't write
their own file formats which also start that way!).  If people find
new cases and report them, it also gives us notice about another tool
we may want to include in our command line wrappers, and/or obtain
sample output files for the unit tests.
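
A minimal sketch of the whitelist idea, assuming the check is done on the first line only. The names KNOWN_HEADERS and check_clustal_header are hypothetical and this is not the code that was checked in; the hint appended to the error message follows the wording suggested above.

```python
# Hypothetical whitelist of first words seen on Clustal-style header lines.
KNOWN_HEADERS = ("CLUSTAL", "MUSCLE", "PROBCONS")


def check_clustal_header(line):
    """Return the matching header keyword, or raise ValueError."""
    for keyword in KNOWN_HEADERS:
        if line.startswith(keyword):
            return keyword
    raise ValueError(
        "Did not find a known Clustal-style header (expected one of %s). "
        "Are you sure this is a Clustal format file?" % ", ".join(KNOWN_HEADERS)
    )
```

Note that the T-COFFEE header quoted above starts with "CLUSTAL FORMAT for T-COFFEE", so it already passes under the "CLUSTAL" entry without needing its own keyword.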

Peter


From biopython at maubp.freeserve.co.uk  Tue May 12 17:14:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 18:14:27 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
	<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
	<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
	<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
	<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
	<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
	<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
	<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
	<8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
Message-ID: <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>

On Tue, Apr 28, 2009 at 6:50 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:
> On Tue, Apr 28, 2009 at 7:45 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> I take that back - I added an email address of just "peterc" to my
>> github account (it seems they don't do any validation, perhaps for
>> this very reason?).  This had no immediate effect, but one day later
>> and all my CVS commits are now shown with my photo in github.  Neat -
>
> great

That seems to have stopped working now - no idea why, "peterc" is
still listed as one of my email addresses on my github account, but my
github account is no longer linked to commits in Biopython.  Odd.

Do you think it would be straight forward for your CVS to git
conversion to map the CVS usernames to github usernames for future
commits (so as not to alter the currently published history)?

Peter


From bugzilla-daemon at portal.open-bio.org  Tue May 12 17:33:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 12 May 2009 13:33:03 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905121733.n4CHX3jK009739@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #32 from cymon.cox at gmail.com  2009-05-12 13:33 EST -------
Added PROBCONS and TCOFFEE command line interfaces and unittests.

The TCOFFEE command line implements a very restricted set of options (just those
Brad attached). 

Also added white list of known headers to AlignIO/ClustalIO.py:97 - the
PROBCONS unittest will fail without this alteration.

On http://github.com/cymon/biopython-github-master/tree/applic-int
C.




From bartek at rezolwenta.eu.org  Tue May 12 18:23:18 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 12 May 2009 20:23:18 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
	<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
	<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
	<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
	<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
	<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
	<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
	<8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
	<320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
Message-ID: <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>

On Tue, May 12, 2009 at 7:14 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> That seems to have stopped working now - no idea why, "peterc" is
> still listed an one of my email addresses on my github account, but my
> github account is no longer linked to commits in Biopython.  Odd.

It seems to be OK again. Maybe it was temporary ?

>
> Do you think it would be straight forward for your CVS to git
> conversion to map the CVS usernames to github usernames for future
> commits (so as not to alter the currently published history)?
>

It would be straightforward to add a mapping to the conversion, but I
think it would affect the whole history...

I was thinking that the mapping was going to change when we finally
switch to git. Then it would be a natural course of events...
Otherwise, we would have another step in our transition. Whether it's
worth doing it, depends on how long we expect to be in the transition
between CVS and git.

cheers
Bartek


From bugzilla-daemon at portal.open-bio.org  Tue May 12 18:44:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 12 May 2009 14:44:09 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905121844.n4CIi9sb017010@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1296 is|0                           |1
           obsolete|                            |


------- Comment #33 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-12 14:44 EST -------
(From update of attachment 1296)
(In reply to comment #32)
> Added PROBCONS and TCOFFEE command line interfaces and unittests.
> 
> The TCOFFEE command line implements a very restricted set of options
> (just those Brad attached). 
> 
> Also added white list of known headers to AlignIO/ClustalIO.py:97 - the
> PROBCONS unittest will fail without this alteration.
> 
> On http://github.com/cymon/biopython-github-master/tree/applic-int

Thank you Cymon and Brad - those are now checked in, more or less as is.
I did tweak Bio/AlignIO/ClustalIO.py a little bit.  Also, TCoffee says it can
be installed on Windows using Cygwin - we should try that at some point ;)

Note for the TCoffee suite we could also consider adding xpresso, 3dcoffee,
mcoffee and rcoffee as well - hopefully they have similar interfaces so with
some subclassing we won't have to duplicate a lot of the code.

One other thought - do you think the EMBOSS water and needle wrappers (and any
other alignment tools in EMBOSS) should be made available under
Bio.Align.Applications (via an import in Bio/Align/Applications/__init__.py so
no code duplication)?

Peter




From biopython at maubp.freeserve.co.uk  Tue May 12 18:57:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 19:57:24 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
	<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
	<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
	<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
	<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
	<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
	<8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
	<320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
	<8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>
Message-ID: <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com>

On Tue, May 12, 2009 at 7:23 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:
>
> I was thinking that the mapping was going to change when we finally
> switch to git. Then it would be a natural cause of events...
> Otherwise, we would have another step in our transition. Whether it's
> worth doing it, depends on how long we expect to be in the transition
> between CVS and git.

I'm happy that git will work, and that I personally know enough about
the basics to manage.

I'm not happy with the current github repository due to the history
tag issue - but we know we can fix that now.  Are you going to try
removing the old tags and re-doing them on github?

Does anyone know how the git provided "ViewCVS" equivalent shows tags
in a file's history?

I think we should now have a chat with the OBF (off list) about how we
might go about installing git on their server.  Commits can then be
pushed out to github automatically (or pulled from github if we go the
other way round).  This would make several things easier:

(1) Seamless continuation of existing user accounts
(2) Keeping the snapshot code up to date: http://biopython.org/SRC/biopython/
(3) Having our own commit RSS feeds (not essential as this could be
done on github)
(4) Having automatic builds of the documentation (previously discussed
as nice to have)

Plus of course giving redundancy with the code mirrored on both OBF
servers and GitHub :)

Peter


From bugzilla-daemon at portal.open-bio.org  Tue May 12 19:45:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 12 May 2009 15:45:12 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905121945.n4CJjCFj023070@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #34 from cymon.cox at gmail.com  2009-05-12 15:45 EST -------
(In reply to comment #33)
> Note for the TCoffee suite we could also consider adding xpresso, 3dcoffee,
> mcoffee and rcoffee as well - hopefully they have similar interfaces so with
> some subclassing we won't have to duplicate a lot of the code.

With the latest version of t_coffee (and not the currently available Jaunty
package!), these (ie the meta calls like mcoffee etc) are all covered by the
"-mode" option. I just installed t_coffee from source and this appears to be
the case. There are so many options and interdependencies in TCOFFEE, and its
command line is clearly a moving target, that the interface may require more
work before being released.

> One other thought - do you think the EMBOSS water and needle wrappers (and any
> other alignment tools in EMBOSS) be made available under Bio.Align.Applications
> (via an import in Bio/Align/Applications/__init__.py so no code duplication)?

Sounds good to me.
C.




From biopython at maubp.freeserve.co.uk  Tue May 12 23:04:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 May 2009 00:04:53 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
	<20090413123219.GB5429@sobchak.mgh.harvard.edu>
	<320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
	<20090413134429.GE5429@sobchak.mgh.harvard.edu>
	<320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com>
Message-ID: <320fb6e00905121604q4c70d69ck35fb16210fb0efe2@mail.gmail.com>

On Mon, Apr 13, 2009 at 2:49 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>>> > ... Feel free to add away.
>>>
>>> I need to work on my delegation skills - that seems to have back fired ;)
>>
>> Oops. I honestly read that as "do I have your permission?" I can of
>> course tackle this, but am a bit underwater now.
>
> Looking back, I was a bit ambiguous.  I don't mind who does it - let's
> see who has time free first.

That's done in CVS now - plus a few other things like -die and -stdout.
I've also done -outfile via the new base Emboss wrapper, as all the
tools (so far at least) include this option.

>>> Regarding adding -auto support, I have a question about the needle
>>> wrapper and the gap parameters.  Using the needle tool at the command
>>> line will prompt for the gap parameters UNLESS the -auto argument has
>>> been used.  i.e. Without -auto, it makes sense to insist on the gap
>>> parameters being included, which is what the current wrapper does.
>>> However, if we add support for -auto, then these parameters can be
>>> optional.  We could handle this in the wrapper, but it would be messy
>>> (and there may be similar questions with other EMBOSS tools).  What do
>>> you think - stick with the simple option of insisting the Biopython
>>> user set the gap parameters, even if they are using -auto?
>>
>> I think we should stick with the simple option. These were meant to
>> be pretty dumb specifiers that help users write more modular code than
>> simply pasting in a raw string for the command line. Trying to get
>> too fancy is probably overkill.
>
> Agreed.

By putting the outfile argument on the base EMBOSS wrapper class,
together with the related -filter and -stdout options, I was able to
enforce a simple check that at least one of these is used, applicable
to all the wrappers.  This preserves the old safety check that the
output file is required (unless using standard out via -filter and/or
-stdout instead).
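
As a rough sketch of that safety check, written as a stand-alone function rather than the actual base class code: the function name, argument names and error message below are all hypothetical.

```python
def check_output_option(outfile=None, use_filter=False, use_stdout=False):
    """Raise ValueError unless at least one output route has been selected.

    outfile: output filename (the -outfile argument)
    use_filter, use_stdout: True if -filter or -stdout has been requested
    """
    if not (outfile or use_filter or use_stdout):
        raise ValueError(
            "You must either set outfile (output filename), "
            "or enable filter or stdout (output to stdout)."
        )
```

Because the check only looks at these three shared options, it can live in one place and apply to every tool wrapper that inherits from the base class.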

Something similar could be done so that using -auto overrides any
"required" flags we have set (e.g. for gapopen in water), but this
seems unnecessary to me (as discussed above).

Peter


From biopython at maubp.freeserve.co.uk  Wed May 13 09:55:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 May 2009 10:55:06 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <20090505123656.GB15113@sobchak.mgh.harvard.edu>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
	<20090505123656.GB15113@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>

On Mon, May 4, 2009, Peter  wrote:
>>> ... The (hardly used) existing blastall wrapper in
>>> Bio/Blast/Applications.py gives the "-a" argument a human
>>> readable name of "nprocessors", and "-A" gets "window_size".
>>> With the old set_parameter call either alias could be used.
>>> However, with a python property we need to pick one as a
>>> preferred name - and I'm not 100% sure being helpful and
>>> using "nprocessors" (e.g. cline.nprocessors=4) is actually
>>> better than using the actual argument name (e.g. cline.a = 4).

On Tue, May 5, 2009, Brad wrote:
>> Could we support both the original argument and optional human
>> readable arguments? I know the code in Application is a bit
>> hard coded for the first argument as the real name and the last
>> argument as the readable name; the cleanest solution would be to
>> generalize this to have multiple names where it makes sense.
>> ...

On Tue, May 5, 2009, Peter wrote:
> ...
> I favour using only a single property for each parameter, with the
> name as similar as possible to the actual command line switch (i.e.
> property name "a" for "-a", not "nprocessors").  Note each property
> would have a docstring which will say what is it for ("Number of
> processors to use.").

I still favour only using a single python property for each parameter,
but after some work on the blastall wrapper last night, I am
beginning to come round to your point of view.

If a command line tool provides a long parameter name (some tools
provide both short and long names for important parameters) we
should use that rather than inventing our own [so no change here].

However, for tools like BLAST which *only* have cryptic single letter
command line options (case sensitive), maybe we should be using
a sensible human readable name for the associated property in the
Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size"
for "-A").  Having actually now tried using properties "a" and "A",
the resulting python code is very cryptic - and only makes sense
if you are familiar with the blastall arguments (and given there are
so many of them, this is difficult!).

It should be trivial to extend the documentation strings automatically
to include something like "This maps onto the XXX command line
argument" so that the mapping is clear to the user without having to
look at our source code.
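
To make the idea concrete, here is a toy sketch (not the Bio.Application code) of a property with a human readable name whose docstring is extended automatically with the switch it maps onto. The helper make_option_property and the class BlastallSketch are hypothetical, and the description given for "-A" is just a placeholder.

```python
def make_option_property(switch, description):
    """Build a property storing the value for one command line switch."""
    # Append the mapping to the docstring so help() shows it to the user.
    doc = "%s\n\nThis maps onto the %s command line argument." % (description, switch)

    def getter(self):
        return self._values.get(switch)

    def setter(self, value):
        self._values[switch] = value

    return property(getter, setter, doc=doc)


class BlastallSketch(object):
    """Toy stand-in for a command line wrapper; not the Biopython class."""

    nprocessors = make_option_property("-a", "Number of processors to use.")
    window_size = make_option_property("-A", "Window size (placeholder text).")

    def __init__(self):
        self._values = {}

    def __str__(self):
        parts = ["blastall"]
        for switch in sorted(self._values):
            parts.append("%s %s" % (switch, self._values[switch]))
        return " ".join(parts)
```

Setting cline.nprocessors = 4 on an instance then adds "-a 4" to the command line string, and help(BlastallSketch) shows which switch each property controls.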

Hopefully this gets the balance right between giving nice python
code, and staying faithful to the actual command line tool API.

Peter


From biopython at maubp.freeserve.co.uk  Wed May 13 11:15:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 May 2009 12:15:35 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
	<20090505123656.GB15113@sobchak.mgh.harvard.edu>
	<320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>
	<7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com>
Message-ID: <320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com>

On Wed, May 13, 2009 at 11:50 AM, Cymon Cox <cy at cymon.org> wrote:
>> On Tue, May 5, 2009, Peter wrote:
>> > ...
>> > I favour using only a single property for each parameter, with the
>> > name as similar as possible to the actual command line switch (i.e.
>> > property name "a" for "-a", not "nprocessors").  Note each property
>> > would have a docstring which will say what is it for ("Number of
>> > processors to use.").
>>
>> I still favour only using a single python property for each parameter,
>
> A confusing issue arises where we have alternative names for options.
> Take the following example from _Probcons.py:
>
>             _Option(["-c", "c", "--consistency", "consistency" ], ["input"],
>                     lambda x: x in range(0,6),
>                     0,
>                     "Use 0 <= REPS <= 5 (default: 2) passes of consistency transformation",
>                     0),
>
>>>> cmd = cmdline = ProbconsCommandline("probcons", input="blah")
>>>> cmd.c = 1
>>>> str(cmd)
> 'probcons blah '
>>>> cmd.set_parameter("c", 1)
>>>> str(cmd)
> 'probcons -c 1 blah '
>>>> cmd.consistency = 2
>>>> str(cmd)
> 'probcons -c 2 blah '
>>>> cmd.c = 5
>>>> str(cmd)
> 'probcons -c 2 blah '
>
> That is, the user needs to look at the code to figure out what the correct
> name is to use when assigning to the property. Is it possible to restrict
> the binding of attributes to the cmdline to only valid property names? An
> alternative would be to restrict all parameters to only one name and
> document the alternatives it covers (dont like this idea - see below).

Yes, you can use any of the defined aliases with set_parameter, and
they are all equally valid, and all do exactly the same thing.  e.g.

cmd = ProbconsCommandline("probcons", input="blah")
cmd.set_parameter("c", 1)
cmd.set_parameter("-c", 1)
cmd.set_parameter("--consistency", 1)
cmd.set_parameter("consistency", 1)

I would however regard set_parameter as a legacy method and
push the (single) keyword argument or property alternative, for
which there is only one name (here "consistency" ):

cmd = ProbconsCommandline("probcons", input="blah")
cmd.consistency = 1

or,

cmd = ProbconsCommandline("probcons", input="blah", consistency=1)

[And yes, we should have some error checking code in the base
class __init__ method to make sure the string used is a valid python
identifier.]

The user does NOT have to look at the source code to find this out -
just the docstrings or properties - try help(cmd) or dir(cmd) in python.
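
On the question of restricting attribute binding to valid property names,
here is a minimal sketch of one way it could work. It uses a toy class:
StrictCommandline and its "switches" argument are hypothetical names for
illustration only, not the Bio.Application API. Assigning to an undeclared
name such as "c" raises an error instead of being silently ignored.

```python
class StrictCommandline(object):
    """Toy wrapper (hypothetical): only declared parameter names may be set."""

    def __init__(self, program, switches):
        # switches maps property name -> real switch, e.g. {"consistency": "-c"}
        # object.__setattr__ bypasses our own strict __setattr__ below.
        object.__setattr__(self, "program", program)
        object.__setattr__(self, "_switches", dict(switches))
        object.__setattr__(self, "_values", {})

    def __setattr__(self, name, value):
        # Reject any name that was not declared as a parameter.
        if name not in self._switches:
            raise AttributeError("Option name %r is not known" % name)
        self._values[name] = value

    def __str__(self):
        parts = [self.program]
        for name in sorted(self._values):
            parts.append("%s %s" % (self._switches[name], self._values[name]))
        return " ".join(parts)


cmd = StrictCommandline("probcons", {"consistency": "-c"})
cmd.consistency = 2
print(str(cmd))
try:
    cmd.c = 5  # not a declared property name, so this now fails loudly
except AttributeError as err:
    print("Rejected: %s" % err)
```

With something like this, a typo or an alias that is not the preferred
property name fails at assignment time rather than quietly producing a
command line without the option.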

>> but after some work on the blastall wrapper last night, I am
>> beginning to come round to your point of view.
>>
>> If a command line tool provides a long parameter name (some tools
>> provide both short and long names for important parameters) we
>> should use that rather than inventing our own [so no change here].
>>
>> However, for tools like BLAST which *only* have cryptic single letter
>> command line options (case sensitive), maybe we should be using
>> a sensible human readable name for the associated property in the
>> Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size"
>> for "-A").  Having actually now tried using properties "a" and "A",
>> the resulting python code is very cryptic - and only makes sense
>> if you are familiar with the blastall arguments (and given there are
>> so many of them, this is difficult!).
>
> I don't agree. If you want to make your python code legible to people
> who are not familiar with the command line options, you can just
> comment it. I think the interfaces should stick as close as possible
> to the application documentation. I see these interfaces being used
> mostly by people who are familiar with the applications, in which case
> the command line construction should be fairly intuitive.

Well, I am on the fence here.  The trouble is that sometimes (e.g. BLAST)
the command line parameters themselves are just so cryptic.  Yes, we
could just use "a" and "A", and leave it up to the user to document their
code.  If we use "nprocessors" and "window_size" the code becomes
self documenting (although you have to know Biopython's mapping).

Brad's suggestion to support both in the property and keyword
arguments brings us back to having multiple choices on how to
set a parameter (as with set_parameter and its aliases), which is
confusing and unpythonic.

Peter


From cy at cymon.org  Wed May 13 10:50:54 2009
From: cy at cymon.org (Cymon Cox)
Date: Wed, 13 May 2009 11:50:54 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com> 
	<20090505123656.GB15113@sobchak.mgh.harvard.edu>
	<320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>
Message-ID: <7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com>

2009/5/13 Peter <biopython at maubp.freeserve.co.uk>

> On Mon, May 4, 2009, Peter  wrote:
> >>> ... The (hardly used) existing blastall wrapper in
> >>> Bio/Blast/Applications.py gives the "-a" argument a human
> >>> readable name of "nprocessors", and "-A" gets "window_size".
> >>> With the old set_parameter call either alias could be used.
> >>> However, with a python property we need to pick one as a
> >>> preferred name - and I'm not 100% sure being helpful and
> >>> using "nprocessors" (e.g. cline.nprocessors=4) is actually
> >>> better than using the actual argument name (e.g. cline.a = 4).
>
> On Tue, May 5, 2009, Brad wrote:
> >> Could we support both the original argument and optional human
> >> readable arguments? I know the code in Application is a bit
> >> hard coded for the first argument as the real name and the last
> >> argument as the readable name; the cleanest solution would be to
> >> generalize this to have multiple names where it makes sense.
> >> ...
>
> On Tue, May 5, 2009, Peter wrote:
> > ...
> > I favour using only a single property for each parameter, with the
> > name as similar as possible to the actual command line switch (i.e.
> > property name "a" for "-a", not "nprocessors").  Note each property
> > would have a docstring which will say what is it for ("Number of
> > processors to use.").
>
> I still favour only using a single python property for each parameter,


A confusing issue arises where we have alternative names for options. Take
the following example from _Probcons.py:


            _Option(["-c", "c", "--consistency", "consistency" ], ["input"],
                    lambda x: x in range(0,6),
                    0,
                    "Use 0 <= REPS <= 5 (default: 2) passes of consistency transformation",
                    0),

>>> cmd = cmdline = ProbconsCommandline("probcons", input="blah")
>>> cmd.c = 1
>>> str(cmd)
'probcons blah '
>>> cmd.set_parameter("c", 1)
>>> str(cmd)
'probcons -c 1 blah '
>>> cmd.consistency = 2
>>> str(cmd)
'probcons -c 2 blah '
>>> cmd.c = 5
>>> str(cmd)
'probcons -c 2 blah '

That is, the user needs to look at the code to figure out what the correct
name is to use when assigning to the property. Is it possible to restrict
the binding of attributes to the cmdline to only valid property names? An
alternative would be to restrict all parameters to only one name and
document the alternatives it covers (don't like this idea - see below).

> but after some work on the blastall wrapper last night, I am
> beginning to come round to your point of view.
>
> If a command line tool provides a long parameter name (some tools
> provide both short and long names for important parameters) we
> should use that rather than inventing our own [so no change here].
>
> However, for tools like BLAST which *only* have cryptic single letter
> command line options (case sensitive), maybe we should be using
> a sensible human readable name for the associated property in the
> Biopython wrapper (i.e. "nprocessors" for "-a", and "window_size"
> for "-A").  Having actually now tried using properties "a" and "A",
> the resulting python code is very cryptic - and only makes sense
> if you are familiar with the blastall arguments (and given there are
> so many of them, this is difficult!).


I don't agree. If you want to make your python code legible to people who are
not familiar with the command line options, you can just comment it. I think
the interfaces should stick as close as possible to the application
documentation. I see these interfaces being used mostly by people who are
familiar with the applications, in which case the command line construction
should be fairly intuitive.

Cheers, C.
--


From biopython at maubp.freeserve.co.uk  Wed May 13 13:10:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 May 2009 14:10:59 +0100
Subject: [Biopython-dev] Properties names in command line wrappers
In-Reply-To: <320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com>
References: <320fb6e00905040753g6c1a990j181f742eed858cd9@mail.gmail.com>
	<20090505123656.GB15113@sobchak.mgh.harvard.edu>
	<320fb6e00905130255j22905e80y6ff8678f6af89cb2@mail.gmail.com>
	<7265d4f0905130350s4006db5bqc40eed1ec535243d@mail.gmail.com>
	<320fb6e00905130415p6757da94id8d7508a2ea3eebf@mail.gmail.com>
Message-ID: <320fb6e00905130610g3eb8edb4q99913b8b0ae14bf9@mail.gmail.com>

On Wed, May 13, 2009 at 12:15 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> The user does NOT have to look at the source code to find this out -
> just the docstrings or properties - try help(cmd) or dir(cmd) in python.
>

I've just updated the automatically generated docstrings for each property
so that it includes the actual parameter name which will be used to build
the string.
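
As a rough illustration of the idea (a sketch only: make_option_property
and DemoCommandline are hypothetical names, and the docstring wording is an
assumption, not the text generated by Bio.Application), a property can carry
a docstring that names the real switch, so help() shows it:

```python
def make_option_property(switch, description):
    """Build a property whose docstring names the real command line switch."""
    # Private attribute used to store the value, e.g. "_value_for_c".
    attr = "_value_for_" + switch.lstrip("-")

    def getter(self):
        return getattr(self, attr, None)

    def setter(self, value):
        setattr(self, attr, value)

    doc = "%s\n\nThis property controls the %s parameter." % (description, switch)
    return property(getter, setter, doc=doc)


class DemoCommandline(object):
    """Toy class with a single documented option (hypothetical)."""

    consistency = make_option_property(
        "-c", "Passes of consistency transformation (0 to 5)."
    )


# The docstring is available on the class attribute, which is what help() uses.
print(DemoCommandline.consistency.__doc__)
```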

Peter


From bugzilla-daemon at portal.open-bio.org  Wed May 13 15:01:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 13 May 2009 11:01:33 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905131501.n4DF1XYv019413@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #35 from cymon.cox at gmail.com  2009-05-13 11:01 EST -------
I've added some very basic unittests for the command line interfaces, which don't
require the applications to be installed.

test_Application_Commandlines.py - currently it only includes
Bio/Align/Applications but Bio/Emboss tests could be added.

Note that the _Mafft.py command line interface is currently broken due to the
restriction of only having a single instance of a parameter on the command line.
Mafft uses the following option: 

--seed alignment1 [--seed alignment2 --seed alignment3 ...]

We could remove support for this option in Mafft.

C.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Wed May 13 15:23:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 13 May 2009 11:23:34 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To: <bug-2815-42@http.bugzilla.open-bio.org/>
Message-ID: <200905131523.n4DFNYX7021233@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815


------- Comment #36 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-13 11:23 EST -------
(In reply to comment #35)
> 
> Note that the _Mafft.py command line interface is currently broken due to the
> restriction of only having a single instance of a parameter on the command line.
> Mafft uses the following option: 
> 
> --seed alignment1 [--seed alignment2 --seed alignment3 ...]
> 
> We could remove support for this option in Mafft.

Removing the --seed argument might be a pragmatic short term solution.

I'd considered this type of thing as a possible corner case - but hadn't
mentioned it as I didn't have a concrete example.  I would suggest setting the
parameter value to a list could work:

i.e. Support any of:

cline = MafftCommandline(seed=["alignment1", "alignment2", "alignment3"])
cline.set_parameter("seed", ["alignment1", "alignment2", "alignment3"])
cline.seed = ["alignment1", "alignment2", "alignment3"]

giving:

mafft --seed alignment1 --seed alignment2 --seed alignment3

We'd need to introduce a new _Option subclass for this.  A similar situation
applies to optional argument lists, like the Unix zip command:

zip zipfile file1 file2 file3 ...

where there is a single output filename (here zipfile), and then one or more
input files or filespecifiers (here three entries).
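
A minimal sketch of how a list-valued option could be rendered, assuming a
stand-alone toy class (RepeatedOption and build_command are hypothetical
names, not part of Bio.Application or an existing _Option subclass):

```python
class RepeatedOption(object):
    """Toy option (hypothetical) that repeats its switch once per value."""

    def __init__(self, switch):
        self.switch = switch
        self.values = []

    def __str__(self):
        # One "switch value" pair for each entry in the list.
        return " ".join("%s %s" % (self.switch, v) for v in self.values)


def build_command(program, options):
    """Join the program name with every option that has at least one value."""
    parts = [program] + [str(opt) for opt in options if opt.values]
    return " ".join(parts)


seed = RepeatedOption("--seed")
seed.values = ["alignment1", "alignment2", "alignment3"]
print(build_command("mafft", [seed]))
```

The same shape would cover the zip-style case if the switch were allowed to
be empty and the values were simply joined with spaces.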

Peter




From winda002 at student.otago.ac.nz  Thu May 14 04:53:42 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 14 May 2009 16:53:42 +1200
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
Message-ID: <4A0BA3D6.5070207@student.otago.ac.nz>


I have been slowly adding some of the scripts I use most commonly to the 
cookbook section of the wiki 
(http://biopython.org/wiki/Category:Cookbook). Since I'm very much a
dilettante at this programming business and the cookbook is meant as 
supplementary documentation for Biopython it's probably a good idea for 
someone that knows what they are doing to look at these things (Peter 
has been really helpful with this thus far, but it seems unfair to 
saddle one man with so much bad programming :)

I've just added a recipe that uses the nexus class to concatenate 
multiple nexus files and provide some feedback if the taxa are not the 
same in each one: http://biopython.org/wiki/Concatenate_nexus

Any thoughts? If you think you can make it clearer/quicker/better then 
you can edit it on the wiki or provide comments here or there.

Cheers,
David


From biopython at maubp.freeserve.co.uk  Thu May 14 09:27:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 10:27:12 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <4A0BA3D6.5070207@student.otago.ac.nz>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
Message-ID: <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>

On Thu, May 14, 2009 at 5:53 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
>
> I have been slowly adding some of the scripts I use most commonly to the
> cookbook section of the wiki (http://biopython.org/wiki/Category:Cookbook).
> Since I'm very much a dilettante at this programming business and the
> cookbook is meant as supplementary documentation for Biopython it's probably
> a good idea for someone that knows what they are doing to look at these
> things (Peter has been really helpful with this thus far, but it seems
> unfair to saddle one man with so much bad programming :)
>
> I've just added a recipe that uses the nexus class to concatenate multiple
> nexus files and provide some feedback if the taxa are not the same in each
> one: http://biopython.org/wiki/Concatenate_nexus
>
> Any thoughts? If you think you can make it clearer/quicker/better then you
> can edit it on the wiki or provide comments here or there.

What exactly are you trying to achieve?  A big Nexus file with lots
of alignments (and trees) in it?

When I talked to Frank about Nexus files, he said they should only
ever hold one alignment matrix, hence Bio.AlignIO does not allow
writing multiple alignments to a single Nexus file.  If you have some
real world examples of Nexus files holding more than one alignment
matrix, please share them - then we can try and get Bio.AlignIO (and
if need be Bio.Nexus) to cope with them directly!

Peter


From cy at cymon.org  Thu May 14 09:59:51 2009
From: cy at cymon.org (Cymon Cox)
Date: Thu, 14 May 2009 10:59:51 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>
Message-ID: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>

2009/5/14 Peter <biopython at maubp.freeserve.co.uk>

> On Thu, May 14, 2009 at 5:53 AM, David Winter
> <winda002 at student.otago.ac.nz> wrote:
> >
> > I have been slowly adding some of the scripts I use most commonly to the
> > cookbook section of the wiki (http://biopython.org/wiki/Category:Cookbook).
> > Since I'm very much a dilettante at this programming business and the
> > cookbook is meant as supplementary documentation for Biopython it's probably
> > a good idea for someone that knows what they are doing to look at these
> > things (Peter has been really helpful with this thus far, but it seems
> > unfair to saddle one man with so much bad programming :)
> >
> > I've just added a recipe that uses the nexus class to concatenate multiple
> > nexus files and provide some feedback if the taxa are not the same in each
> > one: http://biopython.org/wiki/Concatenate_nexus
> >
> > Any thoughts? If you think you can make it clearer/quicker/better then you
> > can edit it on the wiki or provide comments here or there.
>
> What exactly are you trying to achieve?  A big Nexus file with lots
> of alignments (and trees) in it?


The example David has given is very useful and a common procedure for
phylogeneticists. Single genes/proteins tend to be aligned in separate
alignment files and then concatenated into a so-called 'supermatrix'.

One thing I would question is the first line:

"It's a good idea, if possible, to make species-level phylogenetic
inferences bases on multiple genes because a) demographic processes can lead
gene-trees to diverge from species trees and b) journal editors now this."

Yes, it is a good idea to make inferences based upon the largest amount of
data, but if demographic processes have led to some gene(s) that have diverged
from the species tree, then this is a reason not to combine them.
Phylogenetic inference assumes all data evolved on the same tree - typically
one would analyse gene partitions individually to look for incongruence
among partitions before combining the data.


> When I talked to Frank about Nexus files, he said they should only
> ever hold one alignment matrix,


Well, that was my understanding as well. But, it may be wrong. I just tried
it - p4 will read both matrices no problem, PAUP* (the de facto standard
here) will execute both matrices ok presumably leaving just the last as the
data in memory.

I'll look into this further...

Cheers C.
-- 
____________________________________________________________________

Cymon J. Cox

Centro de Ciencias do Mar
Faculdade de Ciencias do Mar e Ambiente (FCMA)
Universidade do Algarve
Campus de Gambelas
8005-139 Faro
Portugal

Phone: +0351 289800909 ext 7909
Fax: +0351 289800051
Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com
HomePage : http://biology.duke.edu/bryology/cymon.html
-8.63/-6.77


From biopython at maubp.freeserve.co.uk  Thu May 14 11:02:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 12:02:03 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>
	<7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
Message-ID: <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com>

On Thu, May 14, 2009 at 10:59 AM, Cymon Cox <cy at cymon.org> wrote:
>> What exactly are you trying to achieve?  A big Nexus file with lots
>> of alignments (and trees) in it?
>
> The example David has given is very useful and a common procedure for
> phylogeneticists. Single gene/proteins tend to be aligned in separate
> alignment files and then concatenated into a so-called 'supermatrix'.

Oh right - I hadn't looked at David's example carefully enough earlier
to work out which concatenation he was doing (by row or by column).
It does make sense on re-reading.

Concatenation to give a single supermatrix (same number of taxa,
longer sequences) would be most elegantly done by sorting the three
alignments (so the taxa are in the same order) and then concatenating
them (by column).  See Bug 2552,
http://bugzilla.open-bio.org/show_bug.cgi?id=2552

Note that this procedure isn't specific to NEXUS files - you could do
this with any alignment format.  It is just fairly straight forward
with the Bio.Nexus module at the moment (at least, until we fix Bug
2552).
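
To make the "by column" concatenation concrete, here is a minimal sketch in
plain Python, assuming each alignment is just a dict mapping taxon name to
an aligned sequence string. The function name and the "?" padding for taxa
missing from a gene are illustrative assumptions; this is not the Bio.Nexus
or Bio.AlignIO code.

```python
def concatenate_by_column(alignments, missing="?"):
    """Join alignments side by side, returning (supermatrix, partitions).

    alignments: list of dicts {taxon: aligned sequence string}.
    partitions: 1-based inclusive (start, end) columns for each input.
    """
    # Use every taxon seen in any alignment, in a fixed (sorted) order.
    taxa = sorted(set().union(*[set(aln) for aln in alignments]))
    supermatrix = dict((taxon, "") for taxon in taxa)
    partitions = []
    start = 1
    for aln in alignments:
        lengths = set(len(seq) for seq in aln.values())
        if len(lengths) != 1:
            raise ValueError("Sequences in an alignment must be the same length")
        length = lengths.pop()
        for taxon in taxa:
            # Pad taxa absent from this gene so all rows stay the same length.
            supermatrix[taxon] += aln.get(taxon, missing * length)
        partitions.append((start, start + length - 1))
        start += length
    return supermatrix, partitions


gene1 = {"A": "ACGT", "B": "ACGA"}
gene2 = {"A": "TT", "B": "TC"}
matrix, parts = concatenate_by_column([gene1, gene2])
print(matrix)
print(parts)
```

The partition ranges are the sort of thing that would be written out as
character partitions in the combined file.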

Peter


From biopython at maubp.freeserve.co.uk  Thu May 14 11:11:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 12:11:30 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>
	<7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
	<320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com>
Message-ID: <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com>

On Thu, May 14, 2009 at 12:02 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Oh right - I hadn't looked at David's example carefully enough earlier
> to work out which concatenation he was doing (by row or by column).
> It does make sense on re-reading.

I'd rephrase this bit of the intro:

<start>
It's a good idea, if possible, to make species-level phylogenetic
inferences bases on multiple genes because a) demographic processes
can lead gene-trees to diverge from species trees and b) journal
editors now this. Most of the alignment files supported by Biopython
allow you to write multiple alignments to the same file which makes
this easy. However, the nexus file format (used by PAUP* and Mr Bayes)
does not. In nexus files multiple alignments need to be represented as
different 'character partitions' within a data matrix that contains
one long sequence for each taxon.
<end>

Bio.AlignIO will in general write out one or more alignments to a
file.  It does NOT do any concatenation by column, required to give
the "supermatrix" which you want (which is why I got confused on the
first reading).  How about:

<start>
It's a good idea, if possible, to make species-level phylogenetic
inferences based on multiple genes because (a) demographic processes
can lead gene-trees to diverge from species trees and (b) journal
editors know this.  [add stuff from Cymon's comment here?]

This is usually handled by creating a single "supermatrix" from
separate alignments for each gene.  i.e. You need a single alignment
containing one row for each taxon where the rows are the concatenated
pre-aligned sequences.  In NEXUS files (used by PAUP* and Mr Bayes)
multiple alignments can be explicitly represented as different
'character partitions' within a data matrix that contains one long
sequence for each taxon.  The Bio.Nexus module makes this relatively
straight forward.
<end>

Peter


From cy at cymon.org  Thu May 14 11:30:20 2009
From: cy at cymon.org (Cymon Cox)
Date: Thu, 14 May 2009 12:30:20 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com> 
	<7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
Message-ID: <7265d4f0905140430j47b0a661jd58dbe5749e4a1f7@mail.gmail.com>

2009/5/14 Cymon Cox <cy at cymon.org>

> 2009/5/14 Peter <biopython at maubp.freeserve.co.uk>
>
>> When I talked to Frank about Nexus files, he said they should only
>> ever hold one alignment matrix,
>
>
> Well, that was my understanding as well. But, it may be wrong. I just tried
> it - p4 will read both matrices no problem, PAUP* (the de facto standard
> here) will execute both matrices ok presumably leaving just the last as the
> data in memory.
>
> I'll look into this further...
>

After a quick scan of the spec, there appears to be only one oblique
reference to this issue:

"Although the NEXUS standard does not impose constraints on the number of
blocks, particular programs will. For example, MacClade 3.07 does not allow
more than one TAXA block in a file."

So I read that to mean, you can have any number of similarly named blocks in
a NEXUS file, i.e. multiple DATA, TAXA, CHARACTERS, TREES etc, and it's up to
an individual application to decide how to deal with them.

This seems to be in practice what happens: PAUP* will read multiple blocks
of the same name but only the last block of a particular name will remain in
memory after the file has been parsed. On the other hand, P4 will read
multiple DATA blocks and store the different alignments as separate objects,
and read multiple TREES blocks and store all the trees.
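
For checking whether a given file actually has repeated blocks, a quick
sketch along these lines might help. It is a simplification, not a NEXUS
parser: it only looks for lines starting with "begin NAME;" and ignores
bracketed comments, and count_nexus_blocks is a name made up for this
example.

```python
import re
from collections import Counter


def count_nexus_blocks(text):
    """Count block names from lines of the form 'begin NAME;' (any case)."""
    names = re.findall(r"(?im)^\s*begin\s+(\w+)\s*;", text)
    return Counter(name.upper() for name in names)


sample = (
    "#NEXUS\n"
    "begin data;\nend;\n"
    "BEGIN DATA;\nend;\n"
    "begin trees;\nend;\n"
)
for name, count in sorted(count_nexus_blocks(sample).items()):
    print("%s: %i" % (name, count))
```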

C.


From biopython at maubp.freeserve.co.uk  Thu May 14 18:20:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 19:20:47 +0100
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description field
	in BioSQL
Message-ID: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>

Hi,

This is cross-posted between biopython-dev and biosql-l as it regards
parsing the description (DE) lines in SwissProt files and how they are
stored in BioSQL.  This follows from an earlier discussion on
biopython-dev.

Older SwissProt files just had one or two DE lines, and it made sense
to treat this as a simple string mapped onto the description field in
the bioentry table in BioSQL.  This appears to be what happens with
BioPerl 1.5.x and in Biopython (although the details regarding white
space differ).  However, newer SwissProt files have many DE lines with
additional structure.  The example Michiel gave earlier on the
biopython-dev list was:

http://www.uniprot.org/uniprot/Q9XHP0.txt

This has the following DE lines:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

I had to fight with perl to get my old copy of BioPerl working again
(some weak reference thing), but I managed, and then loaded this file
into my test BioSQL database with:

$ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass
XXX --namespace biosql_test --format swiss Q9XHP0.txt

Then I looked at the resulting description in the main bioentry table:

$ mysql --user=root -p biosql_test -e 'SELECT description FROM
bioentry WHERE accession="Q9XHP0";'

This is stored as one huge long string (without the newlines, I'm not
sure if BioPerl strips those in parsing the file, or when loading it
into the database):

RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S
globulin seed storage protein II; AltName: Full=Alpha-globulin;
Contains: RecName: Full=11S globulin seed storage protein 2 acidic
chain; AltName: Full=11S globulin seed storage protein II acidic
chain; Contains: RecName: Full=11S globulin seed storage protein 2
basic chain; AltName: Full=11S globulin seed storage protein II basic
chain; Flags: Precursor;

For Biopython, I emptied the database then did:

>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd = "XXX", host = "localhost", db="biosql_test")
>>> db = server["biosql-test"] #namespace
>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
1
>>> server.commit()

As before, I looked in the table with mysql.  Again - this stores the
full description from the DE line, although with the newlines
embedded.  So, Biopython is consistent with my old copy of BioPerl
(1.5.x) if we ignore the white space.

However, how does this look in BioPerl 1.6?  If this is the same, are
there any plans to change this?  For Biopython we have discussed
recording most of the DE information under the annotations instead
(keyed off RecName, AltName, Contains, Flags), but I would like to be
consistent with BioPerl+BioSQL.
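
To show what "keyed off RecName, AltName, Contains, Flags" might look like,
here is a minimal sketch that groups the DE lines above by indentation. It
is an assumption-laden illustration, not the Bio.SwissProt parser:
parse_de_lines is a hypothetical name, and continuation lines, "Includes:"
sections and Short= names are not handled.

```python
def parse_de_lines(lines):
    """Return a list of dicts: entry-level names first, then one per 'Contains:'."""
    sections = [{}]
    for line in lines:
        body = line[2:]  # drop the leading "DE" line code
        text = body.strip()
        if not text:
            continue
        indent = len(body) - len(body.lstrip())
        if text == "Contains:":
            # Start a new sub-section for the contained chain.
            sections.append({})
            continue
        # Deeper-indented lines belong to the most recent "Contains:" section,
        # anything else (including the final Flags line) to the entry itself.
        target = sections[-1] if indent > 3 else sections[0]
        key, _, value = text.partition(":")
        target.setdefault(key, []).append(value.strip().rstrip(";"))
    return sections


de_lines = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]
for section in parse_de_lines(de_lines):
    print(section)
```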

Thanks

Peter


From winda002 at student.otago.ac.nz  Thu May 14 22:39:34 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Fri, 15 May 2009 10:39:34 +1200
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com>
References: <4A0BA3D6.5070207@student.otago.ac.nz>	
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>	
	<7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>	
	<320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com>
	<320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com>
Message-ID: <4A0C9DA6.9060403@student.otago.ac.nz>

Peter wrote:
> On Thu, May 14, 2009 at 12:02 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>   
>> Oh right - I hadn't looked at David's example carefully enough earlier
>> to work out which concatenation he was doing (by row or by column).
>> It does make sense on re-reading.
>>     
Well, just about ;)
>
> I'd rephrase this bit of the intro:
>   
Yep, that's much better. Thanks Peter and Cymon for your feedback on 
this, I've updated the intro to include it and a couple of specific 
examples of how you'd use the character partitions.

(Have you guys seen  this:  
doi.wiley.com/10.1111/j.1755-0998.2008.02164.x , you could write a paper 
from one function in your nexus module!)

cheers,
david


From biopython at maubp.freeserve.co.uk  Fri May 15 09:05:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 15 May 2009 10:05:59 +0100
Subject: [Biopython-dev] Cookbook entry, concatenating nexus files
In-Reply-To: <4A0C9DA6.9060403@student.otago.ac.nz>
References: <4A0BA3D6.5070207@student.otago.ac.nz>
	<320fb6e00905140227t1b17e65dqfc9a042905692b4@mail.gmail.com>
	<7265d4f0905140259x620432e1t33a868ed097e763e@mail.gmail.com>
	<320fb6e00905140402i5e7d9343p60162f697692b427@mail.gmail.com>
	<320fb6e00905140411med6a548g5f687a53881e24aa@mail.gmail.com>
	<4A0C9DA6.9060403@student.otago.ac.nz>
Message-ID: <320fb6e00905150205k31d95c84naac1fa7873461263@mail.gmail.com>

On Thu, May 14, 2009 at 11:39 PM, David Winter
<winda002 at student.otago.ac.nz> wrote:
>>
>> I'd rephrase this bit of the intro:
>>
>
> Yep, that's much better. Thanks Peter and Cymon for your feedback on this,
> I've updated the intro to include it and a couple of specific examples of
> how you'd use the character partitions.

That does look much clearer now :) Could you include the three original
alignments in the text?  It would help to let the reader see what is going
on (and could be used to reproduce the example).

> (Have you guys seen this: doi.wiley.com/10.1111/j.1755-0998.2008.02164.x ,
> you could write a paper from one function in your nexus module!)

From the abstract that does sound pretty trivial, but I guess that tool would be
useful for non-programmers - even if you could probably rewrite it as one
short python script using Biopython (or indeed a Perl script using BioPerl etc).

Peter


From bugzilla-daemon at portal.open-bio.org  Sat May 16 00:24:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 15 May 2009 20:24:29 -0400
Subject: [Biopython-dev] [Bug 2829] New: Biosequence.alphabet can be set to
	unknown after loading a nucleotide SeqRecord
Message-ID: <bug-2829-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2829

           Summary: Biosequence.alphabet can be set to unknown after loading
                    a nucleotide SeqRecord
           Product: Biopython
           Version: 1.49
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: david.wyllie at ndm.ox.ac.uk


Hi

I have done the following
1 loaded a small nucleotide fasta file with SeqIO, setting the alphabet
successfully 
2 written it to a test database with BioSQL
3 reloaded it, at which point the reloaded object has a "SingleLetterAlphabet"
alphabet and biosequence.alphabet is set to unknown.

Is this expected?

The overall object was to add some SeqFeatures to the loaded SeqRecord, but it
doesn't seem to store correctly even without any manipulations.

Below demonstrates the problem. The system is Ubuntu 9 x64/ Python 2.6/
Biopython 1.49.

#!/usr/bin/env python

from BioSQL import BioSeqDatabase
from Bio.Alphabet import generic_nucleotide
from Bio import SeqIO
from Bio import Seq

# define variables needed for testing
username="myusername"
password="mypassword"
hostname="localhost"

# we are going to try to load a nucleotide fasta file into a BioSQL database
# need a test file, with inputfile the file name;
#>test_sequence
#ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcggtcgtctccgaactt
inputfile="/home/dwyllie/test.faa"

# we want to create a new BioSQL database, called test
dbname="test"
dbdescription="test of alphabet storage"

# we also want to remove one if it exists, for the purposes of testing
server = BioSeqDatabase.open_database(driver="MySQLdb", db="bioseqdb",
user=username, passwd=password, host=hostname)   
# if the database doesn't exist, we get an error, so we trap for that
try:
        server.remove_database(dbname)
        server.adaptor.commit()
except KeyError:
        print "Attempt to remove ", dbname, " failed; going on to create a new one"

server = BioSeqDatabase.open_database(driver="MySQLdb", db="bioseqdb",
user=username, passwd=password, host=hostname)   
db = server.new_database(dbname, description=dbdescription)
server.adaptor.commit()

# set up a list to hold the mycobacterial sequences
selectedrecords = [] # Setup an empty list which we'll later write

# ifh is the input file handle;
ifh = open(inputfile, "rU")

# set a counter
recordsread=0

for record in SeqIO.parse(ifh, "fasta", generic_nucleotide):

        # increment counter
        recordsread=recordsread+1

        # just so we can reload it easily, we'll assign an id to this record
        # however, the problem does not depend on this,
        # nor on the nature of the defline, as far as I can tell
        record.id="IDENTIFIER_"+str(recordsread)

        print "** Note the sequence type of the Seq ** "
        print record

        # note that to this point it does appear to work, and the alphabet
        # is correct.
        selectedrecords.append(record)

print inputfile, "total found ", recordsread
ifh.close()

# write it to the bioSQL database
print "Writing sequences to database"
db.load(selectedrecords)
server.adaptor.commit()

# subsequent attempts to write the re-loaded object fail because no alphabet
# is defined
print "However, the alphabet hasn't been stored."
loadedrecord=db.lookup(gi="IDENTIFIER_1")
print "Displaying re-loaded record"
print loadedrecord

# this can be confirmed by running
sqlcmd="""
select * from 
bioseqdb.biosequence,
bioseqdb.bioentry, 
bioseqdb.biodatabase
where 
biodatabase.biodatabase_id= bioentry.biodatabase_id  and
biosequence.bioentry_id=bioentry.bioentry_id and
biodatabase.name="test"

"""

print "This can be confirmed by examining bioseqdb.biosequence.alphabet, which is set to unknown; ", sqlcmd


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Sat May 16 11:37:52 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 16 May 2009 07:37:52 -0400
Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic
	nucleotide alphabet
In-Reply-To: <bug-2829-42@http.bugzilla.open-bio.org/>
Message-ID: <200905161137.n4GBbqKe018688@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2829


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
            Summary|Biosequence.alphabet can be |BioSQL does not record a
                   |set to unknown after loading|generic nucleotide alphabet
                   |a nucleotide SeqRecord      |


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-16 07:37 EST -------
Biopython has a relatively rich range of alphabets, including IUPAC ambiguous
and unambiguous alphabets, plus ways to indicate gap characters and stop
symbols.  The BioSQL range is much simpler, so some information is inevitably
lost.

In BioSQL, all we store is a simple string, "dna", "rna", "protein" or
"unknown" (although BioJava used uppercase, so that is effectively allowed
too). See:
http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet

This means if your sequence was using "IUPAC extended protein with a * stop
codon", all we can record is "protein". i.e. On retrieval from a BioSQL
database, the alphabet is simply a generic protein.  Likewise "ambiguous IUPAC
DNA with minus as the gap character" just becomes generic DNA.

Note that as far as I know, currently none of the Bio* languages attempt to
record "nucleotide" (i.e. "dna" or "rna").  This is something we should discuss
on the BioSQL mailing list as a possible enhancement.

So in answer to your question "Is this expected?", yes, a generic nucleotide
alphabet isn't "dna", "rna" or "protein" so is currently recorded in the BioSQL
database as "unknown".  This gets turned into the SingleLetterAlphabet on
retrieval.
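
The lossy round trip described above can be sketched roughly as follows.
This is an illustration only, not the actual BioSQL loader code: plain
strings stand in for Biopython's alphabet objects, and the two function
names are made up for this example.

```python
# Sketch only: plain strings stand in for Biopython alphabet objects.
# to_biosql_alphabet / from_biosql_alphabet are hypothetical names, not
# functions from Biopython or BioSQL.

BIOSQL_ALPHABETS = ("dna", "rna", "protein")


def to_biosql_alphabet(molecule):
    """Reduce a molecule type to the string stored in biosequence.alphabet.

    Anything that is not exactly DNA, RNA or protein (for example a
    generic "nucleotide") can only be recorded as "unknown".
    """
    if molecule in BIOSQL_ALPHABETS:
        return molecule
    return "unknown"


def from_biosql_alphabet(value):
    """Map the stored string back to a generic alphabet description."""
    generic = {
        "dna": "generic DNA",
        "rna": "generic RNA",
        "protein": "generic protein",
    }
    # BioJava used uppercase, so compare case-insensitively.
    return generic.get(value.lower(), "single letter alphabet")


stored = to_biosql_alphabet("nucleotide")
reloaded = from_biosql_alphabet(stored)
```

Here a generic nucleotide goes in as "unknown" and comes back as the
single letter alphabet, which matches what the bug report observed.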

Changing title to "BioSQL does not record a generic nucleotide alphabet" and
marking this as an enhancement.

Peter

P.S. Are you just testing here, or do you really not know if your sequence is
DNA or RNA?




From bugzilla-daemon at portal.open-bio.org  Sat May 16 11:54:11 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 16 May 2009 07:54:11 -0400
Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic
	nucleotide alphabet
In-Reply-To: <bug-2829-42@http.bugzilla.open-bio.org/>
Message-ID: <200905161154.n4GBsBWZ019474@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2829


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-16 07:54 EST -------
See:
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001515.html




From bartek at rezolwenta.eu.org  Sat May 16 17:39:18 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Sat, 16 May 2009 19:39:18 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
	<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
	<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
	<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
	<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
	<8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
	<320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
	<8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>
	<320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com>
Message-ID: <8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com>

On Tue, May 12, 2009 at 8:57 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> I'm not happy with the current github repository due to the history
> tag issue - but we know we can fix that now.  Are you going to try
> removing the old tags and re-doing them on github?

I've finally found some time for it and fixed the tags in the main repository.
I was able to run the update and it ran OK. I was also able to clone the repo
from the official branch and see that the tags are OK in gitx. If anyone has
problems with the tags, please let me know.

>
> Does anyone know how the git provided "ViewCVS" equivalent shows tags
> in a file's history?

If you are talking about gitweb, you can see it (for example: Makefile
for linux 2.6.17) here:

http://git.kernel.org/?p=linux/kernel/git/chrisw/linux-2.6.17.y.git;a=history;f=Makefile;h=79072d86297e78406791f0fc5764c35eb04fd07d;hb=78ace17e51d4968ed2355e8f708d233d1cc37f6d

I've also installed gitweb on a copy of biopython repo on my server
(not a permanent URL, not updated from trunk)
http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=tree;hb=HEAD

It shows the tags, but (as usual with git), the tags are only shown
for the files which were affected by the particular commit marked with
the tag. So this behavior is consistent with kernel.org and github.


cheers
Bartek


From biopython at maubp.freeserve.co.uk  Sat May 16 20:35:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 21:35:36 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
	<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
	<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
	<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
	<8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
	<320fb6e00905121014r40ab9eb5q525f41344c77c650@mail.gmail.com>
	<8b34ec180905121123w6fe553ahd416b9e1d317dd11@mail.gmail.com>
	<320fb6e00905121157i2068b81fw5a8b21e65c33abe2@mail.gmail.com>
	<8b34ec180905161039u529a7093p978aa8d61970ca51@mail.gmail.com>
Message-ID: <320fb6e00905161335i28be05fay848dc18f86e728cf@mail.gmail.com>

On 5/16/09, Bartek Wilczynski <bartek at rezolwenta.eu.org> wrote:
> On Tue, May 12, 2009 at 8:57 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>  >
>  > I'm not happy with the current github repository due to the history
>  > tag issue - but we know we can fix that now.  Are you going to try
>  > removing the old tags and re-doing them on github?
>
>  I've finally found some time for it and fixed the tags in the main repository.

Great :)

>  I was able to run the update and it ran ok,  I was also able to clone the repo
>  from the official branch and see that they are OK in gitx. If anyone
>  has problems with the tags, please let me know.

I'll check with my Mac on Monday.

>  > Does anyone know how the git provided "ViewCVS" equivalent shows
>  > tags in a file's history?
>
> If you are talking about gitweb, you can see it (for example: Makefile
> for linux 2.6.17) here:
>
>  http://git.kernel.org/?p=linux/kernel/git/chrisw/linux-2.6.17.y.git;a=history;f=Makefile;h=79072d86297e78406791f0fc5764c35eb04fd07d;hb=78ace17e51d4968ed2355e8f708d233d1cc37f6d
>
>  I've also installed gitweb on a copy of biopython repo on my server
>  (not a permanent URL, not updated from trunk)
>  http://83.243.39.60/cgi-bin/gitweb.cgi?p=biopython.git;a=tree;hb=HEAD
>
>  It shows the tags, but (as usual with git), the tags are only shown
>  for the files which were affected by the particular commit marked with
>  the tag. So this behavior is consistent with kernel.org and github.

Thanks for those examples.

I see what you mean, looking at Bio/Blast/NCBIXML.py in gitweb for
example, no tags show up at all.  On the other hand, for the NEWS
file, some tags show up.  Basically for what I want to use the tags
for (identifying changes to a single file between two releases),
gitweb doesn't work.  Nor does github's history. This is a shame.

I think the reason CVS (or SVN) seems to work better in this regard is
that, like Python, they care about individual files, while git works in
terms of changes (which may affect multiple files).

I'll see how I get on with the command line or graphical git history
viewers and get back to you...

Cheers,

Peter


From hlapp at gmx.net  Sat May 16 22:34:57 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 18:34:57 -0400
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
Message-ID: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>

Don't you love SwissProt (or UniProt as we must call it now I  
suppose). They (understandably) try to squeeze ever more annotation  
into the existing tags, rather than adding new tags.

So, of the following structure:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

really only the first line, with the 'RecName: Full=' removed, is the  
description line as we know it. The rest, I would say, is annotation,  
such as two alternative names, amino acid chains contained in the full  
record (shouldn't this be feature annotation, really? and indeed it is  
- why it needs to be repeated here is beyond me) and their names as  
well as alternative names, and the fact that the sequence is a  
precursor form.

Leaving all this in one string has the advantage that we can round- 
trip it (and there is probably hardly any other way to accomplish  
that), but clearly in terms of semantics this isn't the sequence  
description as we know it anymore.

Does anyone else think too that completely changing the semantics of  
sequence annotation fields is a bad idea? <sigh/>

My inclination from a BioPerl perspective is to extract the part  
following 'RecName: Full=' as the description, and attach the rest as  
annotation. We could in fact use the TagTree class for this. I'm cross- 
posting to BioPerl too to gather what other BioPerl'ers think about  
this.

	-hilmar

On May 14, 2009, at 2:20 PM, Peter wrote:

> Hi,
>
> This is cross-posted between biopython-dev and biosql-l as it regards
> parsing the description (DE) lines in SwissProt files and how they are
> stored in BioSQL.  This follows from an earlier discussion on
> biopython-dev
>
> Older SwissProt files just had one or two DE lines, and it made sense
> to treat this as a simple string mapped onto the description field in
> the bioentry table in BioSQL.  This appears to be what happens with
> BioPerl 1.5.x and in Biopython (although the details regarding white
> space differ).  However, newer SwissProt files have many DE lines with
> additional structure.  The example Michiel gave earlier on the
> biopython-dev list was:
>
> http://www.uniprot.org/uniprot/Q9XHP0.txt
>
> This has the following DE lines:
>
> DE   RecName: Full=11S globulin seed storage protein 2;
> DE   AltName: Full=11S globulin seed storage protein II;
> DE   AltName: Full=Alpha-globulin;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE     AltName: Full=11S globulin seed storage protein II acidic chain;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE     AltName: Full=11S globulin seed storage protein II basic chain;
> DE   Flags: Precursor;
>
> I had to fight with perl to get my old copy of BioPerl working again
> (some weak reference thing), but I managed, and then loaded this file
> into my test BioSQL database with:
>
> $ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass
> XXX --namespace biosql_test --format swiss Q9XHP0.txt
>
> Then I looked at the resulting description in the main bioentry table:
>
> $ mysql --user=root -p biosql_test -e 'SELECT description FROM
> bioentry WHERE accession="Q9XHP0";'
>
> This is stored as one huge long string (without the newlines, I'm not
> sure if BioPerl strips those in parsing the file, or when loading it
> into the database):
>
> RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S
> globulin seed storage protein II; AltName: Full=Alpha-globulin;
> Contains: RecName: Full=11S globulin seed storage protein 2 acidic
> chain; AltName: Full=11S globulin seed storage protein II acidic
> chain; Contains: RecName: Full=11S globulin seed storage protein 2
> basic chain; AltName: Full=11S globulin seed storage protein II basic
> chain; Flags: Precursor;
>
> For Biopython, I emptied the database then did:
>
>>>> from Bio import SeqIO
>>>> from BioSQL import BioSeqDatabase
>>>> server = BioSeqDatabase.open_database(driver="MySQLdb",  
>>>> user="root", passwd = "XXX", host = "localhost", db="biosql_test")
>>>> db = server["biosql-test"] #namespace
>>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
> 1
>>>> server.commit()
>
> As before, I looked in the table with mysql.  Again - this stores the
> full description from the DE line, although with the newlines
> embedded.  So, Biopython is consistent with my old copy of BioPerl
> (1.5.x) if we ignore the white space.
>
> However, how does this look in BioPerl 1.6?  If this is the same, are
> there any plans to change this?  For Biopython we have discussed
> recording most of the DE information under the annotations instead
> (keyed off RecName, AltName, Contains, Flags), but I would like to be
> consistent with BioPerl+BioSQL.
>
> Thanks
>
> Peter
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sat May 16 23:14:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:14:54 +0100
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
Message-ID: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>

On 5/16/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> Don't you love SwissProt (or UniProt as we must call it now I suppose).
> They (understandably) try to squeeze ever more annotation into the existing
> tags, rather than adding new tags.
>
>  So, of the following structure:
>
>  DE   RecName: Full=11S globulin seed storage protein 2;
>  DE   AltName: Full=11S globulin seed storage protein II;
>  DE   AltName: Full=Alpha-globulin;
>  DE   Contains:
>  DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
>  DE     AltName: Full=11S globulin seed storage protein II acidic chain;
>  DE   Contains:
>  DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
>  DE     AltName: Full=11S globulin seed storage protein II basic chain;
>  DE   Flags: Precursor;
>
>  really only the first line, with the 'RecName: Full=' removed, is the
> description line as we know it. The rest, I would say, is annotation, such
> as two alternative names, amino acid chains contained in the full record
> (shouldn't this be feature annotation, really? and indeed it is - why it
> needs to be repeated here is beyond me) and their names as well as
> alternative names, and the fact that the sequence is a precursor form.
>
>  Leaving all this in one string has the advantage that we can round-trip it
> (and there is probably hardly any other way to accomplish that), but clearly
> in terms of semantics this isn't the sequence description as we know it
> anymore.
>
>  Does anyone else think too that completely changing the semantics of
> sequence annotation fields is a bad idea? <sigh/>

+1
That's pretty much what I thought on seeing this the first time.

>  My inclination from a BioPerl perspective is to extract the part following
> 'RecName: Full=' as the description, and attach the rest as annotation. We
> could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> too to gather what other BioPerl'ers think about this.

Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x just
treats the DE lines as one big long string?

Could you translate your idea about the TagTree class into something
concrete with BioSQL tables and fields for me? I'm not familiar with
the TagTree (or Perl).

Over on the Biopython list we'd talked about storing this annotation in
a nested structure.  However, in order to use the BioSQL annotations
mechanisms, I think a simple flat structure is required :(

Peter


From biopython at maubp.freeserve.co.uk  Sat May 16 23:28:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:28:43 +0100
Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and
	bioentry.description field in BioSQL
In-Reply-To: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
Message-ID: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>

On 5/17/09, Chris Fields <cjfields at illinois.edu> wrote:
>
> On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote:
> > My inclination from a BioPerl perspective is to extract the part following
> > 'RecName: Full=' as the description, and attach the rest as annotation. We
> > could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> > too to gather what other BioPerl'ers think about this.
> >
> >        -hilmar
> >
>
> This is much like the GN issues we've run into before, and we *could* set
> this up using TagTree or similar.  In the latter case of gene name the data
> is stored in a text tree as follows:
>
>  gene_names:
>   gene_name:
>     Name: GC1QBP
>     Synonyms: HABP1
>     Synonyms: SF2P32
>     Synonyms: C1QBP
>
>  That could be changed to an XML string:
>
>  <?xml version="1.0" encoding="UTF-8"?>
>  <gene_names>
>   <gene_name>
>     <Name>GC1QBP</Name>
>     <Synonyms>HABP1</Synonyms>
>     <Synonyms>SF2P32</Synonyms>
>     <Synonyms>C1QBP</Synonyms>
>   </gene_name>
>  </gene_names>
>
> Thinking about this we should attempt to coalesce around a standard instead
> of forcing the other Bio*  to a specific format.

How would you record this in BioSQL?  As an XML string for an annotation value?

Brad has suggested JSON might be useful for this kind of thing (see
also per-letter-annotation discussion).

Peter


From hlapp at gmx.net  Sat May 16 23:37:14 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 19:37:14 -0400
Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and
	bioentry.description field in BioSQL
In-Reply-To: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
Message-ID: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>


On May 16, 2009, at 7:28 PM, Peter wrote:

>> That could be changed to an XML string:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <gene_names>
>>  <gene_name>
>>    <Name>GC1QBP</Name>
>>    <Synonyms>HABP1</Synonyms>
>>    <Synonyms>SF2P32</Synonyms>
>>    <Synonyms>C1QBP</Synonyms>
>>  </gene_name>
>> </gene_names>
>>
>> Thinking about this we should attempt to coalesce around a standard
>> instead of forcing the other Bio*  to a specific format.
>
> How would you record this in BioSQL?  As an XML string for an  
> annotation value?

Yes. A TagTree object can be serialized to XML, and the XML can be  
stored as the annotation value in BioSQL. As the XML can be read back  
in, it allows full round-tripping.
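
As a rough illustration of that round trip (in Python rather than Perl,
and not using the real TagTree class), a nested structure can be
serialized to an XML string and read back. The (tag, children) tuples
and the helper names are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Sketch only: nested (tag, children) tuples stand in for a TagTree-like
# object. tree_to_xml and xml_to_tree are hypothetical helper names.


def tree_to_xml(tag, children):
    """Build an XML element from a tag and either a string or a child list."""
    element = ET.Element(tag)
    if isinstance(children, str):
        element.text = children
    else:
        for child_tag, grandchildren in children:
            element.append(tree_to_xml(child_tag, grandchildren))
    return element


def xml_to_tree(element):
    """Rebuild the nested (tag, children) structure from an XML element."""
    if len(element) == 0:
        return (element.tag, element.text or "")
    return (element.tag, [xml_to_tree(child) for child in element])


gene_names = ("gene_names", [
    ("gene_name", [
        ("Name", "GC1QBP"),
        ("Synonyms", "HABP1"),
        ("Synonyms", "SF2P32"),
        ("Synonyms", "C1QBP"),
    ]),
])

# The string that would be stored as a single annotation value.
xml_string = ET.tostring(tree_to_xml(*gene_names), encoding="unicode")
# Reading it back gives the same nested structure.
restored = xml_to_tree(ET.fromstring(xml_string))
```

The resulting string is what would be stored as the annotation value;
parsing it again recovers the same nesting and element order.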

> Brad has suggested JSON might be useful for this kind of thing (see
> also per-letter-annotation discussion).

JSON could be another serialization format, but XML is equally or  
better supported in all languages except JavaScript. Furthermore, you  
could just send the XML to the browser and have an XSLT (either  
directly, or indirectly through JavaScript doing the transformation)  
do the rendering.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sat May 16 23:42:17 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 19:42:17 -0400
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>
Message-ID: <8CD4EED1-A689-447F-8F6E-8D2204DD4E86@gmx.net>


On May 16, 2009, at 7:14 PM, Peter wrote:

> Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x
> just treats the DE lines as one big long string?

Yes.

> Could you translate your idea about the TagTree class into something
> concrete with BioSQL tables and fields for me? [...] Over on the  
> Biopython list we'd talked about storing this annotation in a nested
> structure.

That's more or less what TagTree is.

>  However, in order to use the BioSQL annotations mechanisms, I think  
> a simple flat structure is required :(

Not necessarily. If you have a flat serialization (such as XML) the  
nested structure isn't needed. Of course that's not a fully normalized  
relational representation, but if you had one, how often would it be  
used, how efficient would those queries be (SQL is poor at nested or  
recursive data structures), and how much pain would it be to write the  
object-relational mappings?

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sun May 17 12:40:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 13:40:47 +0100
Subject: [Biopython-dev] [BioSQL-l] SwissProt DE lines and
	bioentry.description field in BioSQL
In-Reply-To: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
Message-ID: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>

On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  On May 16, 2009, at 7:28 PM, Peter wrote:
> > > That could be changed to an XML string:
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <gene_names>
> > >  <gene_name>
> > >   <Name>GC1QBP</Name>
> > >   <Synonyms>HABP1</Synonyms>
> > >   <Synonyms>SF2P32</Synonyms>
> > >   <Synonyms>C1QBP</Synonyms>
> > >  </gene_name>
> > > </gene_names>
> > >
> > > Thinking about this we should attempt to coalesce around a standard
> > > instead of forcing the other Bio*  to a specific format.

Absolutely - some common standard should be agreed.

Would you envision doing this for other structured fields, inventing a
new mini XML format each time?  That seems open ended and likely to
cause a lot of work keeping all the Bio* projects synchronised.

Here you have mapped RecName and AltName fields in the DE lines to
Name and Synonyms (shouldn't that be Synonym singular?).  I also don't
get why you have used a gene_name entry inside a gene_names list.
Would you hold the contains information and the flags information from
the DE lines in separate XML entries?

I would have gone for something much closer to the original DE line
markup i.e. using the field names UniProt use, RecName and AltName,
rather than mapping these to Name and Synonym.

> > How would you record this in BioSQL?  As an XML string for an annotation
> > value?
>
> Yes. A TagTree object can be serialized to XML, and the XML can be stored
> as the annotation value in BioSQL. As the XML can be read back in, it allows
> full round-tripping.

Assuming you stored all the DE markup, then yes, a round trip back to
the SwissProt file could be possible.  And, depending on the details
of the XML structure used, it would be possible to represent this in a
python structure too.

> > Brad has suggested JSON might be useful for this kind of thing (see
> > also per-letter-annotation discussion).
>
> JSON could be another serialization format, but XML is equally or better
> supported in all languages except JavaScript. Furthermore, you could just
> send the XML to the browser and have an XSLT (either directly, or indirectly
> through JavaScript doing the transformation) do the rendering.

I have no strong preference for either XML or JSON (but would rather
avoid them if they are not really needed).  For other types of
annotation there may be a clearer advantage for one over the other,
e.g. per letter annotation like the secondary structure of a protein
sequence, or the quality scores of a nucleotide contig.

On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
> Not necessarily. If you have a flat serialization (such as XML) the nested
> structure isn't needed. Of course that's not a fully normalized relational
> representation, but if you had one, how often would it be used, how
> efficient would those queries be (SQL is poor at nested or recursive data
> structures), and how much pain would it be to write the object-relational
> mappings?

In this example, searching the database using one of the SwissProt
AltNames (synonyms), or filtering on the Flags sounds like a
reasonable request - but this would be very difficult if the data is
stored inside XML strings.

Of course, because the RecName and AltName entries are top level, we
could just record them as normal - simple strings in the annotations
table.  This seems much nicer.  Likewise the "Flags: Precursor;" line.
 i.e. listing the tag/value pairs which could be used in the
bioentry_qualifier_value table:

AltName = "Full=11S globulin seed storage protein II"
AltName = "Full=Alpha-globulin"
Flags = "Precursor"

(the RecName field, "Full=11S globulin seed storage protein 2", could
be used for the bioentry.description instead)

The above are all pretty easy.  We only need to consider nesting (or
something like XML or JSON) for some of the DE information, in the
example discussed the Contains lines.  Even this could be even be done
by storing each contains entry as a single long string (holding both
the name and synonyms) directly from the DE line itself, something
like this:

Contains = "RecName: Full=11S globulin seed storage protein 2 acidic
chain;\nAltName: Full=11S globulin seed storage protein II acidic
chain;"
Contains = "RecName: Full=11S globulin seed storage protein 2 basic
chain;\nAltName: Full=11S globulin seed storage protein II basic
chain;"
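
To make the flat scheme above concrete, here is a minimal Python sketch
of how the DE lines might be split into a description plus tag/value
pairs. It is not the Biopython SwissProt parser; the function name
split_de_lines is made up for this example, and it assumes every DE line
starts with the five-character "DE   " prefix with Contains sub-entries
indented further.

```python
# Sketch only: split_de_lines is a hypothetical helper, not Biopython API.
# Assumes the "DE   " prefix is five characters and that lines belonging
# to a "Contains:" block are indented beyond that prefix.


def split_de_lines(de_lines):
    """Return (description, [(tag, value), ...]) from SwissProt DE lines."""
    description = None
    pairs = []
    contains = None  # lines of the Contains block being collected, if any
    for line in de_lines:
        body = line[5:]
        text = body.strip()
        if body.startswith(" ") and contains is not None:
            contains.append(text)  # keep nested entries verbatim
            continue
        if contains is not None:
            pairs.append(("Contains", "\n".join(contains)))
            contains = None
        if text == "Contains:":
            contains = []
            continue
        tag, value = text.split(": ", 1)
        value = value.rstrip(";")
        if tag == "RecName" and description is None:
            description = value  # candidate for bioentry.description
        else:
            pairs.append((tag, value))
    if contains is not None:
        pairs.append(("Contains", "\n".join(contains)))
    return description, pairs


example = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]
description, pairs = split_de_lines(example)
```

Each (tag, value) pair could then go into bioentry_qualifier_value as a
simple string, with each Contains block kept as one multi-line value.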

Peter


From hlapp at gmx.net  Sun May 17 15:21:59 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 17 May 2009 11:21:59 -0400
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
	<320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
Message-ID: <A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>


On May 17, 2009, at 8:40 AM, Peter wrote:

> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>>
>> On May 16, 2009, at 7:28 PM, Peter wrote:
>>>> That could be changed to an XML string:
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <gene_names>
>>>> <gene_name>
>>>>  <Name>GC1QBP</Name>
>>>>  <Synonyms>HABP1</Synonyms>
>>>>  <Synonyms>SF2P32</Synonyms>
>>>>  <Synonyms>C1QBP</Synonyms>
>>>> </gene_name>
>>>> </gene_names>
>>>>
>>>> Thinking about this we should attempt to coalesce around a standard
>>>> instead of forcing the other Bio*  to a specific format.
>
> [...] Here you have mapped RecName and AltName fields in the DE  
> lines to
> Name and Synonyms (shouldn't that be Synonym singular?).

The example is for the GN lines in SwissProt, not the DE lines.

> [...]
> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>> Not necessarily. If you have a flat serialization (such as XML) the  
>> nested
>> structure isn't needed. Of course that's not a fully normalized  
>> relational
>> representation, but if you had one, how often would it be used, how
>> efficient would those queries be (SQL is poor at nested or  
>> recursive data
>> structures), and how much pain would it be to write the object- 
>> relational
>> mappings?
>
> In this example, searching the database using one of the SwissProt
> AltNames (synonyms), or filtering on the Flags sounds like a
> reasonable request - but this would be very difficult if the data is
> stored inside XML strings.

Actually no. Modern full-text indexers (inside or outside the  
database) can index XML text columns right away and very well. In  
fact, for the last project that I built a full-text search for (on top  
of a BioSQL database) I did that by writing custom XML documents to a  
separate table for each record I wanted indexed. Oracle's full text  
indexer did the rest. I also built a separate identifier/name/ 
accession index that pulled all the gene names, symbols, accession  
numbers, identifiers etc into a single table for indexing.

What I mean is, a fully normalized relational representation,  
especially if nested, is often not the most efficient data structure  
for efficient searching and filtering.
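
To make the separate name index idea concrete, here is a minimal sketch (not
the code from the project mentioned above) that flattens the GN XML example
into rows suitable for a single lookup table. The function name and the row
layout (bioentry_id, kind, name) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The GN example from earlier in the thread, without the XML declaration.
GN_XML = """<gene_names>
<gene_name>
 <Name>GC1QBP</Name>
 <Synonyms>HABP1</Synonyms>
 <Synonyms>SF2P32</Synonyms>
 <Synonyms>C1QBP</Synonyms>
</gene_name>
</gene_names>"""


def flatten_gene_names(bioentry_id, xml_text):
    """Return (bioentry_id, kind, name) rows for a flat name index table."""
    root = ET.fromstring(xml_text)
    rows = []
    for gene in root.findall("gene_name"):
        for child in gene:
            # child.tag is "Name" or "Synonyms"; child.text is the symbol
            rows.append((bioentry_id, child.tag, child.text.strip()))
    return rows


rows = flatten_gene_names(1, GN_XML)
```

Each row can then be indexed like any other column, while the XML string
itself stays in one place.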

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From bugzilla-daemon at portal.open-bio.org  Sun May 17 22:53:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 17 May 2009 18:53:13 -0400
Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic
	nucleotide alphabet
In-Reply-To: <bug-2829-42@http.bugzilla.open-bio.org/>
Message-ID: <200905172253.n4HMrDIX006938@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2829


------- Comment #3 from david.wyllie at ndm.ox.ac.uk  2009-05-17 18:53 EST -------
(In reply to comment #2)
> See:
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001515.html
> 

Hi

thank you very much for explaining.

I'm not sure this is a bug, it's a design feature due to my not understanding
the implications of generic_nucleotide.  

I know it's DNA, and if one uses generic_dna instead in the testcase, all is
well.

Alphabets are explained clearly in the documentation.  Thank you again.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Mon May 18 10:08:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 18 May 2009 06:08:45 -0400
Subject: [Biopython-dev] [Bug 2829] BioSQL does not record a generic
	nucleotide alphabet
In-Reply-To: <bug-2829-42@http.bugzilla.open-bio.org/>
Message-ID: <200905181008.n4IA8j0J015956@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2829


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-18 06:08 EST -------
(In reply to comment #3)
> Hi
> 
> thank you very much for explaining.
> 
> I'm not sure this is a bug, it's a design feature due to my
> not understanding the implications of generic_nucleotide.  

As I argued on the BioSQL mailing list, generic nucleotide
sequences are a valid case not catered to at the moment.
However, they are a corner case, and have no equivalent in
BioPerl (which is happy to guess at DNA or RNA).

Marking this bug as WON'T FIX.

> I know it's DNA, and if one uses generic_dna instead in
> the testcase, all is well.

Good - if you know you have DNA, then specifying a DNA
alphabet would be my recommended course of action.

> Alphabets are explained clearly in the documentation.
> Thank you again.

Let us know if you find anything that needs further
clarification in the documentation.

Thanks,

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython at maubp.freeserve.co.uk  Mon May 18 13:38:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 May 2009 14:38:03 +0100
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
	<320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
	<A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>
Message-ID: <320fb6e00905180638q29de63c4if0627eff416c4481@mail.gmail.com>

On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 17, 2009, at 8:40 AM, Peter wrote:
>>
>> [...] Here you have mapped RecName and AltName fields in the DE lines to
>> Name and Synonyms (shouldn't that be Synonym singular?).
>
> The example is for the GN lines in SwissProt, not the DE lines.

Ah, that probably explains some of my confusion.

>> In this example, searching the database using one of the SwissProt
>> AltNames (synonyms), or filtering on the Flags sounds like a
>> reasonable request - but this would be very difficult if the data is
>> stored inside XML strings.
>
> Actually no. Modern full-text indexers (inside or outside the database) can
> index XML text columns right away and very well. In fact, for the last
> project that I built a full-text search for (on top of a BioSQL database) I
> did that by writing custom XML documents to a separate table for each
> record I wanted indexed. Oracle's full text indexer did the rest. I also built a
> separate identifier/name/accession index that pulled all the gene names,
> symbols, accession numbers, identifiers etc into a single table for
> indexing.

OK, when I said searching "would be very difficult if the data is
stored inside XML strings", maybe it wasn't so difficult for you - but
that still sounds complicated!

Sticking with the GN lines and the synonym, if this was stored as a
simple tag/value as usual in BioSQL, I would write my SQL statement to
search the annotation table where the term id was that associated with
a GN synonym, and the annotation value was "HABP1".  Simple.

Using the XML approach, are you suggesting you could do a full text
search on the annotation value field, looking for any rows where the
field contains "<Synonyms>HABP1</Synonyms>", where the term id matches
the GN lines' XML string? This sounds simplistic and probably rather
slow - presumably why you resorted to the more complicated indexing
scheme described above?
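
To make the comparison concrete, here is a small sketch of the two queries
using Python's sqlite3 module. The two tables are a deliberately simplified
stand-in for the BioSQL annotation tables (the table, column and term names
here are assumptions for illustration, not the real schema).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE annotation (bioentry_id INTEGER, term_id INTEGER, value TEXT);
INSERT INTO term VALUES (1, 'gene_synonym');
INSERT INTO term VALUES (2, 'gene_names_xml');
-- Record 10 stores each synonym as its own tag/value row
INSERT INTO annotation VALUES (10, 1, 'HABP1');
INSERT INTO annotation VALUES (10, 1, 'SF2P32');
-- Record 11 stores the whole GN block as one XML string
INSERT INTO annotation VALUES (11, 2,
  '<gene_names><gene_name><Name>GC1QBP</Name><Synonyms>HABP1</Synonyms></gene_name></gene_names>');
""")

# Tag/value approach: exact match on the value, which an ordinary index can serve
tag_value_hits = conn.execute(
    "SELECT a.bioentry_id FROM annotation a "
    "JOIN term t ON a.term_id = t.term_id "
    "WHERE t.name = ? AND a.value = ?",
    ("gene_synonym", "HABP1"),
).fetchall()

# XML approach: substring match inside the stored string
xml_hits = conn.execute(
    "SELECT a.bioentry_id FROM annotation a "
    "JOIN term t ON a.term_id = t.term_id "
    "WHERE t.name = ? AND a.value LIKE ?",
    ("gene_names_xml", "%<Synonyms>HABP1</Synonyms>%"),
).fetchall()
```

Both queries find their record, but the LIKE pattern with a leading wildcard
generally has to scan every candidate row unless a full-text index is added.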

> What I mean is, a fully normalized relational representation, especially if
> nested, is often not the most efficient data structure for efficient
> searching and filtering.

OK.  But do we really need to worry about complex nested structures
for the SwissProt annotation (or in general)?

Peter


From biopython at maubp.freeserve.co.uk  Tue May 19 14:23:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 May 2009 15:23:58 +0100
Subject: [Biopython-dev] [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>
	<290052.25369.qm@web62407.mail.re1.yahoo.com>
	<320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
Message-ID: <320fb6e00905190723u2eca08e6o3f70bf37be79e4bf@mail.gmail.com>

Last month on this thread we started talking about the BLAST
command line wrappers:
http://lists.open-bio.org/pipermail/biopython/2009-April/005134.html

On Wed, Apr 29, 2009, Peter wrote:
> On Wed, Apr 29, 2009, Michiel de Hoon wrote:
>>
>> How would users typically use Bio.Blast.Applications?
>
> In the next release, I would aim to have Bio.Blast.Applications
> updated to cover blastall (fully), plus blastpgp and rpsblast
> (currently not covered) and for the three helper functions
> Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use
> Bio.Blast.Applications internally.

That should be done now in CVS - it turned out to be a lot more
tedious than I had expected, but I think we are OK.

I would be very grateful to have a couple of people test this out.
At the very least, just update your copy of Biopython and confirm
any existing scripts using the Bio.Blast.NCBIStandalone
blastall, blastpgp or rpsblast functions still work as expected.

Note we still need to agree on the preferred name for each
parameter (i.e. what do we use for the python properties) as
discussed on this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005976.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006039.html

Peter


From biopython at maubp.freeserve.co.uk  Tue May 19 17:00:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 May 2009 18:00:41 +0100
Subject: [Biopython-dev] Repeated options in command line interfaces
Message-ID: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com>

Hello all,

Yes - it's another thread about command line wrappers!

One of the Roche 454 off instrument applications is runMapping,
which in the most general situation allows you to map one or
more SFF files onto one or more FASTA files, e.g.

runMapping -o ~/test -ref example1.fasta example2.fasta -read
data1.sff data2.sff

Notice that "-ref" and "-read" are not repeated, so we could
treat this via the current application wrapper system as follows:

#These modules don't exist (yet):
from Bio.Sequencing.Applications import RunMappingCommandline
cline = RunMappingCommandline()
cline.ref = "example1.fasta example2.fasta"
cline.read = "data1.sff data2.sff"

This isn't very elegant, but would work.  Over on Bug 2815,
Cymon and I have briefly discussed the --seed parameter in
Mafft, which is used to specify one or more alignment files, e.g.

mafft ... --seed alignment1 --seed alignment2 --seed alignment3 ...

Notice that "--seed" is repeated before each value.

I was thinking it would be nice to treat this as a single
property (seed) which takes a list of strings as its value:

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline()
cline.seed = ["alignment1", "alignment2", ...]

or, equivalently:

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline(seed=["alignment1", "alignment2", ...])

or, using the old set_parameter approach,

from Bio.Align.Applications import MafftCommandline
cline = MafftCommandline()
cline.set_parameter("seed", ["alignment1", "alignment2", ...])

and similarly for a Roche wrapper, e.g.

#These modules don't exist (yet):
from Bio.Sequencing.Applications import RunMappingCommandline
cline = RunMappingCommandline()
cline.ref = ["example1.fasta", "example2.fasta"]
cline.read = ["data1.sff", "data2.sff"]

Doing this nicely would require two _Option subclasses in
Bio.Application, one for repeated options like "seed" in
Mafft, and one for multiple valued options like "ref" and
"read" in the Roche tools.

Does this sound sensible?
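
For concreteness, the two option types might render along these lines. This
is only a sketch: the class names RepeatedOption and MultiValueOption are
hypothetical and do not exist in Bio.Application, and no quoting of filenames
containing spaces is attempted.

```python
class RepeatedOption:
    """Hypothetical option whose switch is repeated before each value."""

    def __init__(self, switch, values=()):
        self.switch = switch
        self.values = list(values)

    def __str__(self):
        # e.g. "--seed alignment1 --seed alignment2"
        return " ".join("%s %s" % (self.switch, v) for v in self.values)


class MultiValueOption:
    """Hypothetical option whose switch appears once, followed by all values."""

    def __init__(self, switch, values=()):
        self.switch = switch
        self.values = list(values)

    def __str__(self):
        # e.g. "-ref example1.fasta example2.fasta"
        if not self.values:
            return ""
        return "%s %s" % (self.switch, " ".join(self.values))


seed = RepeatedOption("--seed", ["alignment1", "alignment2"])
ref = MultiValueOption("-ref", ["example1.fasta", "example2.fasta"])
seed_text = str(seed)
ref_text = str(ref)
```

Either way the user sets a list of strings, and only the rendering differs.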

Does anyone have any more examples?

Peter


From bugzilla-daemon at portal.open-bio.org  Wed May 20 16:31:24 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 20 May 2009 12:31:24 -0400
Subject: [Biopython-dev] [Bug 2833] New: Features insertion on previous
	bioentry_id
Message-ID: <bug-2833-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

           Summary: Features insertion on previous bioentry_id
           Product: Biopython
           Version: 1.50
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P1
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: andrea at biodec.com


Biopython 1.50 (also 1.50b it's the same code)
python2.4 or python2.5
postgresql 8.3
BioSQL Schema 1.0.1

Problem: 
 Suppose you have 3 SeqRecords (s1, s2, s3), where
  - s1 == s3 (but from different sources), in other words
    s1 and s3 have the same content but are not the same object
  - s2 != s1 and s2 != s3

 Now load them into a BioSQL db in this order:
 - db.load([s1])
 - db.load([s2])
 - db.load([s3])

 At the end of the loading I have only 2 bioentry IDs,
 BUT the s3 features have been inserted on the s2 record.

---------------------------------------------------------------------------------------
More in details (documented behaviour):

print s1
ID: ENST00000334859
Name: ENST00000334859
Description: Leucine-rich repeat and calponin homology domain-containing
protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8]
Number of features: 24
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000334859']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000334859
Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA',
IUPACAmbiguousDNA())

print s2
ID: ENST00000391466
Name: ENST00000391466
Description: CDNA FLJ44976 fis, clone BRAWH3001833.
[Source:Uniprot/SPTREMBL;Acc:Q6ZQT1]
Number of features: 8
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000391466']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000391466
Seq('ATGACAGTGATTCTCTTTACCCAACTCACCGCACCCATGGCAGTGATTCTCTTT...TAG',
IUPACAmbiguousDNA())

print s3
ID: ENST00000334859
Name: ENST00000334859
Description: Leucine-rich repeat and calponin homology domain-containing
protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8]
Number of features: 24
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000334859']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000334859
Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA',
IUPACAmbiguousDNA())

As you can see: 
 - s1 and s3 are identical and s2 differs from them.
 - s1 and s3 have 24 features
 - s2 has 8 features

STEP 1 (biosql insertion of s1)
  - db.load([s1])
  - looking into the db:
 select bioentry_id, name, accession, identifier  from bioentry;
 bioentry_id |      name       |    accession    |   identifier    |
-------------+-----------------+-----------------+-----------------+
          39 | ENST00000334859 | ENST00000334859 | ENST00000334859 |
(1 row)

  select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
(24 rows)


STEP 2 (biosql insertion of s2)
  - db.load([s2])
  - looking into the db:
 select bioentry_id, name, accession, identifier  from bioentry;
 bioentry_id |      name       |    accession    |   identifier
-------------+-----------------+-----------------+-----------------
          39 | ENST00000334859 | ENST00000334859 | ENST00000334859
          40 | ENST00000391466 | ENST00000391466 | ENST00000391466
(2 rows)

  select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
           315 |          40 |           28 |             15 |              |    1
           316 |          40 |           28 |             15 |              |    2
           317 |          40 |           28 |             15 |              |    3
           318 |          40 |           28 |             15 |              |    4
           319 |          40 |           28 |             15 |              |    5
           320 |          40 |           28 |             15 |              |    6
           321 |          40 |           28 |             15 |              |    7
           322 |          40 |           28 |             15 |              |    8
(32 rows)

STEP 3 (biosql insertion of s3)
  - db.load([s3])
  - looking into the db:
select bioentry_id, name, accession, identifier  from bioentry;
 bioentry_id |      name       |    accession    |   identifier
-------------+-----------------+-----------------+-----------------
          39 | ENST00000334859 | ENST00000334859 | ENST00000334859
          40 | ENST00000391466 | ENST00000391466 | ENST00000391466
(2 rows)

select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
           315 |          40 |           28 |             15 |              |    1
           316 |          40 |           28 |             15 |              |    2
           317 |          40 |           28 |             15 |              |    3
           318 |          40 |           28 |             15 |              |    4
           319 |          40 |           28 |             15 |              |    5
           320 |          40 |           28 |             15 |              |    6
           321 |          40 |           28 |             15 |              |    7
           322 |          40 |           28 |             15 |              |    8
           323 |          40 |           27 |             15 |              |    1
           324 |          40 |           27 |             15 |              |    2
           325 |          40 |           27 |             15 |              |    3
           326 |          40 |           27 |             15 |              |    4
           327 |          40 |           27 |             15 |              |    5
           328 |          40 |           14 |             15 |              |    6
           329 |          40 |           14 |             15 |              |    7
           330 |          40 |           30 |             15 |              |    8
           331 |          40 |           30 |             15 |              |    9
           332 |          40 |           30 |             15 |              |   10
           333 |          40 |           30 |             15 |              |   11
           334 |          40 |           30 |             15 |              |   12
           335 |          40 |           30 |             15 |              |   13
           336 |          40 |           30 |             15 |              |   14
           337 |          40 |           30 |             15 |              |   15
           338 |          40 |           30 |             15 |              |   16
           339 |          40 |           30 |             15 |              |   17
           340 |          40 |           25 |             15 |              |   18
           341 |          40 |           25 |             15 |              |   19
           342 |          40 |           25 |             15 |              |   20
           343 |          40 |           25 |             15 |              |   21
           344 |          40 |           25 |             15 |              |   22
           345 |          40 |           26 |             15 |              |   23
           346 |          40 |           26 |             15 |              |   24
(56 rows)

As you can see, the 24 features of the s3 seqrecord have been added to
bioentry_id 40 (which was s2).
------------------------------------------------------------------------------------

The problem is not so easy to understand. I had a look at the code of
Loader.py and found the following. The code works this way:
  1) It loads the seqrecord using:
          load_seqrecord(self, record)
          The first thing this method does is fill the bioentry table
          with the method:
                _load_bioentry_table(self, record)
                The last thing this method does is get the bioentry_id
                of the "just inserted" record with the db method:
                self.adaptor.last_id('bioentry')

  2) Then, with the bioentry_id recovered by the first method,
     it fills the other tables, including seqfeature.

  3) In BioSQL (the schema), if you try to insert a record into
     the bioentry table that has the same Identifier or Accession
     as an existing record, it doesn't do anything
     and reports "INSERT 0 0".

  4) So, if you try to insert the s3 record, which has the same
     Accession and Identifier as s1, the
     load_seqrecord(self, record) method will return
     the bioentry_id of the s2 record (it will be the
     self.adaptor.last_id('bioentry') output).

Maybe other information will be transferred to s2 (not only
the features). For example, "dbxrefs" could suffer
from the same problem.

I think the solution depends on what we expect from the code:
  - if we expect a behaviour like "don't do anything with identical
    Accession/Identifier", it is better to check the last_id before and
    after insertion and return None if it is identical,
    then treat a "None" bioentry_id as a block on the other
    BioSQL insertions.

  - if we expect a "Merge" behaviour, it is better to
    retrieve the bioentry_id of the object with the same Accession/Identifier,
    then verify that the 2 seqrecords have identical sequences, and
    then merge features/annotations/dbxrefs etc.

  - other behaviours... other solutions...
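
The first option could look roughly like the sketch below. It is not the
Loader.py code: it uses an in-memory sqlite3 table as a stand-in, with
INSERT OR IGNORE imitating the silent "INSERT 0 0" skip, and the helper
names last_id and load_bioentry are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry ("
    "bioentry_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "accession TEXT UNIQUE)"
)


def last_id(conn):
    """Stand-in for adaptor.last_id('bioentry'): highest id in the table."""
    return conn.execute("SELECT MAX(bioentry_id) FROM bioentry").fetchone()[0]


def load_bioentry(conn, accession):
    """Insert a bioentry row; return its id, or None if nothing was inserted."""
    before = last_id(conn)
    # OR IGNORE silently skips a duplicate accession, like the schema rule
    conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession) VALUES (?)", (accession,)
    )
    after = last_id(conn)
    if after == before:
        return None  # duplicate: the caller must not load features etc.
    return after


ids = [load_bioentry(conn, acc)
       for acc in ("ENST00000334859", "ENST00000391466", "ENST00000334859")]
```

Here the third load returns None instead of the id of the second record, so
the features of s3 would not be attached to s2. Comparing ids like this is
only safe if no other client is inserting at the same time.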

Andrea


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Wed May 20 20:25:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 20 May 2009 16:25:39 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905202025.n4KKPdYT020904@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-20 16:25 EST -------
(In reply to comment #0)
> Biopython 1.50 (also 1.50b it's the same code)
> python2.4 or python2.5
> postgresql 8.3
> BioSQL Schema 1.0.1
> 
> Problem: 
>  imagine to have 3 seqrecord (s1,s2,s3), ... load a Biosql db in this order:
>  - db.load([s1])
>  - db.load([s2])
>  - db.load([s3])
> 
>  At the end of the loading i will have only 2 bioentry ID 
>  BUT the s3.features will be inserted on s2 seqrecord.

BioSQL will allow you to have multiple versions of the same record but they
must have different versions (e.g. s1.id="ENST00000334859.0" and
s3.id="ENST00000334859.1" should work). The problem with your data is s1.id ==
s3.id, so I would expect them to get the same accession and version (taken as
zero).  Therefore s3 should *fail* to load.
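
To illustrate the point about versions, here is a sketch of the kind of
splitting involved. The function name is hypothetical and this is not copied
from the loader; it just assumes a single "." followed by an integer marks
the version, with zero used otherwise.

```python
def split_accession_version(record_id):
    """Return (accession, version) for a record id, defaulting version to 0."""
    if record_id.count(".") == 1:
        accession, version = record_id.split(".")
        try:
            return accession, int(version)
        except ValueError:
            pass  # suffix is not a number, treat the whole id as the accession
    return record_id, 0


key_s1 = split_accession_version("ENST00000334859")
key_s3 = split_accession_version("ENST00000334859")
key_versioned = split_accession_version("ENST00000334859.1")
```

With identical ids, s1 and s3 map to the same (accession, version) pair,
whereas a ".1" suffix on s3 would give it a distinct key.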

I can try and reproduce this using the information given, but it would help if
you could attach the original sequence files to this bug.

Thanks,

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Wed May 20 21:07:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 20 May 2009 17:07:08 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905202107.n4KL78te024053@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-20 17:07 EST -------
(In reply to comment #0)
> Biopython 1.50 (also 1.50b it's the same code)
> python2.4 or python2.5
> postgresql 8.3
> BioSQL Schema 1.0.1

What version of psycopg are you using? i.e. The python library for talking to
PostgreSQL.

Have you tried running Biopython's BioSQL unit tests?  You'll need to configure
your settings in setup_BioSQL.py first.

If that looks good could you try updating to the latest Biopython from CVS and
retesting? I've added a basic check in test_BioSQL.py for duplicated entries
(using a GenBank file) which works on my machine using MySQL.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu May 21 10:31:42 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:31:42 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211031.n4LAVgvW019852@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #3 from andrea at biodec.com  2009-05-21 06:31 EST -------
Created an attachment (id=1299)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1299&action=view)
Pickled Seqrecord s1




From bugzilla-daemon at portal.open-bio.org  Thu May 21 10:32:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:32:12 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211032.n4LAWBXC019888@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #4 from andrea at biodec.com  2009-05-21 06:32 EST -------
Created an attachment (id=1300)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1300&action=view)
Pickled Seqrecord s2




From bugzilla-daemon at portal.open-bio.org  Thu May 21 10:32:28 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:32:28 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211032.n4LAWSlA019903@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #5 from andrea at biodec.com  2009-05-21 06:32 EST -------
Created an attachment (id=1301)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1301&action=view)
Pickled Seqrecord s3




From bugzilla-daemon at portal.open-bio.org  Thu May 21 10:34:46 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:34:46 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211034.n4LAYkhC020056@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #6 from andrea at biodec.com  2009-05-21 06:34 EST -------
Hi Peter,
I ran 4 tests: [python2.4, python2.5] * [psycopg, psycopg2]
with
 - Biopython from "this morning" CVS
 - psycopg.__version__  '1.1.21'
 - psycopg2.__version__ '2.0.7 (dec mx dt ext pq3)'

In every case I get the same results:

Make sure all records are correctly loaded. ... ok
Make sure can't import records twice. ... FAIL
Indepth check that SeqFeatures are transmitted through the db. ... ok
Load SeqRecord objects into a BioSQL database. ... ok
Get a list of all items in the database. ... ok
Test retrieval of items using various ids. ... ok
Check can add DBSeq objects together. ... ok
Check can turn a DBSeq object into a Seq or MutableSeq. ... ok
Make sure Seqs from BioSQL implement the right interface. ... ok
Check SeqFeatures of a sequence. ... ok
Make sure SeqRecords from BioSQL implement the right interface. ... ok
Check that slices of sequences are retrieved properly. ... ok

======================================================================
FAIL: Make sure can't import records twice.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 374, in test_reload
    self.assert_("duplicate" in str(err).lower())
AssertionError

----------------------------------------------------------------------
Ran 12 tests in 23.815s

FAILED (failures=1)

I have 1 failure, in "Make sure can't import records twice.", which seems
relevant to the problem.


Then I tried the script below, again with python2.4, python2.5, psycopg and
psycopg2.

I have attached the pickles of the 3 SeqRecords so you can try it yourself:

###########################################################
from BioSQL import BioSeqDatabase
import cPickle

server = BioSeqDatabase.open_database(driver="psycopg2", user="postgres",
                                      passwd="hidden", host="dbservertest",
                                      db="test_biosql")

## LOAD SeqRecords from pickle
s1=cPickle.load(open('s1.cpk'))
s2=cPickle.load(open('s2.cpk'))
s3=cPickle.load(open('s3.cpk'))

## LOAD INTO DB 
db=server.new_database('test')
server.commit()
db.load([s1])
db.load([s2])
db.load([s3])
db.adaptor.commit()
###########################################################


I always had the same problem.

So I prepared a buildout environment with the latest Biopython
and a newer psycopg2 library (for psycopg I already had the latest).

psycopg2.__version__ '2.0.11 (dt dec ext pq3)'

The result of the tests was the same.
The result of the upload (based on the pickled SeqRecords) was the same.

Thanks
Andrea




From bugzilla-daemon at portal.open-bio.org  Thu May 21 10:39:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 06:39:18 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211039.n4LAdIit020365@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-21 06:39 EST -------
(In reply to comment #6)
> Hi Peter,
> i did 4 tests: [python2.4,python2.5]*[psycopg,psycopg2]
> with 
>  - biopython from "this morning" cvs.
>  - psycopg.__version__  '1.1.21'
>  - psycopg2.__version__ '2.0.7 (dec mx dt ext pq3)'
> 
> in any case i've the same results:
> 
> Make sure all records are correctly loaded. ... ok
> Make sure can't import records twice. ... FAIL
> ...
> ======================================================================
> FAIL: Make sure can't import records twice.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_BioSQL.py", line 374, in test_reload
>     self.assert_("duplicate" in str(err).lower())
> AssertionError

OK - the unit test is doing what I expected, and the duplicate insertion
is failing. It's just that the error message is different from what I
expected, which should be trivial to fix. This means inserting the same
GenBank record twice fails (which is good).

However, the unit test doesn't reproduce your original issue. Hopefully your
pickled SeqRecord objects will help there...

Thanks,

Peter




From bugzilla-daemon at portal.open-bio.org  Thu May 21 11:36:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 07:36:34 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211136.n4LBaYO8024199@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-21 07:36 EST -------
(In reply to comment #7)
> However, the unit test doesn't reproduce your original issue. Hopefully
> your pickled SeqRecord objects will help there...

Based on your example script in comment 6 with the pickled SeqRecord objects,
but using MySQL, I get an IntegrityError as expected:

Traceback (most recent call last):
...
IntegrityError: (1062, "Duplicate entry 'ENST00000334859-2-0' for key 2")

I get the same error with simplified records lacking any annotation or features
(I just saved your three records to a FASTA file and reloaded them). So
whatever is going wrong seems to be PostgreSQL specific (or at least, does not
affect MySQL).

I have updated test_BioSQL.py in CVS to cover more variations (revision 1.33),
and hopefully the error message check should work on PostgreSQL as well. It
would be very helpful if you could test that.

Part of the new tests is a slight variation on your original example.  Could
you try this:

db.load([s1])
server.commit()
db.load([s2])
server.commit()
db.load([s3])
server.commit()

This might tell us if the issue is with PostgreSQL not checking the key
constraints until the commit.

Peter




From chapmanb at 50mail.com  Thu May 21 12:29:27 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 21 May 2009 08:29:27 -0400
Subject: [Biopython-dev] Repeated options in command line interfaces
In-Reply-To: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com>
References: <320fb6e00905191000g473b9e68r12b8704652b1ad93@mail.gmail.com>
Message-ID: <20090521122927.GM84112@sobchak.mgh.harvard.edu>

Hi Peter;

> Yes - its another thread about command line wrappers!

It seems like y'all are unearthing every single crazy command line
option choice out there. Great to have this fleshed out.

> One of the Roche 454 off instrument applications is runMapping,
> which in the most general situation allows you to map one or
> more SFF files onto one or more FASTA files, e.g.
> 
> runMapping -o ~/test -ref example1.fasta example2.fasta -read
> data1.sff data2.sff
[...]
> the --seed parameter in Mafft, which is used to specify one or more
> alignment files, e.g.
> 
> mafft ... --seed alignment1 --seed alignment2 --seed alignment3 ...
> 
> Notice that "--seed" is repeated before each value.
> 
> I was thinking it would be nice to treat this as a single
> property (seed) which takes a list of strings as its value:
> 
> from Bio.Align.Applications import MafftCommandline
> cline = MafftCommandline()
> cline.seed = ["alignment1", "alignment2", ...]
[...]
> #These modules don't exist (yet):
> from Bio.Sequencing.Applications import RunMappingCommandline
> cline = RunMappingCommandline()
> cline.ref = ["example1.fasta", "example2.fasta"]
> cline.read = ["data1.sff", "data2.sff"]

This makes good sense to me. It hides the actual nastiness a bit and
makes it clear in the code what is happening -- assigning multiple
parameters to a single option. It sounds like a great way to handle
it.
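
A minimal sketch of the idea, assuming nothing about how the wrappers
are actually implemented: expand_option below is a hypothetical helper
(not part of Biopython) that turns one list-valued property into command
line tokens, covering both the Mafft style (flag repeated before each
value) and the runMapping style (flag once, then all the values).

```python
def expand_option(flag, values, repeat_flag):
    """Return the command line tokens for one list-valued option.

    repeat_flag=True puts the flag before every value (Mafft --seed style);
    repeat_flag=False gives the flag once, then the values (runMapping style).
    """
    values = list(values)
    if not values:
        # An unset option contributes nothing to the command line.
        return []
    if repeat_flag:
        tokens = []
        for value in values:
            tokens.extend([flag, value])
        return tokens
    return [flag] + values


mafft_args = expand_option("--seed", ["alignment1", "alignment2"], repeat_flag=True)
mapping_args = expand_option("-ref", ["example1.fasta", "example2.fasta"], repeat_flag=False)
print(" ".join(mafft_args))
print(" ".join(mapping_args))
```

Either way the calling code only ever assigns a list, and the quirk of
each tool stays inside the wrapper.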

Brad


From bugzilla-daemon at portal.open-bio.org  Thu May 21 15:04:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 11:04:40 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211504.n4LF4ej0015238@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #9 from andrea at biodec.com  2009-05-21 11:04 EST -------
(In reply to comment #8)
> (In reply to comment #7)
> > However, the unit test doesn't reproduce your original issue. Hopefully
> > your pickled SeqRecord objects will help there...
> 
> Based on your example script in comment 6 with the pickled SeqRecord objects,
> but using MySQL, I get an IntegrityError as expected:
> 
> Traceback (most recent call last):
> ...
> IntegrityError: (1062, "Duplicate entry 'ENST00000334859-2-0' for key 2")
> 
> I get the same error with simplified records lacking any annotation or features
> (I just saved your three records to a FASTA file and reloaded them). So what
> ever is going wrong seems to be PostgreSQL specific (or at least, does not
> affect MySQL).

As I see it, what is PostgreSQL specific is that I don't get any
error at all. If Biopython expects an error from PostgreSQL in this
situation, then there is a problem with PostgreSQL (or with my setup).

> 
> I have updated test_BioSQL.py in CVS to cover more variations (revision 1.33),
> and hopefully the error message check should work on PostgreSQL as well. It
> would be very helpful if you could test that.

These are the results of the tests; they are the same on python2.4 and python2.5:
Make sure can't import records with same ID (in one go). ... FAIL
Make sure can't import records with same ID (in steps). ... FAIL
Make sure can't import records with same ID (in steps with commit). ... FAIL
Make sure can't import a single record twice (in one go). ... FAIL
Make sure can't import a single record twice (in steps). ... FAIL
Make sure can't import a single record twice (in steps with commit). ... FAIL
Make sure all records are correctly loaded. ... ok
Make sure can't reimport existing records. ... FAIL
Indepth check that SeqFeatures are transmitted through the db. ... ok
Load SeqRecord objects into a BioSQL database. ... ok
Get a list of all items in the database. ... ok
Test retrieval of items using various ids. ... ok
Check can add DBSeq objects together. ... ok
Check can turn a DBSeq object into a Seq or MutableSeq. ... ok
Make sure Seqs from BioSQL implement the right interface. ... ok
Check SeqFeatures of a sequence. ... ok
Make sure SeqRecords from BioSQL implement the right interface. ... ok
Check that slices of sequences are retrieved properly. ... ok

======================================================================
FAIL: Make sure can't import records with same ID (in one go).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 397, in test_duplicate_id_load
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't import records with same ID (in steps).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 410, in test_duplicate_id_load2
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't import records with same ID (in steps with commit).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 424, in test_duplicate_id_load3
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't import a single record twice (in one go).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 361, in test_duplicate_load
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't import a single record twice (in steps).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 373, in test_duplicate_load2
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't import a single record twice (in steps with commit).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 386, in test_duplicate_load3
    err.__class__.__name__ + "\n" + str(err))
AssertionError: Exception
Should have failed!

======================================================================
FAIL: Make sure can't reimport existing records.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 463, in test_reload
    err.__class__.__name__ + "\n" + str(err))
AssertionError: OperationalError
currval of sequence "bioentry_pk_seq" is not yet defined in this session


----------------------------------------------------------------------
Ran 18 tests in 26.938s

FAILED (failures=7)


> 
> Part of the new tests is a slight variation on your original example.  Could
> you try this:
> 
> db.load([s1])
> server.commit()
> db.load([s2])
> server.commit()
> db.load([s3])
> server.commit()
>>> ## LOAD INTO DB
>>> db.load([s1])
1
>>> server.commit()
>>> db.load([s2])
1
>>> server.commit()
>>> db.load([s3])
1
>>> server.commit()
>>>
I don't get any errors!

> This might tell us if the issue is with PostgreSQL not checking the key
> constraints until the commit.
> 
It seems so. If I try to do the insertion via SQL I don't get any
errors. I just get a message of the form:
INSERT 0 0
because PostgreSQL doesn't insert anything.

Andrea




From bugzilla-daemon at portal.open-bio.org  Thu May 21 17:05:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 13:05:12 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905211705.n4LH5Ca6028981@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-21 13:05 EST -------
Well, some progress :)

(In reply to comment #9)
> This is te results of the test: it's the same on python2.4 and python2.5:
> Make sure can't import records with same ID (in one go). ... FAIL
> Make sure can't import records with same ID (in steps). ... FAIL
> Make sure can't import records with same ID (in steps with commit). ... FAIL
> Make sure can't import a single record twice (in one go). ... FAIL
> Make sure can't import a single record twice (in steps). ... FAIL
> Make sure can't import a single record twice (in steps with commit). ... FAIL
> Make sure all records are correctly loaded. ... ok
> Make sure can't reimport existing records. ... FAIL
> Indepth check that SeqFeatures are transmitted through the db. ... ok
> Load SeqRecord objects into a BioSQL database. ... ok
> Get a list of all items in the database. ... ok
> Test retrieval of items using various ids. ... ok
> Check can add DBSeq objects together. ... ok
> Check can turn a DBSeq object into a Seq or MutableSeq. ... ok
> Make sure Seqs from BioSQL implement the right interface. ... ok
> Check SeqFeatures of a sequence. ... ok
> Make sure SeqRecords from BioSQL implement the right interface. ... ok
> Check that slices of sequences are retrieved properly. ... ok
> 
> ======================================================================
> FAIL: Make sure can't import records with same ID (in one go).
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_BioSQL.py", line 397, in test_duplicate_id_load
>     err.__class__.__name__ + "\n" + str(err))
> AssertionError: Exception
> Should have failed!
> ...

Also the error formatting wasn't quite what I had intended, fixed in CVS.
However, most of the tests are allowing duplicates to be recorded without any
error (on PostgreSQL).  This is bad.

> ======================================================================
> FAIL: Make sure can't reimport existing records.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_BioSQL.py", line 463, in test_reload
>     err.__class__.__name__ + "\n" + str(err))
> AssertionError: OperationalError
> currval of sequence "bioentry_pk_seq" is not yet defined in this session

Interestingly the final test gives us an OperationalError about the bioentry
table's primary key (presumably from our last_id method which would call the
SQL statement "select currval('bioentry_pk_seq')"). This suggests some clues
about what is going wrong.

http://www.postgresql.org/docs/8.3/static/functions-sequence.html
http://www.postgresql.org/docs/8.3/static/sql-createsequence.html

See also:
http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-pg.sql

CREATE SEQUENCE bioentry_pk_seq;
CREATE TABLE bioentry ( 
         bioentry_id INTEGER DEFAULT nextval ( 'bioentry_pk_seq' ) NOT NULL , 
         biodatabase_id INTEGER NOT NULL , 
         taxon_id INTEGER , 
         name VARCHAR ( 40 ) NOT NULL , 
         accession VARCHAR ( 128 ) NOT NULL , 
         identifier VARCHAR ( 40 ) , 
         division VARCHAR ( 6 ) , 
         description TEXT , 
         version INTEGER NOT NULL , 
         PRIMARY KEY ( bioentry_id ) , 
         UNIQUE ( accession , biodatabase_id , version ) , 
-- CONFIG: uncomment one (and only one) of the two lines below. The
-- first puts a uniqueness constraint on the identifier column alone;
-- the other one puts a uniqueness constraint on identifier only
-- within a namespace.
--       UNIQUE ( identifier ) 
         UNIQUE ( identifier , biodatabase_id ) 
) ; 

CREATE INDEX bioentry_name ON bioentry ( name ); 
CREATE INDEX bioentry_db ON bioentry ( biodatabase_id ); 
CREATE INDEX bioentry_tax ON bioentry ( taxon_id );


I'm a little surprised all the other duplicate record tests show different
behaviour. I have updated test_BioSQL.py to perform all these new duplicate
tests on a clean database - which I probably should have done in the first
place (CVS revision 1.35).

[All these tests are passing on MySQL. Trying the example by hand triggers an
IntegrityError.]

Peter




From bugzilla-daemon at portal.open-bio.org  Thu May 21 22:22:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 18:22:18 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905212222.n4LMMIls028194@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #11 from andrea at biodec.com  2009-05-21 18:22 EST -------
So the problem is related to the different behaviour of PostgreSQL loaded
with the BioSQL schema, compared to MySQL.

Sorry, I thought the problem was due to BioSQL because I didn't know
what the "expected database behaviour" was.

Since we expect an error when inserting a "duplicate" or "near duplicate"
record, we only have to focus on the PostgreSQL BioSQL schema, and why/where
it differs from the MySQL one.

I haven't had time to look at the differences between the various
"duplicate record tests". I will.

[I've tried PostgreSQL 8.4, and it's exactly the same.]




From cy at cymon.org  Thu May 21 22:52:39 2009
From: cy at cymon.org (Cymon Cox)
Date: Thu, 21 May 2009 23:52:39 +0100
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <200905212222.n4LMMIls028194@portal.open-bio.org>
References: <bug-2833-42@http.bugzilla.open-bio.org/>
	<200905212222.n4LMMIls028194@portal.open-bio.org>
Message-ID: <7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com>

2009/5/21 <bugzilla-daemon at portal.open-bio.org>

> http://bugzilla.open-bio.org/show_bug.cgi?id=2833
>
>
>
>
>
> ------- Comment #11 from andrea at biodec.com  2009-05-21 18:22 EST -------
> So the problem is related to the different behaviur adopted by postgres
> loaded
> with the biosql schema, with respect to mysql.
>
> Sorry because i thought the problem was due to BioSQL because i didn't know
> wich was the "expected database behaviour".
>
> Since we expect an error during insertion of a "duplicate" or "quite
> duplicate"
> record... we have only to focus on the postgres biosql schema, and
> why/where it
> differs from the mysql one.
>
> I didn't have time to have a look to the difference between the various
> "duplicate record tests". I will do.
>
> [i've tried postgres 8.4... and it's exactly the same]


Hi Andrea,

The problem appears to be related to the BioSQL schema/PostGreSQL.

As you indicated, adding a duplicate entry to bioentry returns "INSERT 0 0"
and doesn't throw an IntegrityError, which is what the code is looking for
and presumably what MySQL throws.

The reason it doesn't throw an error is one (or both) of the RULES
in the schema:

rule_bioentry_i1
and/or
rule_bioentry_i2

If you delete these two rules, load the schema and try to do a duplicate
entry:

mytest=# insert into bioentry(bioentry_id, biodatabase_id, name, accession,
version) values (2, 1, 'blah1', 'test4', 1);
INSERT 0 1
mytest=# select * from bioentry;
 bioentry_id | biodatabase_id | taxon_id | name  | accession | identifier | division | description | version
-------------+----------------+----------+-------+-----------+------------+----------+-------------+---------
           2 |              1 |          | blah1 | test4     |            |          |             |       1
(1 row)

mytest=# insert into bioentry(bioentry_id, biodatabase_id, name, accession,
version) values (2, 1, 'blah1', 'test4', 1);
ERROR:  duplicate key value violates unique constraint "bioentry_pkey"

we have an error rather than an "INSERT 0 0".

I'm going to assume that psycopg2 would pick up this error and throw an
IntegrityError, but I haven't taken it any further to check.
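
To illustrate the difference at the Python level, here is a rough
analogy only, using the sqlite3 module from the standard library rather
than PostgreSQL: "INSERT OR IGNORE" stands in for the effect of the
rules (the duplicate is silently skipped), while a plain INSERT lets the
constraint violation surface as an IntegrityError. The cut-down table
below is an assumption for the example, not the real BioSQL schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry ("
    " bioentry_id INTEGER PRIMARY KEY,"
    " accession TEXT NOT NULL,"
    " version INTEGER NOT NULL,"
    " UNIQUE (accession, version))"
)
conn.execute("INSERT INTO bioentry VALUES (2, 'test4', 1)")

# Stand-in for the rules: the duplicate is skipped and nothing is raised.
cursor = conn.execute("INSERT OR IGNORE INTO bioentry VALUES (2, 'test4', 1)")
silent_rowcount = cursor.rowcount  # no row was inserted

# Plain insert: the constraint violation surfaces as an exception.
try:
    conn.execute("INSERT INTO bioentry VALUES (2, 'test4', 1)")
    raised = False
except sqlite3.IntegrityError as err:
    raised = True
    print("IntegrityError: %s" % err)

row_total = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
```

In both cases the table still holds a single row; the only difference
is whether the caller is told about the rejected duplicate.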

Cheers, C.


From hlapp at gmx.net  Fri May 22 02:05:17 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 21 May 2009 22:05:17 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com>
References: <bug-2833-42@http.bugzilla.open-bio.org/>
	<200905212222.n4LMMIls028194@portal.open-bio.org>
	<7265d4f0905211552pddc3180ua3a3deb6ba8102bc@mail.gmail.com>
Message-ID: <8C0BF1E3-15DF-4F89-AB57-7AE09B86BCCE@gmx.net>


On May 21, 2009, at 6:52 PM, Cymon Cox wrote:

> [...]
>
> Hi Andrea,
>
> The problem appears to be related to the BioSQL schema/PostGreSQL.
>
> As you indicated, adding a duplicate entry to bioentry returns a  
> "INSERT 0
> 0" and doesnt throw an IntegrityError which is what the code is  
> looking from
> and presumably what MySQL throws.
>
> The reason it doesnt throw an error is because of one (or both) of  
> the RULES
> in the schema:

Indeed, I'd almost forgotten. The rules are there mostly as a remnant  
from earlier versions of PostgreSQL to support transactional loading  
the way bioperl-db (the object-relational mapping for BioPerl) is  
optimized. You probably don't need them anywhere else.

	-hilmar

<gory-details>
Bioperl-db is optimized such that entities that very likely don't  
exist yet in the database are attempted for insert right away. If the  
insert fails due to a unique key violation, the record is looked up  
(and then expected to be found). In Oracle and MySQL you can do this  
and the transaction remains healthy; i.e., you can commit the  
transaction later and all statements except those that failed will be  
committed. In PostgreSQL any failed statement dooms the entire  
transaction, and the only way out is a rollback. In this case, if you  
want the loading of one sequence record as one transaction, failing to  
insert a single feature record will doom the entire sequence load and  
you would need to start over with the sequence. To fix this, I wrote  
the rules, which in essence do do the lookups for PostgreSQL that the  
bioperl-db code would otherwise avoid, and on insert do nothing if the  
record is found, which results in zero rows affected when you would  
expect one (which is what bioperl-db cues off of and then triggers a  
lookup).
The right way to do this meanwhile is to use nested transactions,
which PostgreSQL has supported since v8.0.x, but I haven't gotten around
to implementing support for that in Bioperl-db.
</gory-details>
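
For readers unfamiliar with the pattern, the insert-first-then-look-up
approach with nested transactions can be sketched as follows. This is
an illustration under stated assumptions, not Bioperl-db or Biopython
code: it uses Python's sqlite3 module as a stand-in database, and the
term table and insert_or_lookup helper are made up for the example.
SQLite does not doom the transaction the way PostgreSQL does, so the
savepoint is not strictly needed here; the point is the statement
sequence (SAVEPOINT, ROLLBACK TO SAVEPOINT, RELEASE SAVEPOINT), which
PostgreSQL also accepts.

```python
import sqlite3

# isolation_level=None leaves transaction control to the explicit SQL below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")


def insert_or_lookup(conn, name):
    """Try the insert first; on a unique key violation, look the row up."""
    conn.execute("SAVEPOINT try_insert")
    try:
        cursor = conn.execute("INSERT INTO term (name) VALUES (?)", (name,))
        term_id = cursor.lastrowid
    except sqlite3.IntegrityError:
        # Undo only the failed insert; the outer transaction stays usable.
        conn.execute("ROLLBACK TO SAVEPOINT try_insert")
        row = conn.execute(
            "SELECT term_id FROM term WHERE name = ?", (name,)
        ).fetchone()
        term_id = row[0]
    conn.execute("RELEASE SAVEPOINT try_insert")
    return term_id


conn.execute("BEGIN")
first = insert_or_lookup(conn, "gene")
second = insert_or_lookup(conn, "CDS")
again = insert_or_lookup(conn, "gene")  # duplicate, resolved by the lookup
conn.execute("COMMIT")

term_total = conn.execute("SELECT COUNT(*) FROM term").fetchone()[0]
```

The whole load still commits as one transaction, and a failed insert
costs one rollback to the savepoint instead of a restart.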

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From bugzilla-daemon at portal.open-bio.org  Fri May 22 03:56:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 May 2009 23:56:13 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905220356.n4M3uDfM021127@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #12 from cymon.cox at gmail.com  2009-05-21 23:56 EST -------
After deleting the RULES in the BioSQL schema, all the new unittests pass.

(All the RULES can be deleted as they are all there to circumvent the problem
in Bioperl-db described by Hilmar Lapp on the biopython-dev list:

http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html

See also the comment in the schema.)

C.




From bugzilla-daemon at portal.open-bio.org  Fri May 22 08:41:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 04:41:39 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905220841.n4M8fd3w015716@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #13 from andrea at biodec.com  2009-05-22 04:41 EST -------
(In reply to comment #12)
> After deleting the RULES in the BioSQL schema, all the new unittests pass.
> 
> (All the RULES can be deleted as they are all there to circumvent the problem
> in Bioperl-db described by Hilmar Lapp on the biopython-dev list:
> 
> http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html
> 
> See also the comment in the schema.)
> 
> C.

I've deleted the two rules,
rule_bioentry_i1
rule_bioentry_i2

and then I ran the tests:
Make sure can't import records with same ID (in one go). ... ok
Make sure can't import records with same ID (in steps). ... ok
Make sure can't import records with same ID (in steps with commit). ... ok
Make sure can't import a single record twice (in one go). ... ok
Make sure can't import a single record twice (in steps). ... ok
Make sure can't import a single record twice (in steps with commit). ... ok
Make sure all records are correctly loaded. ... ok
Make sure can't reimport existing records. ... ok
Indepth check that SeqFeatures are transmitted through the db. ... ok
Load SeqRecord objects into a BioSQL database. ... ok
Get a list of all items in the database. ... ok
Test retrieval of items using various ids. ... ok
Check can add DBSeq objects together. ... ok
Check can turn a DBSeq object into a Seq or MutableSeq. ... ok
Make sure Seqs from BioSQL implement the right interface. ... ok
Check SeqFeatures of a sequence. ... ok
Make sure SeqRecords from BioSQL implement the right interface. ... ok
Check that slices of sequences are retrieved properly. ... ok

----------------------------------------------------------------------
Ran 18 tests in 58.371s

OK

with python2.4, python2.5, psycopg, psycopg2.
Everything seems to be OK. I don't know what other effects this deletion
might trigger, but I think the change should be applied as soon as
possible to the BioSQL PostgreSQL schema (and also to the test BioSQL
PostgreSQL schema).


After removing the rules I ran my own tests:
.....
>>> ## LOAD INTO DB
>>> db.load([s1])
1
>>> db.load([s2])
1
>>> db.load([s3])
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "../BioSQL/BioSeqDatabase.py", line 442, in load
  File "../BioSQL/Loader.py", line 50, in load_seqrecord
  File "../BioSQL/Loader.py", line 550, in _load_bioentry_table
  File "../BioSQL/BioSeqDatabase.py", line 301, in execute
IntegrityError: duplicate key value violates unique constraint
"bioentry_accession_key"

And I got the error, which is the expected, normal behaviour.
So now I only have to trap the exception or pre-check for duplicates.
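
Both options (trapping the exception, or pre-checking for duplicates) can be
sketched as follows. This is only an illustration under stated assumptions: it
uses the standard library sqlite3 module and a toy one-column table as a
stand-in for the bioentry table, and the helper names already_loaded and
load_one are hypothetical, not part of BioSQL.

```python
import sqlite3

# Toy stand-in for the BioSQL bioentry table (assumption: one UNIQUE column).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bioentry (accession TEXT UNIQUE)")


def already_loaded(conn, accession):
    """Pre-check: return True if the accession is already in the table."""
    row = conn.execute(
        "SELECT 1 FROM bioentry WHERE accession = ?", (accession,)
    ).fetchone()
    return row is not None


def load_one(conn, accession):
    """Trap the exception: return 1 if inserted, 0 if it was a duplicate."""
    try:
        conn.execute("INSERT INTO bioentry (accession) VALUES (?)", (accession,))
    except sqlite3.IntegrityError:
        conn.rollback()  # discard the failed, uncommitted work
        return 0
    conn.commit()
    return 1


# The third load repeats the first accession and is reported as 0.
counts = [load_one(conn, acc) for acc in ("s1", "s2", "s1")]
```

With a real BioSQL database the exception class would come from the database
driver in use (for example psycopg2), so the except clause would need adjusting.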

Many Thanks
Andrea


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Fri May 22 12:06:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 08:06:36 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905221206.n4MC6aWo000368@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-22 08:06 EST -------
(In reply to comment #13)
> (In reply to comment #12)
> > After deleting the RULES in the BioSQL schema, all the new unittests pass.
> > 
> > (All the RULES can be deleted as they are all there to circumvent the
> > problem in Bioperl-db described by Hilmar Lapp on the biopython-dev list:
> > 
> > http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html
> > 
> > See also the comment in the schema.)
> > 
> > C.

Well spotted Cymon - I'd missed that.

> I've deleted the two rules, 
> rule_bioentry_i1
> rule_bioentry_i2
> 
> ...
> with python2.4, python2.5, psycopg, psycopg2.
> Everything seems to be OK.
> ...
> After removing the rules I ran my own tests:
> .....
> >>> ## LOAD INTO DB
> >>> db.load([s1])
> 1
> >>> db.load([s2])
> 1
> >>> db.load([s3])
> Traceback (most recent call last):
>   File "<console>", line 1, in ?
>   File "../BioSQL/BioSeqDatabase.py", line 442, in load
>   File "../BioSQL/Loader.py", line 50, in load_seqrecord
>   File "../BioSQL/Loader.py", line 550, in _load_bioentry_table
>   File "../BioSQL/BioSeqDatabase.py", line 301, in execute
> IntegrityError: duplicate key value violates unique constraint
> "bioentry_accession_key"
> 
> And I got the error, which is the expected, normal behaviour.
> So now I only have to trap the exception or pre-check for duplicates.

Great.

It will be down to BioSQL to change the schema (in conjunction with BioPerl),
but Hilmar seems to be looking into this:
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006084.html

I suppose in the short term we could change our local copy of the schema used
in the Biopython unit tests...

Peter




From biopython at maubp.freeserve.co.uk  Fri May 22 12:27:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 13:27:06 +0100
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
Message-ID: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>

Hi all,

This is a continuation of a thread / bug report from Biopython (Bug 2833)
where attempting to import duplicate entries into BioSQL did not raise an
error on PostgreSQL (but does on MySQL). Cymon traced this to the
RULES present in the schema to help bioperl-db.
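
The difference in behaviour can be imitated without PostgreSQL. The sketch
below is an analogy only: it uses sqlite3 from the standard library, SQLite's
INSERT OR IGNORE stands in for the do-nothing RULES, and the table and the
accession 'X00001' are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bioentry (accession TEXT UNIQUE)")
conn.execute("INSERT INTO bioentry (accession) VALUES ('X00001')")

# Plain INSERT of a duplicate: the unique constraint raises an error.
try:
    conn.execute("INSERT INTO bioentry (accession) VALUES ('X00001')")
    plain_insert_raised = False
except sqlite3.IntegrityError:
    plain_insert_raised = True

# "Do nothing on conflict" (analogous to the RULES): no error is raised,
# and the only hint is that zero rows were affected.
cursor = conn.execute(
    "INSERT OR IGNORE INTO bioentry (accession) VALUES ('X00001')"
)
silently_ignored = cursor.rowcount == 0

row_count = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
```

Code that relies on an exception to detect duplicates never sees the second
case, which matches the symptom reported in Bug 2833.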

On Fri, May 22, 2009 at 3:05 AM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 21, 2009, at 6:52 PM, Cymon Cox wrote:
>
>> [...]
>>
>> Hi Andrea,
>>
>> The problem appears to be related to the BioSQL schema/PostGreSQL.
>>
>> As you indicated, adding a duplicate entry to bioentry returns an "INSERT 0
>> 0" and doesn't throw an IntegrityError, which is what the code is looking
>> for and presumably what MySQL throws.
>>
>> The reason it doesn't throw an error is because of one (or both) of the
>> RULES in the schema:
>
> Indeed, I'd almost forgotten. The rules are there mostly as a remnant from
> earlier versions of PostgreSQL to support transactional loading the way
> bioperl-db (the object-relational mapping for BioPerl) is optimized. You
> probably don't need them anywhere else.
>
>         -hilmar
>
> <gory-details>
> Bioperl-db is optimized such that entities that very likely don't exist yet
> in the database are attempted for insert right away. If the insert fails due
> to a unique key violation, the record is looked up (and then expected to be
> found). In Oracle and MySQL you can do this and the transaction remains
> healthy; i.e., you can commit the transaction later and all statements
> except those that failed will be committed. In PostgreSQL any failed
> statement dooms the entire transaction, and the only way out is a rollback.
> In this case, if you want the loading of one sequence record as one
> transaction, failing to insert a single feature record will doom the entire
> sequence load and you would need to start over with the sequence. To fix
> this, I wrote the rules, which in essence do do the lookups for PostgreSQL
> that the bioperl-db code would otherwise avoid, and on insert do nothing if
> the record is found, which results in zero rows affected when you would
> expect one (which is what bioperl-db cues off of and then triggers a
> lookup).
> The right way to do this nowadays is to use nested transactions, which
> PostgreSQL has supported since v8.0.x, but I haven't gotten around to
> implementing support for that in Bioperl-db.
> </gory-details>
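
The nested-transaction approach described in the quoted text can be sketched
with savepoints. This is a minimal illustration, not bioperl-db or BioSQL
code: it assumes the standard library sqlite3 module and a toy table, and the
function name insert_or_skip is hypothetical. SQLite, unlike PostgreSQL, does
not doom a transaction after a failed statement, so the sketch only shows the
shape of the SAVEPOINT / ROLLBACK TO SAVEPOINT pattern.

```python
import sqlite3

# isolation_level=None: the module issues no implicit BEGIN, so the
# transaction and savepoints are controlled explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE bioentry (accession TEXT UNIQUE)")


def insert_or_skip(conn, accession):
    """Try an insert inside a savepoint; undo only that insert on failure."""
    conn.execute("SAVEPOINT try_insert")
    try:
        conn.execute("INSERT INTO bioentry (accession) VALUES (?)", (accession,))
    except sqlite3.IntegrityError:
        # Roll back to the savepoint; the outer transaction stays usable.
        conn.execute("ROLLBACK TO SAVEPOINT try_insert")
        conn.execute("RELEASE SAVEPOINT try_insert")
        return False
    conn.execute("RELEASE SAVEPOINT try_insert")
    return True


conn.execute("BEGIN")
results = [insert_or_skip(conn, acc) for acc in ("A1", "A2", "A1", "A3")]
conn.execute("COMMIT")  # everything except the failed insert is committed

rows = [r[0] for r in conn.execute("SELECT accession FROM bioentry ORDER BY accession")]
```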

Hilmar,

It seems for Biopython to work properly with BioSQL on PostgreSQL
these bioentry rules should be removed from the schema (as the
comments in the schema do suggest). Obviously doing this would
break any installation also using the current version of bioperl-db.

Do the RULES affect BioJava or BioRuby using BioSQL on
PostgreSQL?

Are you happy to remove these RULES in BioSQL v1.0.x (after
making the outlined transactional changes in bioperl-db)?

Thanks,

Peter


From hlapp at gmx.net  Fri May 22 15:03:11 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 May 2009 11:03:11 -0400
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
Message-ID: <CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>


On May 22, 2009, at 8:27 AM, Peter wrote:

> Are you happy to remove these RULES in BioSQL v1.0.x (after
> making the outlined transactional changes in bioperl-db)?

In principle yes. It would also mean dropping support for PostgreSQL  
v7.x, but I would hope that that's a non-issue.

But if anyone here is still using and relying on PostgreSQL v7.x (or  
earlier?) do let us know, please.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri May 22 15:57:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 16:57:38 +0100
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
Message-ID: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>

On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 22, 2009, at 8:27 AM, Peter wrote:
>
>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>> making the outlined transactional changes in bioperl-db)?
>
> In principle yes. It would also mean dropping support for PostgreSQL v7.x,
> but I would hope that that's a non-issue.
>
> But if anyone here is still using and relying on PostgreSQL v7.x (or
> earlier?) do let us know, please.

Great.

In the meantime could you add a big warning about this issue to the
INSTALL notes for PostgreSQL (i.e. recommend removing the RULES
section if not using bioperl-db)?
http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL

Peter


From biopython at maubp.freeserve.co.uk  Fri May 22 16:06:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 17:06:21 +0100
Subject: [Biopython-dev] Peter at a conference next week
Message-ID: <320fb6e00905220906l2446afbfk9804599db74a4d66@mail.gmail.com>

Hi all,

Just to let you know I will be at a conference next week, so don't
expect (Biopython) email replies as promptly as usual. I may even
leave my laptop at home ;)

Peter


From hlapp at gmx.net  Fri May 22 18:20:58 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 May 2009 14:20:58 -0400
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
	<320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
Message-ID: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>

Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar

On May 22, 2009, at 11:57 AM, Peter wrote:

> On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>>
>> On May 22, 2009, at 8:27 AM, Peter wrote:
>>
>>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>>> making the outlined transactional changes in bioperl-db)?
>>
>> In principle yes. It would also mean dropping support for  
>> PostgreSQL v7.x,
>> but I would hope that that's a non-issue.
>>
>> But if anyone here is still using and relying on PostgreSQL v7.x (or
>> earlier?) do let us know, please.
>
> Great.
>
> In the meantime could you add a big warning about this issue to the
> INSTALL notes for PostgreSQL (i.e. recommend removing the RULES
> section if not using bioperl-db)?
> http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL
>
> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From bugzilla-daemon at portal.open-bio.org  Fri May 22 18:37:21 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 14:37:21 -0400
Subject: [Biopython-dev] [Bug 2837] New: Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
Message-ID: <bug-2837-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837

           Summary: Reading Roche 454 SFF sequence read files in Bio.SeqIO
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


Roche 454 sequencing returns the read data in SFF files, a documented binary
format, capturing the sequence letters and qualities together with trimming
information. It would be nice to support reading (and in the longer term also
writing) these files directly with Bio.SeqIO.

See this thread for background:
http://lists.open-bio.org/pipermail/biopython/2009-April/005083.html




From bugzilla-daemon at portal.open-bio.org  Fri May 22 18:39:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 14:39:26 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200905221839.n4MIdQU5008555@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-22 14:39 EST -------
Created an attachment (id=1303)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1303&action=view)
Bio/SeqIO/RocheSffIO.py

This is a rough SeqIO parser constructing SeqRecord objects using a parser
contributed by Jose Blanca. Additional work would be required for paired end
reads - and even more work to be able to write out these files.

Potentially Jose's parser could be exposed as a public module under
Bio.Sequencing, but here it is just two private classes.




From biopython at maubp.freeserve.co.uk  Fri May 22 18:40:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 19:40:45 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
Message-ID: <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>

On Fri, Apr 17, 2009 at 12:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
>> Hi Peter:
>> Here you have some code to read the sff files.
>
> Thanks - I'm not sure when I'll get to look at this, maybe next week.
>
>> For the time being it creates a dict for the sequences. I'm not sure about
>> how to integrate the generated data in BioPython. The sequence and
>> qualities should go to a SeqRecord, but there is also the information
>> about the clipping.
>
> For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to
> be able to read and write SFF files, and to do that we'll have to record all
> the essential annotation (i.e. clipping) somehow.

I've had a look at your code this evening, and written a rough SeqIO
module using it, available here on enhancement Bug 2837,
http://bugzilla.open-bio.org/show_bug.cgi?id=2837

> Can you write SFF files?
>
>> For my work I use a kind of SeqRecord with a mask property and the
>> mask is a Location that shows which part of the sequence is ok. I don't
>> know if that's a valid model for BioPython.
>
> A mask could be done as a list of booleans, and we can treat it as
> another per-letter-annotation in the SeqRecord. I'm not sure if this
> is helpful or not.
>
> The Roche tools let you choose to extract trimmed reads as FASTA
> and QUAL, or untrimmed. Perhaps for reading SFF files with
> Bio.SeqIO we should get the user to choose between these
> options (e.g. format names "roche-sff" and "roche-sff-notrim")?

This would work...

> Roche's FASTA files use upper case for the trimmed region, and
> lower case for the start/end which would get trimmed off. This is
> simple and we could do this for Biopython too - meaning you'd get
> the same data if you read the SFF file directly, or used Roche's
> FASTA+QUAL files with SeqIO. Note that when reading an SFF
> file directly, we should probably record the real trim data as well.

In my current code, I decided to use the same quality trimming
representation that Roche use if converting the SFF file into FASTA
format (the leading and trailing trim regions are in lower case). We
may want to record the trim positions in the SeqRecord's annotation
as well.
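
The case-based representation can be sketched as below. This is an
illustration only, not the code attached to Bug 2837: the function name
mask_trimmed is hypothetical, and it assumes the clip values are 1-based,
inclusive positions of the first and last base to keep, with 0 meaning
"not set".

```python
def mask_trimmed(seq, clip_left, clip_right):
    """Return seq with the kept region in upper case, trimmed ends in lower.

    Assumed convention: clip_left and clip_right are 1-based, inclusive
    positions of the first and last base to keep; 0 means "not set".
    """
    start = clip_left - 1 if clip_left > 0 else 0
    end = clip_right if clip_right > 0 else len(seq)
    return seq[:start].lower() + seq[start:end].upper() + seq[end:].lower()


# Keep bases 5..12 of a 14-base read.
example = mask_trimmed("TCAGACGTACGTNN", 5, 12)
```

Because the case of each letter encodes the trim region, the trim positions
could be recovered from the sequence alone, but storing them explicitly in the
annotation avoids having to re-derive them.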

>> There are also a couple more tricks with the clipping.
>> In theory there's clip_qual and clip_adapter, but in the files
>> we've seen clip_adapter is always zero and clip_quality is used
>> instead for both quality and adapter. I think we could generate
>> one clipping combining both. Let me know what you think.
>> Also take into account that in some cases the clippings generated
>> by the 454 software are just wrong.
>
> I'll need to learn more about the details before coming to any
> conclusions about how to deal with this information in Biopython.

Right now I have not looked at the left/right adaptor clipping information;
as you found, in the example file I have looked at, these fields are zero.
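
If the adaptor fields do turn out to be used, Jose's suggestion of generating
one clipping that combines both could look like the sketch below. The function
name combined_clip is hypothetical, and the conventions (1-based inclusive
positions, 0 meaning "not set") are assumptions that would need checking
against the SFF format documentation.

```python
def combined_clip(length, qual_left, qual_right, adapter_left, adapter_right):
    """Merge quality and adapter clips into a single (left, right) pair.

    Assumed convention: positions are 1-based and inclusive, and a value
    of 0 means the clip was not set.
    """
    # Left edge: the innermost (largest) of the two, but never below 1.
    left = max(1, qual_left, adapter_left)
    # Right edge: the innermost (smallest), treating 0 as "read end".
    right = min(qual_right or length, adapter_right or length)
    return left, right


only_quality = combined_clip(100, 5, 90, 0, 0)
both_set = combined_clip(100, 5, 90, 8, 80)
nothing_set = combined_clip(100, 0, 0, 0, 0)
```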

Note I will be away for the next week, so am unlikely to respond to
any emails on this.

Peter


From bugzilla-daemon at portal.open-bio.org  Fri May 22 19:23:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 15:23:44 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200905221923.n4MJNiAe013574@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


spenthil at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |spenthil at gmail.com




From bugzilla-daemon at portal.open-bio.org  Fri May 22 21:16:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 17:16:07 -0400
Subject: [Biopython-dev] [Bug 2838] New: If a SeqRecord containing Genbank
	information is read from BioSQL,
	it cannot be written to another BioSQL database
Message-ID: <bug-2838-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2838

           Summary: If a SeqRecord containing Genbank information is read
                    from BioSQL, it cannot be written to another BioSQL
                    database
           Product: Biopython
           Version: 1.49
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: david.wyllie at ndm.ox.ac.uk


I've been trying to annotate some microbial sequences; some are from GenBank.
So the proposed series of events was:
1) get sequences from GenBank
2) store in BioSQL database called One
3) recover them from BioSQL
4) annotate the recovered SeqRecords [this works, but isn't necessary for this
problem to be reproduced - here, I'm making no changes at all to the SeqRecord]
5) store the annotated SeqRecords in a different BioSQL database called Two.

The problem is that Step 5 fails when the original record was recovered from
GenBank.

The traceback (below) indicates a problem with the BioSQL loader in 
_load_bioentry_date

Here is the screen output, including traceback.
The program (attached) first loads a record from GenBank, writes it to One,
then recovers it from One; at this point the record has changed, in
particular in the way date fields are represented.

The Entrez load has a /date annotation which is not a list:
 /date=26-MAY-2005
while the reloaded version has two date fields, both of them lists:
 /dates=['26-MAY-2005']
 /date=['26-MAY-2005']

Whether this is relevant I'm not sure. 

The subsequent write of the recovered version to Two fails.
As a control, I've checked that the original version can be written to Two
successfully.
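
If the list-valued date does turn out to be the cause (this is not
established above), one possible workaround before writing to Two would be to
flatten it back to a string. The sketch below works on a plain dictionary of
the same shape as the annotations shown in the output; the function name
flatten_date is hypothetical and this is not a fix to the BioSQL loader
itself.

```python
def flatten_date(annotations):
    """Return a copy of annotations with 'date' as a plain string.

    A list-valued date such as ['26-MAY-2005'] becomes '26-MAY-2005'.
    Other keys, including 'dates', are left untouched.
    """
    fixed = dict(annotations)
    date = fixed.get("date")
    if isinstance(date, (list, tuple)):
        if date:
            fixed["date"] = date[0]
        else:
            del fixed["date"]  # an empty list carries no date at all
    return fixed


reloaded = {"date": ["26-MAY-2005"], "dates": ["26-MAY-2005"], "gi": "28804743"}
cleaned = flatten_date(reloaded)
```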

I'm a novice with Python and Biopython so please accept my apologies if there
is something obvious and very stupid responsible for this.

---------------------------------------------------------------------------
dwyllie at dwyllie:~/programs/Project/src$ python dbtestcase.py
OK, going to recover record 28804743  from genbank....
Record loaded looks like this:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference instance at 0x2190b90>,
<Bio.SeqFeature.Reference instance at 0x219a5a8>, <Bio.SeqFeature.Reference
instance at 0x219a5f0>, <Bio.SeqFeature.Reference instance at 0x219a6c8>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Load from Entrez completed, records= 1
Here is the loaded record:
========================================================================
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference instance at 0x2190b90>,
<Bio.SeqFeature.Reference instance at 0x219a5a8>, <Bio.SeqFeature.Reference
instance at 0x219a5f0>, <Bio.SeqFeature.Reference instance at 0x219a6c8>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Now loading these records into a BioSQL database One.
/var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning:
the sets module is deprecated
  from sets import ImmutableSet
Creating a new database  One
========================================================================
Load from database One completed, records= 1
========================================================================
Here is the record recovered from database One:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/dates=['26-MAY-2005']
/ncbi_taxid=3225
/date=['26-MAY-2005']
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Bryopsida',
'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon', 'Ceratodon purpureus']
/source=['chloroplast Ceratodon purpureus']
/references=[<Bio.SeqFeature.Reference instance at 0x235d9e0>,
<Bio.SeqFeature.Reference instance at 0x235db90>, <Bio.SeqFeature.Reference
instance at 0x235dcf8>, <Bio.SeqFeature.Reference instance at 0x235de60>]
/gi=28804743
/data_file_division=PLN
/keywords=['']
/organism=Ceratodon purpureus
/sequence_version=['1']
/accessions=['AB098727']
DBSeq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
DNAAlphabet())
========================================================================
Creating a new database  Two
Traceback (most recent call last):
  File "dbtestcase.py", line 206, in <module>
    from dbtestcase import AuthDetails
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 225, in
<module>
    DemonstrateProblem(problemgi,ad)
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 199, in
DemonstrateProblem
    db2.load(listtoload)
  File "/var/lib/python-support/python2.6/BioSQL/BioSeqDatabase.py", line 430,
in load
    db_loader.load_seqrecord(cur_record)
  File "/var/lib/python-support/python2.6/BioSQL/Loader.py", line 50, in
load_seqrecord
    self._load_bioentry_date(record, bioentry_id)
  File "/var/lib/python-support/python2.6/BioSQL/Loader.py", line 577, in
_load_bioentry_date
    self.adaptor.execute(sql, (bioentry_id, date_id, date))
  File "/var/lib/python-support/python2.6/BioSQL/BioSeqDatabase.py", line 289,
in execute
    self.cursor.execute(sql, args or ())
  File "/var/lib/python-support/python2.6/MySQLdb/cursors.py", line 166, in
execute
    self.errorhandler(self, exc, value)
  File "/var/lib/python-support/python2.6/MySQLdb/connections.py", line 35, in
defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL
syntax; check the manual that corresponds to your MySQL server version for the
right syntax to use near '), 1)' at line 1")
dwyllie at dwyllie:~/programs/Project/src$




From bugzilla-daemon at portal.open-bio.org  Fri May 22 21:19:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 17:19:03 -0400
Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank
	information is read from BioSQL,
	it cannot be written to another BioSQL database
In-Reply-To: <bug-2838-42@http.bugzilla.open-bio.org/>
Message-ID: <200905222119.n4MLJ3d3026350@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2838


------- Comment #1 from david.wyllie at ndm.ox.ac.uk  2009-05-22 17:19 EST -------
Created an attachment (id=1304)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1304&action=view)
A python script which reproduces the error.




From bugzilla-daemon at portal.open-bio.org  Fri May 22 22:46:04 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 18:46:04 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous
	bioentry_id
In-Reply-To: <bug-2833-42@http.bugzilla.open-bio.org/>
Message-ID: <200905222246.n4MMk4QO000548@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |2839


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython at maubp.freeserve.co.uk  Fri May 22 22:46:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 23:46:54 +0100
Subject: [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
	<320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
	<410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>
Message-ID: <320fb6e00905221546i26edc7a2u2a02fb0d01c374ea@mail.gmail.com>

On 5/22/09, Hilmar Lapp <hlapp at gmx.net> wrote:
> Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar

I've filed Bug 2839, hopefully this is what you had in mind:
http://bugzilla.open-bio.org/show_bug.cgi?id=2839

Peter


From chapmanb at 50mail.com  Fri May 22 22:54:32 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 22 May 2009 18:54:32 -0400
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
Message-ID: <20090522225432.GU84112@sobchak.mgh.harvard.edu>

Peter and Jose;
I haven't used SFF files myself as we don't have a 454 machine, but I
do know of a couple of implementations of SFF to FASTQ/FASTA conversion.
Flower is a Haskell implementation:

http://blog.malde.org/index.php/flower/

And PyroBayes is a 454 base caller:

http://bioinformatics.bc.edu/marthlab/PyroBayes

Depending on what you all end up doing, these might be useful as
comparison points, or for wrapping with Application command lines.

Brad

> On Fri, Apr 17, 2009 at 12:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> > On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> >> Hi Peter:
> >> Here you have some code to read the sff files.
> >
> > Thanks - I'm not sure when I'll get to look at this, maybe next week.
> >
> >> For the time being it creates a dict for the sequences. I'm not sure about
> >> how to integrate the generated data in BioPython. The sequence and
> >> qualities should go to a SeqRecord, but there is also the information
> >> about the clipping.
> >
> > For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to
> > be able to read and write SFF files, and to do that we'll have to record all
> > the essential annotation (i.e. clipping) somehow.
> 
> I've had a look at your code this evening, and written a rough SeqIO
> module using it, available here on enhancement Bug 2837,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2837
> 
> > Can you write SFF files?
> >
> >> For my work I use a kind of SeqRecord with a mask property and the
> >> mask is a Location that shows which part of the sequence is ok. I don't
> >> know if that's a valid model for BioPython.
> >
> > A mask could be done as a list of booleans, and we can treat it as
> > another per-letter-annotation in the SeqRecord. I'm not sure if this
> > is helpful or not.
> >
> > The Roche tools let you choose to extract trimmed reads as FASTA
> > and QUAL, or untrimmed. Perhaps for reading SFF files with
> > Bio.SeqIO we should get the user to choose between these
> > options (e.g. format names "roche-sff" and "roche-sff-notrim")?
> 
> This would work...
> 
> > Roche's FASTA files use upper case for the trimmed region, and
> > lower case for the start/end which would get trimmed off. This is
> > simple and we could do this for Biopython too - meaning you'd get
> > the same data if you read the SFF file directly, or used Roche's
> > FASTA+QUAL files with SeqIO. Note that when reading an SFF
> > file directly, we should probably record the real trim data as well.
> 
> In my current code, I decided to use the same quality trimming
> representation that Roche use if converting the SFF file into FASTA
> format (the leading and trailing trim regions are in lower case). We
> may want to record the trim positions in the SeqRecord's annotation
> as well.
> 
> >> There are also a couple more tricks with the clipping.
> >> In theory there's clip_qual and clip_adapter, but in the files
> >> we've seen clip_adapter is always zero and clip_quality is used
> >> instead for both quality and adapter. I think we could generate
> >> one clipping combining both. Let me know what you think.
> >> Also take into account that in some cases the clippings generated
> >> by the 454 software are just wrong.
> >
> > I'll need to learn more about the details before coming to any
> > conclusions about how to deal with this information in Biopython.
> 
> Right now I have not looked at the left/right adaptor clipping information;
> as you found, in the example file I have looked at these fields are zero.
> 
> Note I will be away for the next week, so am unlikely to respond to
> any emails on this.
> 
> Peter
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From bugzilla-daemon at portal.open-bio.org  Fri May 22 22:58:24 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 18:58:24 -0400
Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank
	information is read from BioSQL,
	it cannot be written to another BioSQL database
In-Reply-To: <bug-2838-42@http.bugzilla.open-bio.org/>
Message-ID: <200905222258.n4MMwOXA001311@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2838


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-22 18:58 EST -------
(In reply to comment #0)
> I've been trying to annotate some microbial sequences; some are from genbank.
> So the proposed series of events was:
> 1) get sequences from genbank
> 2) store in BioSQL database called One
> 3) recover them from BioSql
> 4) annotate the recovered SeqRecords [this works, but isn't
>    necessary for this problem to be reproduced - here, I'm
>    making no changes at all to the SeqRecord]
> 5) store the annotated SeqRecords in a different BioSQL database called Two.
> 
> The problem is that Step 5 fails when the original record was recovered from
> Genbank.
> 
> The traceback (below) indicates a problem with the BioSQL loader in 
> _load_bioentry_date
> ...
> I'm a novice with Python and Biopython so please accept my apologies if
> there is something obvious and very stupid responsible for this.

What you are trying to do sounds very reasonable (although I have never
actually needed to or tried to do this myself). You were right about the date
thing, the loader code only expected a string, not a list. Fixed in CVS
revision 1.40 of BioSQL/Loader.py, and I have also added a unit test for this
use case in Tests/test_BioSQL.py revision 1.36.
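
To illustrate the idea behind the fix, here is a minimal stand-alone sketch
(not the actual BioSQL/Loader.py code; the helper name normalise_date is made
up for this example). It accepts the date annotation either as a string or as
a list of strings, and falls back to today's date in GenBank style if the
record has no date at all:

```python
from time import gmtime, strftime


def normalise_date(annotations):
    """Return a GenBank style date string, e.g. 14-SEP-2000.

    annotations - a dict like SeqRecord.annotations (assumed here);
    the "date" entry may be a string, a list of strings, or missing.
    """
    date = annotations.get("date", strftime("%d-%b-%Y", gmtime()).upper())
    if isinstance(date, list):
        # Records recovered from BioSQL can carry the date as a list,
        # so take the first entry.
        date = date[0]
    return date


print(normalise_date({"date": "26-MAY-2005"}))
print(normalise_date({"date": ["26-MAY-2005"]}))
```

Both calls should give the same string, which is the single value the
database insert expects.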

Note there is a known minor discrepancy with dates (see Bug 2681) when
comparing the original SeqRecord to the DBSeqRecord after loading/retrieving
from BioSQL.

If you could confirm this solves your problem, I think we can close this bug.
Thank you!

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From chapmanb at 50mail.com  Fri May 22 22:54:32 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 22 May 2009 18:54:32 -0400
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
Message-ID: <20090522225432.GU84112@sobchak.mgh.harvard.edu>

Peter and Jose;
I haven't used SFF files myself as we don't have a 454 machine, but I
do know of a couple of implementations of SFF to FASTQ/FASTA conversion.
Flower is a Haskell implementation:

http://blog.malde.org/index.php/flower/

And PyroBayes is a 454 base caller:

http://bioinformatics.bc.edu/marthlab/PyroBayes

Depending on what you all end up doing, these might be useful as
comparison points, or for wrapping with Application command lines.

Brad

> On Fri, Apr 17, 2009 at 12:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> > On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> >> Hi Peter:
> >> Here you have some code to read the sff files.
> >
> > Thanks - I'm not sure when I'll get to look at this, maybe next week.
> >
> >> For the time being it creates a dict for the sequences. I'm not sure about
> >> how to integrate the generated data in BioPython. The sequence and
> >> qualities should go to a SeqRecord, but there is also the information
> >> about the clipping.
> >
> > For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to
> > be able to read and write SFF files, and to do that we'll have to record all
> > the essential annotation (i.e. clipping) somehow.
> 
> I've had a look at your code this evening, and written a rough SeqIO
> module using it, available here on enhancement Bug 2837,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2837
> 
> > Can you write SFF files?
> >
> >> For my work I use a kind of SeqRecord with a mask property and the
> >> mask is a Location that shows which part of the sequence is ok. I don't
> >> know if that's a valid model for BioPython.
> >
> > A mask could be done as a list of booleans, and we can treat it as
> > another per-letter-annotation in the SeqRecord. I'm not sure if this
> > is helpful or not.
> >
> > The Roche tools let you choose to extract trimmed reads as FASTA
> > and QUAL, or untrimmed. Perhaps for reading SFF files with
> > Bio.SeqIO we should get the user to choose between these
> > options (e.g. format names "roche-sff" and "roche-sff-notrim")?
> 
> This would work...
> 
> > Roche's FASTA files use upper case for the trimmed region, and
> > lower case for the start/end which would get trimmed off. This is
> > simple and we could do this for Biopython too - meaning you'd get
> > the same data if you read the SFF file directly, or used Roche's
> > FASTA+QUAL files with SeqIO. Note that when reading an SFF
> > file directly, we should probably record the real trim data as well.
> 
> In my current code, I decided to use the same quality trimming
> representation that Roche uses when converting the SFF file into FASTA
> format (the leading and trailing trim regions are in lower case). We
> may want to record the trim positions in the SeqRecord's annotation
> as well.
> 
> >> There are also a couple more tricks with the clipping.
> >> In theory there's clip_qual and clip_adapter, but in the files
> >> we've seen clip_adapter is always zero and clip_quality is used
> >> instead for both quality and adapter. I think we could generate
> >> one clipping combining both. Let me know what you think.
> >> Also take into account that in some cases the clipping generated
> >> by the 454 software is just wrong.
> >
> > I'll need to learn more about the details before coming to any
> > conclusions about how to deal with this information in Biopython.
> 
> Right now I have not looked at the left/right adaptor clipping information;
> as you found, in the example file I have looked at these fields are zero.
> 
> Note I will be away for the next week, so am unlikely to respond to
> any emails on this.
> 
> Peter
> 


From biopython at maubp.freeserve.co.uk  Fri May 22 23:09:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 23 May 2009 00:09:56 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <20090522225432.GU84112@sobchak.mgh.harvard.edu>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com>

On 5/22/09, Brad Chapman <chapmanb at 50mail.com> wrote:
> Peter and Jose;
>  I haven't used SFF files myself as we don't have a 454 machine,

We don't have one in house either, and have instead out-sourced to a
couple of sequencing centres in the UK with 454 machines.

>  but do know of a couple of implementations of SFF TO
>  Fastq/Fasta.
>  Flower is a Haskell implementation:
>
>  http://blog.malde.org/index.php/flower/
>
>  And PyroBayes is a 454 base caller:
>
>  http://bioinformatics.bc.edu/marthlab/PyroBayes
>
>  Depending on what you all end up doing, these might be useful as
>  comparison points, or for wrapping with Application command lines.

I would say Roche's own tools are the best reference, but these only
output FASTA and QUAL, not FASTQ files (at the moment at least). So
yes, being able to compare a Biopython SFF to FASTQ conversion with
that by Flower (or anything else) would be handy.
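
For such a comparison the case-based trimming convention discussed earlier in
this thread (upper case for the kept region, lower case for the clipped ends)
would need checking too. A rough sketch of how that could be done, using
plain strings and lists of quality scores rather than any particular parser
(the function names clip_from_case and same_read are invented for this
example):

```python
def clip_from_case(seq):
    """Return (start, end) of the upper case region as Python slice bounds.

    Leading and trailing lower case letters are taken to be the clipped
    ends, following the Roche FASTA convention described in this thread.
    """
    start = 0
    while start < len(seq) and seq[start].islower():
        start += 1
    end = len(seq)
    while end > start and seq[end - 1].islower():
        end -= 1
    return start, end


def same_read(seq_a, qual_a, seq_b, qual_b):
    """Check two conversions agree on bases, qualities and clipping."""
    return (seq_a.upper() == seq_b.upper()
            and list(qual_a) == list(qual_b)
            and clip_from_case(seq_a) == clip_from_case(seq_b))


print(clip_from_case("acgTTGCAtt"))
```

Two tools which agree on the bases and qualities but disagree on the clip
points would then show up as a mismatch rather than passing silently.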

Peter


From spenthil at gmail.com  Fri May 22 23:52:30 2009
From: spenthil at gmail.com (Senthil Palanisami)
Date: Fri, 22 May 2009 16:52:30 -0700
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> 
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> 
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> 
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
	<320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com>
Message-ID: <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com>

I have been working with SFF files for the past month, and can say it's
definitely frustrating working with custom binary formats.

Take a look at sff_extract which is written in python. It converts sff files
into fasta and xml or caf files:
http://bioinf.comav.upv.es/sff_extract/index.html

You can find detailed specs of the format @
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global


--
Senthil Palanisami
http://spenthil.com


On Fri, May 22, 2009 at 4:09 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On 5/22/09, Brad Chapman <chapmanb at 50mail.com> wrote:
> > Peter and Jose;
> >  I haven't used SFF files myself as we don't have a 454 machine,
>
> We don't have one in house either, and have instead out-sourced to a
> couple of sequencing centres in the UK with 454 machines.
>
> >  but do know of a couple of implementations of SFF TO
> >  Fastq/Fasta.
> >  Flower is a Haskell implementation:
> >
> >  http://blog.malde.org/index.php/flower/
> >
> >  And PyroBayes is a 454 base caller:
> >
> >  http://bioinformatics.bc.edu/marthlab/PyroBayes
> >
> >  Depending on what you all end up doing, these might be useful as
> >  comparison points, or for wrapping with Application command lines.
>
> I would say Roche's own tools are the best reference, but these only
> output FASTA and QUAL, not FASTQ files (at the moment at least). So
> yes, being able to compare a Biopython SFF to FASTQ conversion with
> that by Flower (or anything else) would be handy.
>
> Peter


From biopython at maubp.freeserve.co.uk  Sat May 23 00:10:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 23 May 2009 01:10:57 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
	<320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com>
	<21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com>
Message-ID: <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>

On 5/23/09, Senthil Palanisami <spenthil at gmail.com> wrote:
> I have been working with SFF files for the past month, and can say it's
>  definitely frustrating working with custom binary formats.

At least in this case it is publicly documented. Have you needed to
write out (or edit) an SFF file yet? Have you used any paired end
reads in SFF format?

>  Take a look at sff_extract which is written in python. It converts sff files
>  into fasta and xml or caf files:
>  http://bioinf.comav.upv.es/sff_extract/index.html

That is what this code is based on - Jose Blanca is one of the authors
of  sff_extract.

>  You can find detailed specs of the format @
>  http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global

I think you must have missed this thread last month ;)
http://lists.open-bio.org/pipermail/biopython/2009-April/005084.html

Peter


From bugzilla-daemon at portal.open-bio.org  Sat May 23 01:16:54 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 May 2009 21:16:54 -0400
Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank
	information is read from BioSQL,
	it cannot be written to another BioSQL database
In-Reply-To: <bug-2838-42@http.bugzilla.open-bio.org/>
Message-ID: <200905230116.n4N1GsRl010917@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2838


------- Comment #3 from david.wyllie at ndm.ox.ac.uk  2009-05-22 21:16 EST -------
Thank you!

Unfortunately I'm not sure it's fixed, or maybe there is another problem:

I have uninstalled the Biopython package using the Synaptic package manager
(previously I was using 1.49), and installed from a CVS checkout instead.

Thanks for your message at
http://osdir.com/ml/python.bio.general/2008-07/msg00035.html
I can confirm that the default Ubuntu 9.0 install lacks the python-dev package,
which provides the necessary Python.h headers.

After python-dev is installed, the build is OK and the tests pass:
running test
test_Ace ... ok
test_AlignIO ... ok
test_BioSQL ... /var/lib/python-support/python2.6/MySQLdb/__init__.py:34:
DeprecationWarning: the sets module is deprecated
  from sets import ImmutableSet
/home/dwyllie/biopython/build/lib.linux-x86_64-2.6/BioSQL/BioSeqDatabase.py:144:
Warning: 'TYPE=storage_engine' is deprecated; use 'ENGINE=storage_engine'
instead
  self.adaptor.cursor.execute(sql_line)
ok
test_BioSQL_SeqIO ... ok
test_CAPS ... ok
test_Clustalw ... ok
..

and install is OK too.  This is all new to me but it seems to work OK.

I have checked the source code and I think your patch is correctly in
place:

    def _load_bioentry_date(self, record, bioentry_id):
        """Add the effective date of the entry into the database.

        record - a SeqRecord object with an annotated date
        bioentry_id - corresponding database identifier
        """
        # dates are GenBank style, like:
        # 14-SEP-2000
        date = record.annotations.get("date",
                                      strftime("%d-%b-%Y", gmtime()).upper())
        if isinstance(date, list) : date = date[0]
        annotation_tags_id = self._get_ontology_id("Annotation Tags")
        date_id = self._get_term_id("date_changed", annotation_tags_id)
        sql = r"INSERT INTO bioentry_qualifier_value" \
              r" (bioentry_id, term_id, value, rank)" \
              r" VALUES (%s, %s, %s, 1)" 
        self.adaptor.execute(sql, (bioentry_id, date_id, date))


Now when I re-run dbtestcase.py (attached previously) I get a different error
message.

dwyllie at dwyllie:~/programs/CheckleyProject/src$ python dbtestcase.py
OK, going to recover record 28804743  from genbank....
Record loaded looks like this:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference object at 0x26e7a10>,
<Bio.SeqFeature.Reference object at 0x26e7a90>, <Bio.SeqFeature.Reference
object at 0x26e7b50>, <Bio.SeqFeature.Reference object at 0x26e7bd0>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Load from Entrez completed, records= 1
Here is the loaded record:
========================================================================
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference object at 0x26e7a10>,
<Bio.SeqFeature.Reference object at 0x26e7a90>, <Bio.SeqFeature.Reference
object at 0x26e7b50>, <Bio.SeqFeature.Reference object at 0x26e7bd0>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Now loading these records into a BioSQL database One.
/var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning:
the sets module is deprecated
  from sets import ImmutableSet
Creating a new database  One
========================================================================
Load from database One completed, records= 1
========================================================================
Here is the record recovered from database One:
Traceback (most recent call last):
  File "dbtestcase.py", line 165, in <module>
    from dbtestcase import AuthDetails
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 182, in
<module>
    DemonstrateProblem(problemgi,ad)
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 138, in
DemonstrateProblem
    print recordrecovered
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 489, in
__str__
    if self.letter_annotations :
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 165, in
<lambda>
    fget=lambda self : self._per_letter_annotations,
AttributeError: 'DBSeqRecord' object has no attribute '_per_letter_annotations'
dwyllie at dwyllie:~/programs/CheckleyProject/src$ 


Have I failed to install something?
Unfortunately, I wasn't running off CVS before your change.

Best wishes
d




From spenthil at gmail.com  Sat May 23 01:48:24 2009
From: spenthil at gmail.com (Senthil Palanisami)
Date: Fri, 22 May 2009 18:48:24 -0700
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> 
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> 
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> 
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
	<320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> 
	<21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> 
	<320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>
Message-ID: <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>

Sorry, I only recently joined this list - should have gone through the
archives first.

I have done some minimal SFF tweaking, but only by first converting them to
CA format.

No paired end reads yet, but I do know my PI wants me to start looking at
some in the next month or two.

--
Senthil Palanisami
http://spenthil.com


On Fri, May 22, 2009 at 5:10 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On 5/23/09, Senthil Palanisami <spenthil at gmail.com> wrote:
> > I have been working with SFF files for the past month, and can say it's
> >  definitely frustrating working with custom binary formats.
>
> At least in this case it is publicly documented. Have you needed to
> write out (or edit) an SFF file yet? Have you used any paired end
> reads in SFF format?
>
> >  Take a look at sff_extract which is written in python. It converts sff files
> >  into fasta and xml or caf files:
> >  http://bioinf.comav.upv.es/sff_extract/index.html
>
> That is what this code is based on - Jose Blanca is one of the authors
> of  sff_extract.
>
> >  You can find detailed specs of the format @
> >
> http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#header-global
>
> I think you must have missed this thread last month ;)
> http://lists.open-bio.org/pipermail/biopython/2009-April/005084.html
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Sat May 23 11:28:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 23 May 2009 12:28:36 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
	<200904171246.46568.jblanca@btc.upv.es>
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com>
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
	<320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com>
	<21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com>
	<320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com>
	<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
Message-ID: <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>

On Sat, May 23, 2009 at 2:48 AM, Senthil Palanisami <spenthil at gmail.com> wrote:
> Sorry, I only recently joined this list - should have gone through the
> archives first.

Don't worry - and if I sounded grumpy, sorry - I was up late last night.

> I have done some minimal SFF tweaking, but only by first converting them
> to CA format.

What do you mean by CA format? I don't recall seeing that abbreviation
before.

> No paired end reads yet, but I do know my PI wants me to start looking
> at some in the next month or two.

I haven't had any paired end 454 reads to work with personally, but I'm
sure there are some examples available online somewhere.

Peter


From bugzilla-daemon at portal.open-bio.org  Sat May 23 11:49:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 23 May 2009 07:49:18 -0400
Subject: [Biopython-dev] [Bug 2838] If a SeqRecord containing Genbank
	information is read from BioSQL,
	it cannot be written to another BioSQL database
In-Reply-To: <bug-2838-42@http.bugzilla.open-bio.org/>
Message-ID: <200905231149.n4NBnIEQ023192@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2838


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-23 07:49 EST -------
(In reply to comment #3)
> Thank you!
> 
> Unfortunately I'm not sure it's fixed, or maybe there is another problem:
> ...
> Now when I re-run dbtestcase.py (attached previously) I get a different error
> message.
> ...
> Traceback (most recent call last):
> ...
>   File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 489, in
> __str__
>     if self.letter_annotations :
>   File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 165, in
> <lambda>
>     fget=lambda self : self._per_letter_annotations,
> AttributeError: 'DBSeqRecord' object has no attribute '_per_letter_annotations'
> dwyllie at dwyllie:~/programs/CheckleyProject/src$ 
> 
> 
> Have I failed to install something?

No - everything looks OK, and the deprecation warnings are known about and not
in Biopython anyway.

> Unfortunately, I wasn't running off CVS before your change.

The original problem is fixed. However, you've found a new bug in the __str__
method for the DBSeqRecord related to the fact there is no
per-letter-annotation (this would have been introduced in Biopython 1.50 when I
added the letter_annotations dictionary to the SeqRecord class). I'm a little
surprised that our unit tests didn't catch this - but it's fixed now:

Tests/test_BioSQL.py CVS revision 1.37
BioSQL/BioSeq.py CVS revision 1.36

Note BioSQL doesn't yet support recording anything more complicated than
strings, although we've started talking about using XML or JSON for this. As a
result, Biopython does not attempt to record any per-letter-annotation in the
BioSQL database. With the fix the DBSeqRecord now has an empty
per-letter-annotation dictionary. Before it didn't, hence the AttributeError.
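
For anyone curious how this kind of error comes about, here is a small
self-contained sketch. The classes are simplified stand-ins written for this
example, not the real SeqRecord or DBSeqRecord code. It assumes the subclass
bypasses the parent constructor, so the private attribute behind the
property is never created:

```python
class Record(object):
    """Stand-in for a record class exposing a read-only property."""

    def __init__(self):
        self._per_letter_annotations = {}

    letter_annotations = property(
        fget=lambda self: self._per_letter_annotations)

    def __str__(self):
        if self.letter_annotations:
            return "record with per-letter annotation"
        return "record"


class BrokenDBRecord(Record):
    """Skips the parent __init__, so the private attribute is missing."""

    def __init__(self):
        pass


class FixedDBRecord(Record):
    """Sets up an empty per-letter annotation dictionary itself."""

    def __init__(self):
        self._per_letter_annotations = {}


print(str(FixedDBRecord()))
try:
    str(BrokenDBRecord())
except AttributeError as err:
    print("AttributeError: %s" % err)
```

The broken class only fails once something touches the property, which is
why printing the record was the first place the problem showed up.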

Hopefully you won't find any more issues, but if you do, please file another
bug - I'm marking this one as fixed.

Thanks for your report and time David,

Peter




From spenthil at gmail.com  Sat May 23 16:11:22 2009
From: spenthil at gmail.com (Senthil Palanisami)
Date: Sat, 23 May 2009 09:11:22 -0700
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<200904171246.46568.jblanca@btc.upv.es> 
	<320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> 
	<320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> 
	<20090522225432.GU84112@sobchak.mgh.harvard.edu>
	<320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> 
	<21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> 
	<320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> 
	<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> 
	<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
Message-ID: <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>

You didn't sound particularly grumpy, I am just aware of the annoyances
related to people too lazy to do a quick search through a mailing list
before spamming.

I pulled 'CA' straight out of a wgs assembler program:
http://apps.sourceforge.net/mediawiki/wgs-assembler/index.php?title=Formatting_Inputs#sffToCA
I think 'frg' is the real file format name.

--
Senthil Palanisami
http://spenthil.com


On Sat, May 23, 2009 at 4:28 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Sat, May 23, 2009 at 2:48 AM, Senthil Palanisami <spenthil at gmail.com>
> wrote:
> > Sorry, I only recently joined this list - should have gone through the
> > archives first.
>
> Don't worry - and if I sounded grumpy, sorry - I was up late last night.
>
> > I have done some minimal SFF tweaking, but only by first converting them
> > to CA format.
>
> What do you mean by CA format? I don't recall seeing that abbreviation
> before.
>
> > No paired end reads yet, but I do know my PI wants me to start looking
> > at some in the next month or two.
>
> I haven't had any paired end 454 reads to work with personally, but I'm
> sure there are some examples available online somewhere.
>
> Peter
>


From mjldehoon at yahoo.com  Sun May 24 04:10:28 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 23 May 2009 21:10:28 -0700 (PDT)
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
Message-ID: <867081.50034.qm@web62404.mail.re1.yahoo.com>


I suggest that for the short term, we store the DE lines as one string in the same way as Bioperl 1.5 and 1.6, until we decide on a more advanced way to treat these lines. Currently Bio.SeqIO and Bio.SwissProt use different ways to handle the DE lines, and neither of them agrees with Bioperl.

--Michiel.


--- On Mon, 5/18/09, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL
> To: "Hilmar Lapp" <hlapp at gmx.net>
> Cc: "Chris Fields" <cjfields at illinois.edu>, "BioPerl List" <bioperl-l at lists.open-bio.org>, "biosql-l" <biosql-l at lists.open-bio.org>, biopython-dev at biopython.org
> Date: Monday, May 18, 2009, 9:38 AM
> On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
> >
> > On May 17, 2009, at 8:40 AM, Peter wrote:
> >>
> >> [...] Here you have mapped RecName and AltName fields in the DE lines to
> >> Name and Synonyms (shouldn't that be Synonym singular?).
> >
> > The example is for the GN lines in SwissProt, not the DE lines.
> 
> Ah, that probably explains some of my confusion.
> 
> >> In this example, searching the database using one of the SwissProt
> >> AltNames (synonyms), or filtering on the Flags sounds like a
> >> reasonable request - but this would be very difficult if the data is
> >> stored inside XML strings.
> >
> > Actually no. Modern full-text indexers (inside or outside the database) can
> > index XML text columns right away and very well. In fact, for the last
> > project that I built a full-text search for (on top of a BioSQL database) I
> > did that by writing custom XML documents to a separate table for each
> > record I wanted indexed. Oracle's full text indexer did the rest. I also
> > built a separate identifier/name/accession index that pulled all the gene
> > names, symbols, accession numbers, identifiers etc into a single table for
> > indexing.
> 
> OK, when I said searching "would be very difficult if the data is
> stored inside XML strings", maybe it wasn't so difficult for you - but
> that still sounds complicated!
> 
> Sticking with the GN lines and the synonym, if this was stored as a
> simple tag/value as usual in BioSQL, I would write my SQL statement to
> search the annotation table where the term id was that associated with
> a GN synonym, and the annotation value was "HABP1". Simple.
> 
> Using the XML approach, are you suggesting you could do a full text
> search on the annotation value field, looking for any rows where the
> field contains "<Synonyms>HABP1</Synonyms>", where the term id matches
> the GN lines' XML string? This sounds simplistic and probably rather
> slow - presumably why you resorted to the more complicated indexing
> scheme described above?
> 
> > What I mean is, a fully normalized relational representation, especially if
> > nested, is often not the most efficient data structure for efficient
> > searching and filtering.
> 
> OK. But do we really need to worry about complex nested structures
> for the SwissProt annotation (or in general)?
> 
> Peter


From biopython at maubp.freeserve.co.uk  Sun May 24 10:42:14 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 24 May 2009 11:42:14 +0100
Subject: [Biopython-dev] SwissProt DE lines and bioentry.description
	field in BioSQL
In-Reply-To: <867081.50034.qm@web62404.mail.re1.yahoo.com>
References: <867081.50034.qm@web62404.mail.re1.yahoo.com>
Message-ID: <320fb6e00905240342t7d59f783t8203cce581256f88@mail.gmail.com>

On Sun, May 24, 2009 at 5:10 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> I suggest that for the short term, we store the DE lines as one
> string in the same way as Bioperl 1.5 and 1.6, until we decide
> on a more advanced way to treat these lines.

Agreed.

> Currently Bio.SeqIO and Bio.SwissProt use different ways to
> handle the DE lines, and neither of them agrees with Bioperl.

Well, Bio.SeqIO agrees with BioPerl modulo the white space -
but we might as well agree with the current BioPerl behaviour
until something is settled for storing more complex objects
than strings in BioSQL.

As I mentioned earlier, I'll be away for this week, so feel free
to press ahead with this.

Peter


From bugzilla-daemon at portal.open-bio.org  Mon May 25 18:21:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 14:21:26 -0400
Subject: [Biopython-dev] [Bug 2840] New: When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord fails in _load_reference
Message-ID: <bug-2840-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840

           Summary: When a record has been loaded from BioSQL, trying to
                    save it to another database fails with loader
                    db_loader.load_seqrecord fails in _load_reference
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: david.wyllie at ndm.ox.ac.uk


Hi

I have been trying to load SeqRecords from BioSQL, annotate them, and then
write them to a different BioSQL database.  Reloading the record to the second
database fails.  This isn't to do with annotation - none is performed.

This issue is different from #2838, which has been addressed (thank you).

The sequence of events is
1) eFetch a SeqRecord from Genbank (succeeds)
2) write to BioSQL (succeeds)
3) recover from BioSQL (succeeds)
4) write to BioSQL (fails, although no modifications have been made).

The current problem seems related to references:
Loader.load_seqrecord._load_reference.
Error says:

_load_reference
    start = 1 + int(str(reference.location[0].start))
ValueError: invalid literal for int() with base 10: 'None'
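
The error can be reproduced without a database. The sketch below is only an
illustration: FakeLocation is a hypothetical stand-in for the location object
attached to a reference, and one_based_start copies the failing expression
from the traceback.

```python
class FakeLocation:
    """Hypothetical stand-in for a reference location (not a Biopython class)."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


def one_based_start(location):
    # Same expression as the line reported in the traceback.
    return 1 + int(str(location[0].start))
```

With a real position, one_based_start([FakeLocation(0, 789)]) gives 1. With
FakeLocation(None, None), str() turns None into the text 'None', which int()
cannot parse, hence the ValueError.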

Testing has been done on Ubuntu 9 x64 with Python 2.6 (debian package),
python-dev (debian package), load from CVS as of 24.5.09, and a testcase
program, dbtestcase.py, attached to the now fixed bug #2838.

To run dbtestcase.py, the MySQL details will have to be altered on the line
beginning
ad=AuthDetails(...
but otherwise I think it should run.

Traceback and program output from dbtestcase.py follow.
dwyllie at dwyllie:~/programs/CheckleyProject/src$ python dbtestcase.py
OK, going to recover record 28804743  from genbank....
Record loaded looks like this:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference object at 0x2524a10>,
<Bio.SeqFeature.Reference object at 0x2524a90>, <Bio.SeqFeature.Reference
object at 0x2524b50>, <Bio.SeqFeature.Reference object at 0x2524bd0>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Load from Entrez completed, records= 1
Here is the loaded record:
========================================================================
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/sequence_version=1
/source=chloroplast Ceratodon purpureus
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Bryophyta', 'Moss Superclass V', 'Bryopsida', 'Dicranidae', 'Dicranales',
'Ditrichaceae', 'Ceratodon']
/keywords=['']
/references=[<Bio.SeqFeature.Reference object at 0x2524a10>,
<Bio.SeqFeature.Reference object at 0x2524a90>, <Bio.SeqFeature.Reference
object at 0x2524b50>, <Bio.SeqFeature.Reference object at 0x2524bd0>]
/accessions=['AB098727']
/data_file_division=PLN
/date=26-MAY-2005
/organism=Ceratodon purpureus
/gi=28804743
Seq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
IUPACAmbiguousDNA())
========================================================================
Now loading these records into a BioSQL database One.
/var/lib/python-support/python2.6/MySQLdb/__init__.py:34: DeprecationWarning:
the sets module is deprecated
  from sets import ImmutableSet
Creating a new database  One
========================================================================
Load from database One completed, records= 1
========================================================================
Here is the record recovered from database One:
ID: AB098727.1
Name: AB098727
Description: Ceratodon purpureus chloroplast rps11, petD genes for ribosomal
protein S11, cytochromoe b/f complex subunit IV, partial cds.
Number of features: 5
/dates=['26-MAY-2005']
/ncbi_taxid=3225
/date=['26-MAY-2005']
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Bryopsida',
'Dicranidae', 'Dicranales', 'Ditrichaceae', 'Ceratodon', 'Ceratodon purpureus']
/source=['chloroplast Ceratodon purpureus']
/references=[<Bio.SeqFeature.Reference object at 0x269e710>,
<Bio.SeqFeature.Reference object at 0x269e810>, <Bio.SeqFeature.Reference
object at 0x269e910>, <Bio.SeqFeature.Reference object at 0x269ea10>]
/gi=28804743
/data_file_division=PLN
/keywords=['']
/organism=Ceratodon purpureus
/sequence_version=['1']
/accessions=['AB098727']
DBSeq('AATTCGATTTTTTGTTCGTGATGTAACTCCTATGCCTCATAATGGGTGTAGACC...ATA',
DNAAlphabet())
========================================================================
Creating a new database  Two
Traceback (most recent call last):
  File "dbtestcase.py", line 165, in <module>
    from dbtestcase import AuthDetails
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 182, in
<module>
    DemonstrateProblem(problemgi,ad)
  File "/home/dwyllie/programs/CheckleyProject/src/dbtestcase.py", line 158, in
DemonstrateProblem
    db2.load(listtoload)
  File "/usr/local/lib/python2.6/dist-packages/BioSQL/BioSeqDatabase.py", line
442, in load
    db_loader.load_seqrecord(cur_record)
  File "/usr/local/lib/python2.6/dist-packages/BioSQL/Loader.py", line 57, in
load_seqrecord
    self._load_reference(reference, rank, bioentry_id)
  File "/usr/local/lib/python2.6/dist-packages/BioSQL/Loader.py", line 733, in
_load_reference
    start = 1 + int(str(reference.location[0].start))
ValueError: invalid literal for int() with base 10: 'None'
dwyllie at dwyllie:~/programs/CheckleyProject/src$


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Mon May 25 18:23:52 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 14:23:52 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905251823.n4PINq60005295@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


david.wyllie at ndm.ox.ac.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|When a record has been      |When a record has been
                   |loaded from BioSQL, trying  |loaded from BioSQL, trying
                   |to save it to another       |to save it to another
                   |database fails with loader  |database fails with loader
                   |db_loader.load_seqrecord    |db_loader.load_seqrecord in
                   |fails in _load_reference    |_load_reference




From bugzilla-daemon at portal.open-bio.org  Mon May 25 22:23:20 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 18:23:20 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905252223.n4PMNKL7023601@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


------- Comment #1 from david.wyllie at ndm.ox.ac.uk  2009-05-25 18:23 EST -------
I have modified the dbtestcase.py script to show the contents of the reference
of the record downloaded from genbank, and from the record recovered from
BioSQL.

Here is a print out of the last two references before saving to BioSQL:

authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M.
title: Molecular evidence of an rpoA gene in the basal moss chloroplast
genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses
journal: Hikobia 14, 171-175 (2004)
medline id: 
pubmed id: 
comment: 

location: [0:789]
authors: Sugita,M.
title: Direct Submission
journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for
Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan
(E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080),
Fax:81-52-789-3080)
medline id: 
pubmed id: 
comment: 

--- note: no location in the first one; only a location in the last reference
(why? - should references have a location?  I suppose they might, if they
referred to a part of a chromosome?)

Now, after saving to BioSQL and recovering, all the records have a location,
but in some cases, it is [None:None]; here are the same two records.

location: [None:None]
authors: Sugita,M., Sugiura,C., Arikawa,T. and Higuchi,M.
title: Molecular evidence of an rpoA gene in the basal moss chloroplast
genomes: rpoA is a useful molecular marker for phylogenetic analysis of mosses
journal: Hikobia 14, 171-175 (2004)
medline id: 
pubmed id: 
comment: 

location: [0:789]
authors: Sugita,M.
title: Direct Submission
journal: Submitted (25-DEC-2002) Mamoru Sugita, Nagoya University, Center for
Gene Research; Chikusa-ku, Nagoya, Aichi 464-8602, Japan
(E-mail:sugita at gene.nagoya-u.ac.jp, Tel:81-52-789-3080(ex.3080),
Fax:81-52-789-3080)
medline id: 
pubmed id: 
comment: 


After this, the db.load method calls _load_reference.  

I think the problem is that the last statement doesn't cope with None values.
If one edits _load_reference to put that last statement inside a test for the
null condition

        if start is not None and end is not None:
            sql = "INSERT INTO bioentry_reference (bioentry_id, reference_id," \
                  " start_pos, end_pos, rank)" \
                  " VALUES (%s, %s, %s, %s, %s)"
            self.adaptor.execute(sql, (bioentry_id, reference_id,
                                       start, end, rank + 1))

Then the problem is solved, but I'm not sure how this fits in the bigger scheme
of things.

d




From bugzilla-daemon at portal.open-bio.org  Mon May 25 22:26:21 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 18:26:21 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905252226.n4PMQK9o023893@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


------- Comment #2 from david.wyllie at ndm.ox.ac.uk  2009-05-25 18:26 EST -------
Created an attachment (id=1305)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1305&action=view)
A program which tests for the problem.  Alter the ad=AuthDetails line to
include MySQl passwords for your system; using root and no password in the
script as is.




From bugzilla-daemon at portal.open-bio.org  Tue May 26 00:14:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 May 2009 20:14:40 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905260014.n4Q0EeBh030704@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


------- Comment #3 from cymon.cox at gmail.com  2009-05-25 20:14 EST -------
(In reply to comment #1)

The BioSQL loader uses None for "start" and "end" if a reference doesn't have a
location. When the reference is retrieved the location remains set to
["None","None"]

Try this alteration to BioSeq.py, it should solve your problem:
cymon at gyra:~/git/github-master/BioSQL$ git diff BioSeq.py
diff --git a/BioSQL/BioSeq.py b/BioSQL/BioSeq.py
index cc47cf4..8d1e02a 100644
--- a/BioSQL/BioSeq.py
+++ b/BioSQL/BioSeq.py
@@ -351,8 +351,11 @@ def _retrieve_reference(adaptor, primary_id):
     references = []
     for start, end, location, title, authors, dbname, accession in refs:
         reference = SeqFeature.Reference()
-        if start: start -= 1
-        reference.location = [SeqFeature.FeatureLocation(start, end)]
+        if start:
+            start -= 1
+            reference.location = [SeqFeature.FeatureLocation(start, end)]
+        else:
+            reference.location = []
         #Don't replace the default "" with None.
         if authors : reference.authors = authors
         if title : reference.title = title
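
The intent of the change can be sketched outside Biopython. This is an
illustration only: plain (start, end) tuples stand in for
SeqFeature.FeatureLocation, and both function names are hypothetical. Stored
positions are 1-based, the in-memory start is 0-based, and a reference without
a location becomes an empty list rather than a location built from None.

```python
def location_from_row(start_pos, end_pos):
    """Turn a stored (start_pos, end_pos) pair into a location list.

    Returns [] when the row carries no start position.
    """
    if start_pos:
        return [(start_pos - 1, end_pos)]
    return []


def row_from_location(location):
    """Inverse step: what a loader would write back for this location."""
    if location:
        start, end = location[0]
        return (start + 1, end)
    return (None, None)
```

A reference with no location now survives the round trip as an empty list, so
a loader that checks for an empty location never reaches the int() call.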


Here's a patch for the unit test to compare the locations of injected and
retrieved records:
diff --git a/Tests/test_BioSQL_SeqIO.py b/Tests/test_BioSQL_SeqIO.py
index 2d8caf8..9479e02 100644
--- a/Tests/test_BioSQL_SeqIO.py
+++ b/Tests/test_BioSQL_SeqIO.py
@@ -360,6 +360,19 @@ def compare_records(old, new) :
             assert len(old.annotations[key]) == len(new.annotations[key])
             for old_r, new_r in zip(old.annotations[key], new.annotations[key]) :
                 compare_references(old_r, new_r)
+            for old_ref, new_ref in zip(old.annotations[key],
+                    new.annotations[key]):
+                if old_ref.location == []:
+                    assert new_ref.location == [], "old_reference.location %s !=" \
+                        "new_reference location %s" % (old_ref.location,
+                        new_ref.location)
+                else:
+                    assert old_ref.location[0].start == new_ref.location[0].start, \
+                    "old ref.location[0].start %s != new ref.location[0].start %s" % \
+                    (old_ref.location[0].start, new_ref.location[0].start)
+                    assert old_ref.location[0].end == new_ref.location[0].end, \
+                    "old ref.location[0].end %s != new ref.location[0].end %s" % \
+                    (old_ref.location[0].end, new_ref.location[0].end)
         elif key == "comment":
             if isinstance(old.annotations[key], list):
                 old_comment = [comm.replace("\n", " ") for comm in \

Cheers, Cymon




From bugzilla-daemon at portal.open-bio.org  Tue May 26 14:17:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 26 May 2009 10:17:48 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905261417.n4QEHmf9007821@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


------- Comment #4 from cymon.cox at gmail.com  2009-05-26 10:17 EST -------
(In reply to comment #3)
> (In reply to comment #1)

The functions in the old Tests/BioSQL_Seq.py have moved to
seq_tests_common.py, so I've updated seq_tests_common.py:

diff --git a/Tests/seq_tests_common.py b/Tests/seq_tests_common.py
index d3b7fb4..392a96c 100644
--- a/Tests/seq_tests_common.py
+++ b/Tests/seq_tests_common.py
@@ -40,10 +40,17 @@ def compare_references(old_r, new_r) :
     #allow us to store a consortium.
     assert new_r.consrtm == ""

-    #TODO - reference location?
-    #The parser seems to give a location object (i.e. which
-    #nucleotides from the file is the reference for), while the
-    #we seem to use the database to hold the journal details (!)
+    # Reference location
+    if old_r.location == []:
+        assert new_r.location == [], "old_r.location %s != " \
+            "new_r.location %s" % (old_r.location, new_r.location)
+    else:
+        assert old_r.location[0].start == new_r.location[0].start, \
+        "old_r.location[0].start %s != new_r.location[0].start %s" % \
+        (old_r.location[0].start, new_r.location[0].start)
+        assert old_r.location[0].end == new_r.location[0].end, \
+        "old_r.location[0].end %s != new_r.location[0].end %s" % \
+        (old_r.location[0].end, new_r.location[0].end)
     return True

Pushed to http://github.com/cymon/biopython-github-master/tree/bug2840
C.




From bugzilla-daemon at portal.open-bio.org  Tue May 26 17:32:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 26 May 2009 13:32:34 -0400
Subject: [Biopython-dev] [Bug 2841] New: SeqFeature constructor ignores
	qualifiers and sub_features arguments
Message-ID: <bug-2841-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2841

           Summary: SeqFeature constructor ignores qualifiers and
                    sub_features arguments
           Product: Biopython
           Version: 1.50
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: n.j.loman at bham.ac.uk


The constructor to Bio.SeqFeature.SeqFeature ignores qualifiers and
sub_features, although the prototype to the constructor allows these keyword
arguments to be specified.

I see in the code there is a reason for it to be ignored:
        # XXX right now sub_features and qualifiers cannot be set
        # from the initializer because this causes all kinds
        # of recursive import problems. I can't understand why this is
        # at all :-<
        self.qualifiers = {}
        self.sub_features = []

However, would it not be better to get rid of the keyword arguments from the
constructor prototype to stop people getting confused? I keep stumbling over
this problem myself and forgetting about it.
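
Until that is settled, the pitfall and a workaround can be sketched like
this. It is an illustration only: Feature is a hypothetical stand-in that
mimics the reported behaviour, not Bio.SeqFeature.SeqFeature itself, and
make_feature is a made-up helper name.

```python
class Feature:
    """Hypothetical stand-in that mimics the reported constructor."""

    def __init__(self, type="", qualifiers=None, sub_features=None):
        self.type = type
        # As in the report: the keyword arguments are accepted but ignored.
        self.qualifiers = {}
        self.sub_features = []


def make_feature(type, qualifiers=None, sub_features=None):
    """Workaround: build the feature first, then assign the attributes."""
    feature = Feature(type=type)
    feature.qualifiers = dict(qualifiers or {})
    feature.sub_features = list(sub_features or [])
    return feature
```

Calling Feature("gene", qualifiers={"gene": ["rps11"]}) silently yields empty
qualifiers, whereas make_feature("gene", {"gene": ["rps11"]}) keeps them.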




From bugzilla-daemon at portal.open-bio.org  Wed May 27 07:57:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 27 May 2009 03:57:05 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from
	BioSQL, trying to save it to another database fails with loader
	db_loader.load_seqrecord in _load_reference
In-Reply-To: <bug-2840-42@http.bugzilla.open-bio.org/>
Message-ID: <200905270757.n4R7v5iv004300@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840


------- Comment #5 from david.wyllie at ndm.ox.ac.uk  2009-05-27 03:57 EST -------
Thank you very much!

I haven't tested the unit tests but the patch in #3 resolves the problem.

With best wishes




From mjldehoon at yahoo.com  Sat May 30 09:37:35 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 30 May 2009 02:37:35 -0700 (PDT)
Subject: [Biopython-dev] More SwissProt inconsistencies
Message-ID: <880385.97797.qm@web62401.mail.re1.yahoo.com>


Looking some more at how Bio.SeqIO and Bio.SwissProt store the information in a SwissProt file, I found the following two inconsistencies:

1) A multi-line author list such as the following:
RA   Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,
RA   Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,
RA   Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,
RA   Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,
RA   Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,
RA   Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,
RA   Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,
RA   Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,
RA   Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,
RA   Barrell B.G., Hall N.;
is stored without newlines by Bio.SeqIO:
>>> seq_record.annotations['references'][0].authors
"Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,Barrell B.G., Hall N.;"
but with newlines by Bio.SwissProt:
>>> swiss_record.references[0].authors
"Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,\nKerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,\nCoulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,\nGardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,\nLarke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,\nNene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,\nRawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,\nSquares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,\nLangsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,\nBarrell B.G., Hall N.;"

To me, the Bio.SeqIO approach seems more reasonable. I think we should add a space though at places where there is a newline in the file.

The same happens for multiline RL such as

RL   (In) Baker M.J., Crush J.R., Humphreys L.R. (eds.);
RL   Proceedings of the XVII international grassland congress,
RL   pp.2:1033-1034, Dunmore Press, Palmerston North (1993).

and for multiline RT lines such as

RT   "Genome of the host-cell transforming parasite Theileria annulata
RT   compared with T. parva.";

This is stored by Bio.SeqIO as

'"Genome of the host-cell transforming parasite Theileria annulatacompared with T. parva.";'

and by Bio.SwissProt as

'"Genome of the host-cell transforming parasite Theileria annulata\ncompared with T. parva.";'

whereas I think that both should be stored as

'"Genome of the host-cell transforming parasite Theileria annulata compared with T. parva.";'
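
A minimal sketch of the joining rule proposed here, assuming each line starts
with a two-letter code followed by three spaces as in the examples above.
join_wrapped is a hypothetical helper, not the parser code in Bio.SeqIO or
Bio.SwissProt.

```python
def join_wrapped(lines, code):
    """Join the continuation lines for one line code with single spaces."""
    # Columns 0-1 hold the line code, data starts at column 5.
    parts = [line[5:].strip() for line in lines if line[:2] == code]
    return " ".join(parts)


rt_lines = [
    'RT   "Genome of the host-cell transforming parasite Theileria annulata',
    'RT   compared with T. parva.";',
]
title = join_wrapped(rt_lines, "RT")
```

The same call with "RA" or "RL" would give author lists and journal locations
with a space where the file had a newline.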


2) Comments in a reference such as the following:
RC   STRAIN=cv. VF36; TISSUE=Anther;
are stored as a single string by Bio.SeqIO:
>>> seq_record.annotations['references'][i].comment
'STRAIN=cv. VF36; TISSUE=Anther;'
but as a list of (key, value) pairs by Bio.SwissProt:
[('STRAIN', 'cv. VF36'), ('TISSUE', 'Anther')]
While I think both are reasonable, Bio.SeqIO drops the space between two (key, value) pairs if they are on two separate lines:
RC   STRAIN=C57BL/6J;
RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;
is stored as
>>> seq_record.annotations['references'][i].comment
'STRAIN=C57BL/6J;TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;'
I think we should add a space here, or just store these as (key, value) pairs as Bio.SwissProt is doing.
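
The (key, value) option could look roughly like this. It is a sketch only:
parse_rc is a hypothetical helper, not the Bio.SwissProt implementation, and
it assumes that values never contain a semicolon.

```python
def parse_rc(lines):
    """Parse RC lines into a list of (key, value) pairs."""
    # Join the lines with a space first, so pairs on separate lines
    # are handled the same way as pairs on one line.
    text = " ".join(line[5:].strip() for line in lines)
    pairs = []
    for token in text.split(";"):
        token = token.strip()
        if not token:
            continue  # skip the empty piece after the final semicolon
        key, _, value = token.partition("=")
        pairs.append((key, value))
    return pairs
```

For the two-line example above this gives a STRAIN pair and a TISSUE pair,
with the commas inside the TISSUE value left untouched.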

Any objections or comments?

--Michiel