From biopython at maubp.freeserve.co.uk  Tue Jun  1 05:05:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 10:05:43 +0100
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com>
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com>
Message-ID: <AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com>

2010/6/1 Eric Talevich:
> On Mon, May 31, 2010 at 11:53 AM, Peter wrote:
>
>> Under this proposed scheme, what would you see as the basic record type
>> (analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO
>> and Bio.Phylo)? It would be nice to say a protein chain, but there is the
>> issue of multiple models (e.g. from NMR). I presume you'd go with the
>> model as the basic unit (where each model may contain multiple chains).
>>
>
> I'd consider a structure to be the basic unit of I/O. If we're going to make
> better use of header info, that's generally associated with the whole
> structure and not individual models -- we'd have to duplicate the header
> info in each Model object emitted, which would be weird.
>
> Are there any formats that store more than one structure in a file? If not,
> then there's probably no need for a parse() function in Bio.Struct.

OK, yes - a whole structure as the unit would work, so we would
only need the read function (one file is one structure) and not the
parse function (no point in iterating over one thing).
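
A minimal sketch of that read-only interface, assuming a hypothetical
format registry. The names read, _PARSERS and _parse_pdb_atoms are
illustrative only and are not an existing Biopython API; the toy parser
just groups ATOM lines by model so that one file yields one structure:

```python
import io


def _parse_pdb_atoms(handle):
    """Toy parser (hypothetical): group ATOM/HETATM lines by model number."""
    models = {}
    current = 0  # files without MODEL records end up under model 0
    for line in handle:
        if line.startswith("MODEL"):
            current = int(line.split()[1])
        elif line.startswith(("ATOM", "HETATM")):
            models.setdefault(current, []).append(line.rstrip("\n"))
    return {"models": models}


# Hypothetical registry mapping a format name to a parser function.
_PARSERS = {"pdb": _parse_pdb_atoms}


def read(handle, format):
    """Return the single structure held in the file; there is no parse()."""
    try:
        parser = _PARSERS[format.lower()]
    except KeyError:
        raise ValueError("Unknown format %r" % format)
    return parser(handle)


example = io.StringIO(
    "MODEL        1\n"
    "ATOM      1  N   MET A   1      27.340  24.430   2.614\n"
    "ENDMDL\n"
    "MODEL        2\n"
    "ATOM      1  N   MET A   1      27.100  24.900   2.800\n"
    "ENDMDL\n"
)
structure = read(example, "pdb")
```

Both models live inside the one returned object, so header information
would only need to be stored once.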

>> > from Bio.Struct import WHATIF, Jpred
>> > # Servers each get their own module
>>
>> Hmm - perhaps we may need to have another level here, Bio.Struct.Servers
>> or Bio.Struct.WWW or something. How many of these do you expect?
>>
>
> João's project plan includes Dali and WHATIF:
> http://biopython.org/wiki/GSOC2010_Joao
>
> These servers do different things so I wouldn't expect any similarity in the
> code between them. There are lots of servers that we *could* support...
> Aesthetically, a Servers or WWW subdirectory would match
> Bio.Struct.Applications and make the whole package a little more
> self-documenting.

My thoughts exactly.

> Here's one more idea: Fetching a single PDB file from RCSB requires a
> separate import and a couple of calls. Should we make this even easier by
> mimicking the efetch function in Bio.Entrez, something like
>
>>>> handle = Bio.PDB.fetch("1MOT")
>
> or
>
>>>> from Bio.Struct.WWW import RCSB
>>>> handle = RCSB.fetch("1MOT", "pdb")
>
> ?
>

That seems nice.
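
A rough sketch of what such a helper could look like, using only the
standard library. The function names and the download URL layout are
assumptions made for illustration; nothing here is an agreed Bio.PDB or
Bio.Struct interface:

```python
from urllib.request import urlopen

# Assumed URL layout for RCSB downloads; adjust if the server differs.
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/%s.%s"


def build_url(pdb_id, format="pdb"):
    """Build the download URL for a four-character PDB identifier."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("Expected a four-character PDB ID, got %r" % pdb_id)
    return RCSB_DOWNLOAD_URL % (pdb_id, format.lower())


def fetch(pdb_id, format="pdb"):
    """Return an open handle, in the spirit of Bio.Entrez.efetch.

    The caller is expected to read from the handle and close it.
    """
    return urlopen(build_url(pdb_id, format))
```

Keeping the URL construction separate from the network call makes the
helper easy to test offline.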

Peter


From krother at rubor.de  Tue Jun  1 05:59:31 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 1 Jun 2010 11:59:31 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
Message-ID: <ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>

Hi,

Got some comments & questions.

> 2. PDB headers seem to have become better structured in recent years, in
> ... parse_pdb_header needs some attention as well.

I haven't looked into this code for years .. I think it might be a little
messy.


> 3. Kristian asked on this list awhile ago about the proper location for
> his new code that works with RNA structures. While RCSB's PDB contains
> some RNA structures, the RNA world doesn't revolve around it. Similarly,
> João needs a place to put code for structure prediction/validation
> servers, command-line wrappers, secondary structures, etc.
>
> I propose a new sub-package called Bio.Struct for these enhancements:
>
> from Bio.Struct import RNA
> # Would this work for you, Kristian?

Yes, it would be more descriptive than the originally proposed Bio.RNA . I
am just concerned whether I could keep the 2D structure-related modules in
the same package.

> Alternatively, we could do all of this within the PDB module -- so picture
> the above examples with "PDB" in place of "Struct". This raises the chance
> of naming collisions, though, and doesn't solve issue #3 above.

I like Bio.PDB.RNA less for the same reasons plus the 2D structure issue.


> We'll leave the existing PDB module layout alone, in general. I think it
> will be necessary to add a few more attributes to the
> Bio.PDB.Structure.Structure class, but we can do this without breaking
> compatibility.
>
> Comments?

What about the modules for constructing coordinates & Loop Closure
(currently available on my Github branch)? I placed them in Bio.PDB
because they are not limited to RNA and are conceptually similar to the
operations performed by Bio.PDB.NeighborSearch and Bio.PDB.SVDSuperimposer
- or would it be better to gather such things in some other package within
Bio.PDB.Struct?

Cheers,
     Kristian


From biopython at maubp.freeserve.co.uk  Tue Jun  1 07:42:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 12:42:53 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID: <AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>

2010/6/1 Kristian Rother <krother at rubor.de>:
>
>> 3. Kristian asked on this list awhile ago about the proper location for
>> his new code that works with RNA structures. While RCSB's PDB
>> contains some RNA structures, the RNA world doesn't revolve around
>> it. Similarly, João needs a place to put code for structure prediction/
>> validation servers, command-line wrappers, secondary structures, etc.
>>
>> I propose a new sub-package called Bio.Struct for these enhancements:
>>
>> from Bio.Struct import RNA
>> # Would this work for you, Kristian?
>
> Yes, it would be more descriptive than the originally proposed Bio.RNA . I
> am just concerned whether I could keep the 2D structure-related modules
> in the same package.

I don't necessarily see a problem with Bio.Struct or Bio.Structure covering
both 2D and 3D structures. Does this 2D stuff include file parsers? That
would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is better.

Peter


From biopython at maubp.freeserve.co.uk  Tue Jun  1 09:10:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 14:10:05 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
Message-ID: <AANLkTikfxQ87Cpqx466oHhYZF8fn7_bJtriZ-ocQ_2O2@mail.gmail.com>

On Mon, May 31, 2010 at 3:50 PM, Peter wrote:
> Hi all,
>
> With the new command line wrappers and the tutorial pushing
> users towards using subprocess we've had more queries
> about how to use it. The subprocess module itself is rather
> scary I guess, and things could be made a lot easier.
>
> I think the most typical use cases are:
>
> (1) Run the command, return the error code (integer)
> (2) Run the command, return stdout, stderr and error code
>
> In theory the function subprocess.call() would take care
> of the first example, but there is a cross platform annoyance
> here with the shell parameter. Also, if you want the output
> too things get even more tricky. It hasn't helped that there
> are a few platform specific quirks/bugs in subprocess itself
> (the different behaviour of the shell option on Windows,
> bug http://bugs.python.org/issue1124861 in old Pythons,
> the risk of deadlocks with large output files, etc).

In fact I've often found using os.system() much easier than
subprocess for the first use case - running a command and
getting the return code. I wondered about adding an example
of this to the tutorial but didn't find time before the last release
(even if the Python documentation does try and encourage
using subprocess instead).
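
For comparison, a minimal sketch of the first use case with subprocess,
assuming the command is passed as a list of arguments so that no shell
is involved. The helper name run_return_code is made up for this
example, and the child process is just the Python interpreter itself:

```python
import subprocess
import sys


def run_return_code(args):
    """Run a command given as a list of arguments; return its exit status."""
    # With a list and the default shell=False the arguments are passed
    # through without any shell quoting rules being applied.
    return subprocess.call(args)


# Use the running interpreter as a stand-in for an external tool.
child = [sys.executable, "-c", "import sys; sys.exit(3)"]
return_code = run_return_code(child)
```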

Peter

From chapmanb at 50mail.com  Tue Jun  1 09:23:55 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 1 Jun 2010 09:23:55 -0400
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
Message-ID: <20100601132355.GU1054@sobchak.mgh.harvard.edu>

Peter;

> With the new command line wrappers and the tutorial pushing
> users towards using subprocess we've had more queries
> about how to use it. The subprocess module itself is rather
> scary I guess, and things could be made a lot easier.
[...]
> We could instead make the wrapper objects callable (define
> the magic method __call__) to offer this kind of functionality.
> This seems quite elegant to me. 

This is a good idea, although I'm 50/50 on the __call__ idea.
Having a run() command or something similar might be more intuitive
than the more magical call, if the idea is to appeal to users who
find subprocess too problematic.

I'd suggest having an option to not capture stdout and stderr, which
would help users avoid those cases where a program spews a lot to
stdout and it's unwieldy to capture and stick it into a string.

Brad

From biopython at maubp.freeserve.co.uk  Tue Jun  1 09:48:30 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 14:48:30 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <1275332206.4c04066ed4ec5@webmail.upv.es>
References: <1275332206.4c04066ed4ec5@webmail.upv.es>
Message-ID: <AANLkTimdncn6t6mMTfW3o-1aijnhVemB9XEpJC6qHFbN@mail.gmail.com>

On Mon, May 31, 2010 at 7:56 PM, Blanca Postigo Jose Miguel
<jblanca at btc.upv.es> wrote:
> Quoting Michael Sandford <sandford at ufl.edu>:
>
>> I've got a few comments as well:
>> > 4) The current Blast record stores its information in attributes. If you
>> use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the
>> necessary DTDs to do so), the information is stored in dictionaries. This has
>> some advantages. For example, it allows you to use record.keys() to find out
>> what the record contains. Ideally, I think that a Blast Record class should
>> inherit from a dictionary.
>
> I've developed for my own use a dict structure that represents a blast result.
> This structure also can represent many other results, like exonerate, SSAHA or
> any other number of aligners. Having a common representation for all of them
> allows you to create common filters that work with the same interface. I don't
> know if it is very efficient, but it has proven to be very convenient for us.
> You can take a look at:
>
> http://github.com/JoseBlanca/franklin/blob/master/franklin/alignment_search_result.py
>
> Best regards,
>
> Jose Blanca

It has some similarities to what I was imagining for a BioPerl-SearchIO-like
module. I'm still not convinced that we should just be using (subclasses of)
dictionaries - I would rather have important core properties like the hit
co-ordinates held explicitly as properties or attributes (and always using
Python counting, not whatever a given file format uses, like one-based
locations in BLAST output).
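
To make the counting point concrete, here is a small sketch. The Hit
class and hit_from_one_based function are hypothetical names, and the
example only covers the simple case where the reported first position
is not larger than the last one:

```python
class Hit:
    """Hypothetical hit record with explicit coordinate attributes.

    start and end use Python counting: zero-based, end exclusive.
    """

    def __init__(self, name, start, end):
        self.name = name
        self.start = start
        self.end = end

    def __len__(self):
        return self.end - self.start


def hit_from_one_based(name, first, last):
    """Convert one-based inclusive positions (as in BLAST text output)."""
    if first < 1 or last < first:
        raise ValueError("Expected 1 <= first <= last")
    # Only the first position shifts; an inclusive last position is
    # numerically the same as an exclusive zero-based end.
    return Hit(name, first - 1, last)


hit = hit_from_one_based("subject1", 11, 20)
query = "ACGT" * 10
aligned = query[hit.start:hit.end]  # plain slicing works directly
```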

Peter

From krother at rubor.de  Tue Jun  1 10:11:51 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 1 Jun 2010 16:11:51 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
Message-ID: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>

Hi,

>>> from Bio.Struct import RNA
>>> # Would this work for you, Kristian?
>>
>> Yes, it would be more descriptive than the originally proposed Bio.RNA .
>> I
>> am just concerned whether I could keep the 2D structure-related modules
>> in the same package.
>
> I don't necessarily see a problem with Bio.Struct or Bio.Structure
> covering
> both 2D and 3D structures. Does this 2D stuff include file parsers? That
> would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is better.

Yes, currently, RNA contains 2D stuff. It would complicate Struct.read().
On the other hand, the 2D stuff is independent from the 3D modules - could
be split into two packages -- but I think keeping RNA is simpler.

Best Regards,
   Kristian


From biopython at maubp.freeserve.co.uk  Tue Jun  1 11:15:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 16:15:03 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <20100601132355.GU1054@sobchak.mgh.harvard.edu>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
Message-ID: <AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>

On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Peter;
>
>> With the new command line wrappers and the tutorial pushing
>> users towards using subprocess we've had more queries
>> about how to use it. The subprocess module itself is rather
>> scary I guess, and things could be made a lot easier.
> [...]
>> We could instead make the wrapper objects callable (define
>> the magic method __call__) to offer this kind of functionality.
>> This seems quite elegant to me.
>
> This is a good idea, although I'm 50/50 on the __call__ idea.
> Having a run() command or something similar might be more intuitive
> than the more magical call, if the idea is to appeal to users who
> find subprocess too problematic.

Fair point. We'd have to audit all the existing wrappers to make
sure we have some suitable names free (e.g run or execute).

> I'd suggest having an option to not capture stdout and stderr, which
> would help users avoid those cases where a program spews a lot to
> stdout and it's unwieldy to capture and stick it into a string.

We need to avoid any risk of deadlocks, so I guess the safe
implementation here would be to call subprocess with stdout and
stderr sent to dev null.
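
Something along these lines, as a sketch only. It assumes a modern
Python where subprocess.DEVNULL is available (older versions would open
os.devnull by hand), and run_quietly is a made-up helper name:

```python
import subprocess
import sys


def run_quietly(args):
    """Run a command, discard its output, and return the exit status."""
    # No pipes are created, so the child can never block on a full
    # pipe buffer while the parent waits for it to finish.
    return subprocess.call(args, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)


# A child that writes a lot to stdout, standing in for a chatty tool.
noisy = [sys.executable, "-c", "print('x' * 1000000)"]
return_code = run_quietly(noisy)
```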

Peter

From eric.talevich at gmail.com  Tue Jun  1 14:25:52 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 1 Jun 2010 14:25:52 -0400
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com> 
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID: <AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>

On Tue, Jun 1, 2010 at 10:11 AM, Kristian Rother <krother at rubor.de> wrote:

> Hi,
>
> >>> from Bio.Struct import RNA
> >>> # Would this work for you, Kristian?
> >>
> >> Yes, it would be more descriptive than the originally proposed Bio.RNA .
> >> I
> >> am just concerned whether I could keep the 2D structure-related modules
> >> in the same package.
> >
> > I don't necessarily see a problem with Bio.Struct or Bio.Structure
> > covering
> > both 2D and 3D structures. Does this 2D stuff include file parsers? That
> > would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is
> better.
>
> Yes, currently, RNA contains 2D stuff. It would complicate Struct.read().
> On the other hand, the 2D stuff is independent from the 3D modules - could
> be split into two packages -- but I think keeping RNA is simpler.
>
> Best Regards,
>    Kristian
>
>
I could be totally wrong here, but I think it's useful to lay out some
assumptions and intuitions explicitly.

To me, secondary structure is not really a separate dimension in its own
right, the way tertiary structure corresponds to 3D space and primary
structure corresponds to a linear sequence. Instead, secondary structure has
meaning in 3D space, but is usually serialized as a linear sequence. That
is, we want to parse something that resembles a sequence, but be able to map
it onto a 3D structure. (More for proteins than for RNA, usually.)

(For non-RNA folk, here's an example of RNA secondary structure:
http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna
)
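
As an illustration of the "sequence-like but maps onto structure" idea,
here is a small sketch that turns a dot-bracket string into base pairs.
It assumes the usual dot-bracket notation without pseudoknots; the
function name and the toy sequence are invented for this example:

```python
def parse_dot_bracket(structure):
    """Return a sorted list of (i, j) zero-based paired positions."""
    stack = []  # positions of '(' still waiting for a partner
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError("Unmatched ')' at position %i" % index)
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError("Unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("Unmatched '(' at position %i" % stack[-1])
    return sorted(pairs)


sequence = "GGGAAAUCCC"
structure = "(((....)))"
pairs = parse_dot_bracket(structure)
# Each pair indexes straight into the sequence string.
paired_letters = [(sequence[i], sequence[j]) for i, j in pairs]
```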

For instance, the output of DSSP and Jpred describes a protein's secondary
structure, but the input to DSSP is a 3D structure, while Jpred accepts a
protein sequence. The representation of secondary structure isn't distinct
from either of these. I'd want both of these available in Bio.Struct
(eventually).

This means that some interaction between Bio.Struct and SeqIO is necessary.
It would be neat if secondary structure regions were represented as
SeqFeature instances, and secondary-structure parsers returned some kind of
subclass of SeqRecord -- or a standard SeqRecord containing a special kind
of Seq.

The secondary-structure parsers for RNA and proteins should be separate,
too, since the annotated features are different. So the function
Bio.Struct.read() can apply exclusively to 3D structures. Would it be
reasonable for Bio.Struct.RNA.read() to apply exclusively to RNA secondary
structures -- assuming that anything that's not a secondary structure, 3D
structure, or nucleotide sequence is something special that belongs in its
own module?

As for protein secondary structure, it's usually associated with a sequence
or a structure, so maybe we could get by with storing that information in an
ordinary Structure or SeqRecord object without inventing a new subclass.

Best,
Eric

From jblanca at btc.upv.es  Wed Jun  2 02:21:36 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Wed, 2 Jun 2010 08:21:36 +0200
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
Message-ID: <201006020821.36486.jblanca@btc.upv.es>

On Tuesday 01 June 2010 17:15:03 Peter wrote:
> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> > Peter;
> >
> >> With the new command line wrappers and the tutorial pushing
> >> users towards using subprocess we've had more queries
> >> about how to use it. The subprocess module itself is rather
> >> scary I guess, and things could be made a lot easier.

We had the same need. We solved it with a call function. You can take a look 
at:

http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_utils.py

Regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From krother at rubor.de  Wed Jun  2 04:17:01 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 2 Jun 2010 10:17:01 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
Message-ID: <efd9344002b1ace781f63182344f0859-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxTXg5dXg==-webmailer2@server06.webmailer.hosteurope.de>

Hi,

>> >>> from Bio.Struct import RNA
..
>> > I don't necessarily see a problem with Bio.Struct or Bio.Structure
>> > covering both 2D and 3D structures.


Eric, I agree with you - the secondary structure of RNA maps nicely to 3D
space. Generally, I think it is a little more common to work with RNA 2D
structures in absence of 3D information than in proteins - 2D prediction
of RNA is maybe simply a less nasty target.


Eric wrote:

> I could be totally wrong here, but I think it's useful to lay out some
> assumptions and intuitions explicitly.
>
> To me, secondary structure is not really a separate dimension in its own
> right, the way tertiary structure corresponds to 3D space and primary
> structure corresponds to a linear sequence. Instead, secondary structure
> has
> meaning in 3D space, but is usually serialized as a linear sequence. That
> is, we want to parse something that resembles a sequence, but be able to
> map
> it onto a 3D structure. (More for proteins than for RNA, usually.)
>
> (For non-RNA folk, here's an example of RNA secondary structure:
> http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna
> )
>
> For instance, the output of DSSP and Jpred describes a protein's secondary
> structure, but the input to DSSP is a 3D structure, while Jpred accepts a
> protein sequence. The representation of secondary structure isn't distinct
> from either of these. I'd want both of these available in Bio.Struct
> (eventually).
>
> This means that some interaction between Bio.Struct and SeqIO is
> necessary.
> It would be neat if secondary structure regions were represented as
> SeqFeature instances, and secondary-structure parsers returned some kind
> of
> subclass of SeqRecord -- or a standard SeqRecord containing a special kind
> of Seq.

So far the Secstruc parsers I've implemented just return
(sequence,secstruc) tuples. But putting this into a SeqRecord makes sense
- I understand this fits the Biopython architecture better.

Maybe instead of a Seq or SeqRecord subclass we could use the decorator
pattern (decorating a class, not the Python decorator function syntax).
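
Roughly what I mean, as a sketch. SimpleRecord is only a stand-in for a
real record class and SecStrucRecord is a hypothetical wrapper name;
the wrapper adds a secstruc field and passes everything else through:

```python
class SimpleRecord:
    """Stand-in for a sequence record (not the real Bio.SeqRecord)."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


class SecStrucRecord:
    """Wrap a record and add a secstruc field of matching length."""

    def __init__(self, record, secstruc):
        if len(secstruc) != len(record.seq):
            raise ValueError("Sequence and secstruc differ in length")
        self._record = record
        self.secstruc = secstruc

    def __getattr__(self, name):
        # Only called when normal lookup fails, so anything the wrapper
        # does not define itself is delegated to the wrapped record.
        if name == "_record":
            raise AttributeError(name)
        return getattr(self._record, name)


wrapped = SecStrucRecord(SimpleRecord("tRNA_fragment", "GGGAAAUCCC"),
                         "(((....)))")
```

Existing code that only uses id and seq would keep working on the
wrapped object, without a new subclass.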


A potential problem that I'd like to point out early is that we are
working with modified RNA nucleotides a lot (up to 20% of residues in
every tRNA). This would require extending the RNA Alphabet (which is
currently just "AGCU") - but I see this as remote from the Bio.XXXX.read() thread.

> The secondary-structure parsers for RNA and proteins should be separate,
> too, since the annotated features are different. So the function
> Bio.Struct.read() can apply exclusively to 3D structures. Would it be
> reasonable for Bio.Struct.RNA.read() to apply exclusively to RNA secondary
> structures -- assuming that anything that's not a secondary structure, 3D
> structure, or nucleotide sequence is something special that belongs in its
> own module?

To summarize, we could use:

1) protein 3D structures:
   Bio.Struct.read() --> Bio.PDB.Structure

2) RNA 3D structures:
   Bio.Struct.read() --> Bio.PDB.Structure

3) RNA 2D structures:
   Bio.Struct.RNA.read() --> Bio.SeqRecord (extended/decorated by a
secstruc field)

4) protein 2D structures: uses special parser module??

5) plain sequences:
   Bio.SeqIO.read() --> Bio.SeqRecord


Eric, does this summarize your thoughts correctly?

This would work for me. Any comments from the others?

Best,
   Kristian


From biopython at maubp.freeserve.co.uk  Wed Jun  2 04:44:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 09:44:54 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <201006020821.36486.jblanca@btc.upv.es>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
	<201006020821.36486.jblanca@btc.upv.es>
Message-ID: <AANLkTilCHrgSCgkkqd0votf3qqIW97wzawk3pAC7ho7Z@mail.gmail.com>

On Wed, Jun 2, 2010 at 7:21 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> On Tuesday 01 June 2010 17:15:03 Peter wrote:
>> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> > Peter;
>> >
>> >> With the new command line wrappers and the tutorial pushing
>> >> users towards using subprocess we've had more queries
>> >> about how to use it. The subprocess module itself is rather
>> >> scary I guess, and things could be made a lot easier.
>
> We had the same need. We solved it with a call function. You can take
> a look at:
>
> http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_utils.py
>

It looks complicated (and I'm sure with good reason), but I'd guess
you've never tried this on Windows?

We used to have the Bio.Application.generic_run function for calling
a command - but making the command line wrapper callable or
having a method on the command line wrapper is much easier to
use (no extra import needed).
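
A stripped-down sketch of the two options side by side. CommandLine
here is a hypothetical stand-in, not the real Bio.Application base
class; it only shows that a named method and __call__ can share one
implementation:

```python
import subprocess
import sys


class CommandLine:
    """Hypothetical wrapper: holds an argument list and can run itself."""

    def __init__(self, program, *arguments):
        self.args = [program] + list(arguments)

    def __str__(self):
        return " ".join(self.args)

    def run(self):
        """Run the command and return its exit status."""
        return subprocess.call(self.args)

    # Making the object callable is then just an alias for run().
    __call__ = run


cmd = CommandLine(sys.executable, "-c", "pass")
return_code_call = cmd()      # callable style
return_code_run = cmd.run()   # named method style
```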

Peter

From biopython at maubp.freeserve.co.uk  Wed Jun  2 05:23:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 10:23:15 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
Message-ID: <AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>

On Tue, Jun 1, 2010 at 7:25 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
> I could be totally wrong here, but I think it's useful to lay out some
> assumptions and intuitions explicitly.
>
> To me, secondary structure is not really a separate dimension in its own
> right, the way tertiary structure corresponds to 3D space and primary
> structure corresponds to a linear sequence. Instead, secondary structure has
> meaning in 3D space, but is usually serialized as a linear sequence. That
> is, we want to parse something that resembles a sequence, but be able to map
> it onto a 3D structure. (More for proteins than for RNA, usually.)
>
> (For non-RNA folk, here's an example of RNA secondary structure:
> http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna
> )
>
> For instance, the output of DSSP and Jpred describes a protein's secondary
> structure, but the input to DSSP is a 3D structure, while Jpred accepts a
> protein sequence. The representation of secondary structure isn't distinct
> from either of these. I'd want both of these available in Bio.Struct
> (eventually).
>
> This means that some interaction between Bio.Struct and SeqIO is necessary.
> It would be neat if secondary structure regions were represented as
> SeqFeature instances, and secondary-structure parsers returned some kind of
> subclass of SeqRecord -- or a standard SeqRecord containing a special kind
> of Seq.
>
> ...
>
> As for protein secondary structure, it's usually associated with a sequence
> or a structure, so maybe we could get by with storing that information in an
> ordinary Structure or SeqRecord object without inventing a new subclass.

Maybe all/most secondary structure parsers can just go into Bio.SeqIO (for
proteins, RNA and DNA). We can store a secondary structure string as
per-letter-annotation, or things like helix regions as SeqFeature objects.
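
A small sketch of how the two representations relate, assuming a
one-character-per-residue string where "-" means no assigned state. The
function name and the state letters are illustrative; the tuples stand
in for what could become SeqFeature locations:

```python
from itertools import groupby


def regions(secstruc, gap="-"):
    """Yield (state, start, end) for each run, zero-based and half-open."""
    position = 0
    for state, run in groupby(secstruc):
        length = len(list(run))
        if state != gap:
            yield (state, position, position + length)
        position += length


# Per-letter annotation: one state character per residue.
per_letter = "--HHHH--EEE-"
features = list(regions(per_letter))
```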

Peter

From jblanca at btc.upv.es  Wed Jun  2 05:24:24 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Wed, 2 Jun 2010 11:24:24 +0200
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTilCHrgSCgkkqd0votf3qqIW97wzawk3pAC7ho7Z@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<201006020821.36486.jblanca@btc.upv.es>
	<AANLkTilCHrgSCgkkqd0votf3qqIW97wzawk3pAC7ho7Z@mail.gmail.com>
Message-ID: <201006021124.24499.jblanca@btc.upv.es>

On Wednesday 02 June 2010 10:44:54 Peter wrote:
> On Wed, Jun 2, 2010 at 7:21 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> > On Tuesday 01 June 2010 17:15:03 Peter wrote:
> >> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> >> > Peter;
> >> >
> >> >> With the new command line wrappers and the tutorial pushing
> >> >> users towards using subprocess we've had more queries
> >> >> about how to use it. The subprocess module itself is rather
> >> >> scary I guess, and things could be made a lot easier.
> >
> > We had the same need. We solved it with a call function. You can take
> > a look at:
> >
> > http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_utils.py
>
> It looks complicated (and I'm sure with good reason), but I'd guess
> you've never tried this on Windows?

Yes it is somewhat complicated. We need some functionalities like accepting 
stdout to be a file or just a pipe (some programs have very long stdouts). We 
have added everything we have required for our programs.

No, we haven't tested anything on Windows.

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk  Wed Jun  2 05:25:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 10:25:47 +0100
Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements
Message-ID: <AANLkTinC97We1P65J3Hjfp1e5NTjrBRG_ILda1W2MWO3@mail.gmail.com>

On Wed, Jun 2, 2010 at 9:17 AM, Kristian Rother <krother at rubor.de> wrote:
>
> A potential problem that I'd like to point out early is that we are
> working with modified RNA nucleotides a lot (up to 20% of residues in
> every tRNA). This would require extending the RNA Alphabet (which is
> currently just "AGCU") - but I see this as remote from the Bio.XXXX.read() thread.
>

What letters are you missing? There is a commented out ExtendedIUPACRNA
alphabet that may be relevant in Bio/Alphabet/IUPAC.py

Peter

From biopython at maubp.freeserve.co.uk  Wed Jun  2 07:36:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 12:36:46 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
Message-ID: <AANLkTilkILMvG5huqqZ2-rKMSQNqOAQGiirVO3rNBCwt@mail.gmail.com>

On Tue, Jun 1, 2010 at 4:15 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> I'd suggest having an option to not capture stdout and stderr, which
>> would help users avoid those cases where a program spews a lot to
>> stdout and it's unwieldy to capture and stick it into a string.
>
> We need to avoid any risk of deadlocks, so I guess the safe
> implementation here would be to call subprocess with stdout and
> stderr sent to dev null.

How does this look? Tested on Mac and Windows:
http://github.com/peterjc/biopython/tree/app-exec2

Example usage without capturing the output:

    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    return_code = water_cmd()
    print "Return code: %i" % return_code

Example usage with stdout and stderr capture:

    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    stdout, stderr, return_code = water_cmd(capture=True)
    print "Return code: %i" % return_code
    print "Tool output:\n%s" % stdout

Note in this implementation it either returns an integer error level
(the default) or a tuple of stdout, stderr and the error level return
code. If we opt for adding methods rather than using __call__
these could be different methods instead.

Another potentially useful option would be to copy the
subprocess.check_call() function in Python 2.5+ which verifies
the return code (error level) is zero and raises an exception if not
(probably only sensible if not capturing the output?). Maybe this
could even be the default behaviour?

[I would prefer to keep the interface as simple as possible though,
fewer options are better! KISS principle.]
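
As a rough illustration of the two modes (plus the check_call style
option), here is a minimal sketch written for Python 3. It is not the
code on the app-exec2 branch; run_command and its arguments are
hypothetical names used only for this example.

```python
import subprocess
import sys


def run_command(cmd, capture=False, check=False):
    """Run cmd (a list of arguments); hypothetical helper, not Biopython API.

    Returns the integer return code by default, or a tuple of
    (stdout, stderr, return code) when capture is True.
    """
    if capture:
        # communicate() reads both pipes to the end, which avoids the
        # deadlock you can get when a pipe buffer fills up.
        child = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE,
                                 universal_newlines=True)
        out, err = child.communicate()
        result = (out, err, child.returncode)
    else:
        # Output is discarded, so nothing can block on a full pipe.
        child = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                                 stderr=subprocess.DEVNULL)
        child.wait()
        result = child.returncode
    if check and child.returncode != 0:
        # Same idea as subprocess.check_call(): fail loudly on non-zero.
        raise subprocess.CalledProcessError(child.returncode, cmd)
    return result


# Usage: a child Python interpreter stands in for a real tool here.
quiet_code = run_command([sys.executable, "-c", "print('hello')"])
out, err, code = run_command([sys.executable, "-c", "print('hello')"],
                             capture=True)
```

Having one function return two different types is the awkward part,
which is the argument for separate methods mentioned above.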

Peter

From biopython at maubp.freeserve.co.uk  Wed Jun  2 07:59:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 12:59:46 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTilkILMvG5huqqZ2-rKMSQNqOAQGiirVO3rNBCwt@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
	<AANLkTilkILMvG5huqqZ2-rKMSQNqOAQGiirVO3rNBCwt@mail.gmail.com>
Message-ID: <AANLkTimZShDqgJ__BO8sqvkJl7DBsLXS2iz-0ATW0saa@mail.gmail.com>

On Wed, Jun 2, 2010 at 12:36 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Jun 1, 2010 at 4:15 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>>> I'd suggest having an option to not capture stdout and stderr, which
>>> would help users avoid those cases where a program spews a lot to
>>> stdout and it's unwieldy to capture and stick it into a string.
>>
>> We need to avoid any risk of deadlocks, so I guess the safe
>> implementation here would be to call subprocess with stdout and
>> stderr sent to dev null.
>
> How does this look? Tested on Mac and Windows:
> http://github.com/peterjc/biopython/tree/app-exec2
>
> Example usage without capturing the output:
>
>     from Bio.Emboss.Applications import WaterCommandline
>     water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                  asequence="a.fasta", bsequence="b.fasta")
>     print "About to run:\n%s" % water_cmd
>     return_code = water_cmd()
>     print "Return code: %i" % return_code
>
> Example usage with stdout and stderr capture:
>
>     from Bio.Emboss.Applications import WaterCommandline
>     water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                  asequence="a.fasta", bsequence="b.fasta")
>     print "About to run:\n%s" % water_cmd
>     stdout, stderr, return_code = water_cmd(capture=True)
>     print "Return code: %i" % return_code
>     print "Tool output:\n%s" % stdout
>
> Note in this implementation it either returns an integer error level
> (the default) or a tuple of stdout, stderr and the error level return
> code. If we opt for adding methods rather than using __call__
> these could be different methods instead.
>
> Another potentially useful option would be to copy the
> subprocess.check_call() function in Python 2.5+ which verifies
> the return code (error level) is zero and raises an exception if not
> (probably only sensible if not capturing the output?). Maybe this
> could even be the default behaviour?
>
> [I would prefer to keep the interface as simple as possible though,
> fewer options are better! KISS principle.]

With that in mind, as I mentioned yesterday maybe we should just
update the documentation to suggest using os.system() when you
just need the return code and there is no stdin to worry about:

    import os
    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    # os.system() needs a string, so convert the command line object first
    return_code = os.system(str(water_cmd))
    print "Return code: %i" % return_code

Even if the Python documentation seems to be discouraging it,
using os.system() seems simple, robust, and cross platform. We
could even update the tutorial now and post it online - it should
make some people's lives a little easier.

[Note this is actually a silly example, I should be telling water to
output to a file, not stdout which is then ignored.]
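
One caveat for anyone relying on the number that comes back: on Unix,
os.system() returns the wait status, with the exit code stored in the
high byte, whereas on Windows it returns the shell's exit code
directly. A minimal sketch of normalising this follows (written for
Python 3; exit_code_from_status is a hypothetical helper name, not
part of Biopython):

```python
import os


def exit_code_from_status(status):
    """Turn the value returned by os.system() into a plain exit code.

    Hypothetical helper for illustration only.
    """
    if os.name == "posix":
        if os.WIFEXITED(status):
            # Normal termination: pull the exit code out of the status.
            return os.WEXITSTATUS(status)
        # Killed by a signal or otherwise abnormal: report a failure.
        return -1
    # On Windows the value is already the shell's exit code.
    return status


# Usage: ask the shell to exit with code 3.
status = os.system("exit 3")
exit_code = exit_code_from_status(status)
```

For a simple zero versus non-zero check the raw value works on both
platforms, so this only matters when the exact error level is needed.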

Peter


From krother at rubor.de  Wed Jun  2 08:14:05 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 2 Jun 2010 14:14:05 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
Message-ID: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>


Hi Peter,

Bio.SeqIO would be a nice place for RNA 2D parsers. I can create a new
branch for that (on Git: krother/biopython).

Putting secondary structures like '((((....))))' for a hairpin into the
letter_annotation field makes sense. I think it would even work for
pseudoknotted RNA (which is hard to represent as a string; one possible
notation would be '(((..[[[....)))..]]]').
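
To make the notation concrete, here is a minimal sketch of turning
such a string into base pairs, written for Python 3. It keeps one
stack per bracket type, which is what lets the pseudoknot example
parse. parse_dot_bracket is a hypothetical name, and the choice of
'()' and '[]' as the only bracket types is an assumption taken from
the two strings above.

```python
def parse_dot_bracket(structure, brackets=(("(", ")"), ("[", "]"))):
    """Return sorted (open_index, close_index) pairs, counting from zero.

    Hypothetical helper for illustration; '.' marks an unpaired position.
    """
    closers = dict((close, open_) for open_, close in brackets)
    # One stack per bracket type, so '(' and '[' pairs may cross.
    stacks = dict((open_, []) for open_, close in brackets)
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(index)
        elif symbol in closers:
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError("Unmatched %r at position %i"
                                 % (symbol, index))
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError("Unexpected symbol %r at position %i"
                             % (symbol, index))
    for open_, stack in stacks.items():
        if stack:
            raise ValueError("Unmatched %r at position %i"
                             % (open_, stack[-1]))
    return sorted(pairs)


hairpin_pairs = parse_dot_bracket("((((....))))")
knot_pairs = parse_dot_bracket("(((..[[[....)))..]]]")
```

The string has one character per residue, which is why it fits the
per-letter annotation idea.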

Where should the str subclass for secondary structures that the parsers
create go? Could it be Bio.Struct.RNA?

Best,
   Kristian


>> As for protein secondary structure, it's usually associated with a
>> sequence
>> or a structure, so maybe we could get by with storing that information
>> in an
>> ordinary Structure or SeqRecord object without inventing a new subclass.
>
> Maybe all/most secondary structure parsers can just go into Bio.SeqIO (for
> both proteins, RNA and DNA). We can store a secondary structure string as
> per-letter-annotation, or things like helix regions as SeqFeature objects.
>
> Peter
>
>


From krother at rubor.de  Wed Jun  2 08:21:43 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 2 Jun 2010 14:21:43 +0200
Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements
In-Reply-To: <AANLkTinC97We1P65J3Hjfp1e5NTjrBRG_ILda1W2MWO3@mail.gmail.com>
References: <AANLkTinC97We1P65J3Hjfp1e5NTjrBRG_ILda1W2MWO3@mail.gmail.com>
Message-ID: <837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de>


Hi Peter,

I'm afraid the matter is more complicated. To date, we have 115 modified
RNA bases, which means in practice that you run out of nice ASCII
characters. Moreover, some people use one-letter symbols in RNA as
wildcards (R for purine, Y for pyrimidine). As a consequence, several sets
of abbreviations have been developed - see
http://modomics.genesilico.pl/modification_list to get an impression.

For our own purposes we've written a class that handles the different
nomenclatures, but I think it's incompatible with Bio.Alphabet - I'd like
to change that.

Best Regards,
   Kristian


> On Wed, Jun 2, 2010 at 9:17 AM, Kristian Rother <krother at rubor.de> wrote:
>>
>> A potential problem that I'd like to point out early is that we are
>> working with modified RNA nucleotides a lot (up to 20% of residues in
>> every tRNA). This would require extending the RNA Alphabet (which now
>> just
>> is "AGCU") - but I see this as remote from the Bio.XXXX.read() thread.
>>
> What letters are you missing? There is a commented-out ExtendedIUPACRNA
> alphabet that may be relevant in Bio/Alphabet/IUPAC.py
>
> Peter
>
>


From biopython at maubp.freeserve.co.uk  Wed Jun  2 09:22:36 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 14:22:36 +0100
Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements
In-Reply-To: <837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de>
References: <AANLkTinC97We1P65J3Hjfp1e5NTjrBRG_ILda1W2MWO3@mail.gmail.com>
	<837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID: <AANLkTikTZLYiEVOhrOe3gHxel5b_ijCCCF-HOb-X_tPT@mail.gmail.com>

On Wed, Jun 2, 2010 at 1:21 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
> I'm afraid the matter is more complicated. To date, we have 115 modified
> RNA bases, which means in practice that you run out of nice ASCII
> characters. Moreover, some people use one-letter symbols in RNA as
> wildcards (R for purine, Y for pyrimidine). As a consequence, several sets
> of abbreviations have been developed - see
> http://modomics.genesilico.pl/modification_list to get an impression.
>
> For our own purposes we've written a class that handles the different
> nomenclatures, but I think it's incompatible with Bio.Alphabet - I'd like
> to change that.
>
> Best Regards,
>    Kristian

Hmm. I wonder if the HTML entities would work nicely in Python
(as unicode)? That way you could have an unambiguous string
representation where each letter is one character long.
I'm thinking a Seq subclass (with a special alphabet) might be
the way to go here, allowing access to the single character
entities by default but also the longer codes as well.
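
A minimal sketch of that idea, written for Python 3 where str is
already unicode. Everything here is an assumption for illustration:
ModifiedRNA is a hypothetical class (a plain str subclass rather than
a real Seq subclass), and the small lookup table is made up for the
example, not taken from the Modomics list.

```python
# Illustrative table only: one unicode character per (modified) base.
LONG_NAMES = {
    "A": "adenosine",
    "C": "cytidine",
    "G": "guanosine",
    "U": "uridine",
    "\u03a8": "pseudouridine",   # Greek capital psi
    "D": "dihydrouridine",
}


class ModifiedRNA(str):
    """Hypothetical string type where each residue is one character."""

    def long_names(self):
        """Return the longer name for each residue, in order."""
        return [LONG_NAMES[letter] for letter in self]


fragment = ModifiedRNA("GC\u03a8D")
names = fragment.long_names()
```

Because every residue is exactly one character, len() and indexing
keep matching residue positions.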

There are similarities with modified peptide sequences where
there are clear three letter codes, but not one letter codes.

Tricky.

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun  2 09:24:49 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 14:24:49 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID: <AANLkTinJE4eBPO8ydgrunBzWZqeN5Nw4nyuo8GEaRTzr@mail.gmail.com>

On Wed, Jun 2, 2010 at 1:14 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
> Bio.SeqIO would be a nice place for RNA 2D parsers. I can create a new
> branch for that (on Git: krother/biopython).
>
> Putting secondary structures like '((((....))))' for a hairpin into the
> letter_annotation field makes sense. I think it would even work for
> pseudoknotted RNA (which is hard to represent as a string; one possible
> notation would be '(((..[[[....)))..]]]').
>
> Where should the str subclass for secondary structures that the parsers
> create go? Could it be Bio.Struct.RNA?
>
> Best,
>    Kristian

You don't think plain strings in the SeqRecord's letter_annotation
dict would be enough? Assuming you do need something then
perhaps under Bio.Seq or Bio.SeqUtils might be worth considering
as alternatives to Bio.Struct.RNA.

Peter


From eric.talevich at gmail.com  Thu Jun  3 12:17:09 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 3 Jun 2010 12:17:09 -0400
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com> 
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com> 
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com> 
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID: <AANLkTimUOe06zg8ksdoyLXtg6SkPZtTg-RdtAWSQ-cQi@mail.gmail.com>

On Wed, Jun 2, 2010 at 8:14 AM, Kristian Rother <krother at rubor.de> wrote:

>
> Putting secondary structures like '((((....))))' for a hairpin into the
> letter_annotation field makes sense. I think it would even work for
> pseudoknotted RNA (which is hard to represent as a string; one possible
> notation would be '(((..[[[....)))..]]]').
>
>
Here's another format that was designed to represent pseudoknots:
http://www.uga.edu/RNA-Informatics/files/software/RNApasta.help.html#Format

I'm not sure how standardized or widely used it is, but the program
RNA-pasta works with it:
http://www.uga.edu/RNA-Informatics/?f=software&p=RNApasta

-Eric

From biopython at maubp.freeserve.co.uk  Thu Jun  3 12:43:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Jun 2010 17:43:47 +0100
Subject: [Biopython-dev] More SeqRecord methods
In-Reply-To: <AANLkTikXgZH_CYjV4J_Ii8DtPWwtsN35SJ-Hc_gpq3B1@mail.gmail.com>
References: <AANLkTikXgZH_CYjV4J_Ii8DtPWwtsN35SJ-Hc_gpq3B1@mail.gmail.com>
Message-ID: <AANLkTikDZm4AjPG_gbZhBTy0i11PRT_FtDg_cmrvoqI0@mail.gmail.com>

On Mon, May 31, 2010 at 3:53 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> What do people think of adding upper and lower methods to the SeqRecord?
> http://bugzilla.open-bio.org/show_bug.cgi?id=3054

I checked that in with an example in the tutorial.

> If that is well received, how about adding another Seq method to the
> SeqRecord, the newish ungap method?
> http://bugzilla.open-bio.org/show_bug.cgi?id=3060

This one I would like some feedback on first. I'm sure the implementation
could be made much more efficient too.

Peter

From bugzilla-daemon at portal.open-bio.org  Thu Jun  3 12:45:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Jun 2010 12:45:16 -0400
Subject: [Biopython-dev] [Bug 3054] Add upper and lower methods to the
	SeqRecord
In-Reply-To: <bug-3054-42@http.bugzilla.open-bio.org/>
Message-ID: <201006031645.o53GjGd9019264@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3054


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-03 12:45 EST -------
Checked in:
http://github.com/biopython/biopython/tree/f4f11a9c4e7aca10c33cfe93c78d4972a0d736f8

With an example in the tutorial too:
http://github.com/biopython/biopython/commit/3de8bbd423010eb0b480b8966041f7c6d8e9890d

Marking this as fixed. See also:
http://lists.open-bio.org/pipermail/biopython-dev/2010-May/007772.html
http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007801.html


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk  Thu Jun  3 13:24:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Jun 2010 18:24:43 +0100
Subject: [Biopython-dev] More SeqRecord methods
In-Reply-To: <AANLkTikDZm4AjPG_gbZhBTy0i11PRT_FtDg_cmrvoqI0@mail.gmail.com>
References: <AANLkTikXgZH_CYjV4J_Ii8DtPWwtsN35SJ-Hc_gpq3B1@mail.gmail.com>
	<AANLkTikDZm4AjPG_gbZhBTy0i11PRT_FtDg_cmrvoqI0@mail.gmail.com>
Message-ID: <AANLkTinvkmF3ZUfO7z7n0HXMxj6W4b_kG9as-s8zDMlq@mail.gmail.com>

On Thu, Jun 3, 2010 at 5:43 PM, Peter wrote:
> On Mon, May 31, 2010 at 3:53 PM, Peter wrote:
>
>> ..., how about adding another Seq method to the
>> SeqRecord, the newish ungap method?
>> http://bugzilla.open-bio.org/show_bug.cgi?id=3060
>
> This one I would like some feedback on first. I'm sure the
> implementation could be made much more efficient too.

Maybe I should mention that I also envisage a similar method for the
alignment object, to give a new alignment with any all-gap-columns
removed (perhaps with an optional argument to specify a threshold
for the number of gaps required - defaulting to only removing columns
which are all gaps).

Again, the simplest way to implement this is to re-use the new
alignment slicing and addition features - much as how I did it
for the proposed SeqRecord ungap method.
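
To show the intended behaviour, here is a minimal sketch written for
Python 3 that works on plain strings rather than the alignment object,
so it does not use the slicing and addition approach described above.
drop_gap_columns and its arguments are hypothetical names.

```python
def drop_gap_columns(rows, gap="-", threshold=None):
    """Return the rows with gap-heavy columns removed.

    Hypothetical helper for illustration. A column is removed when it
    holds at least `threshold` gap characters; by default the threshold
    is the number of rows, so only all-gap columns are removed.
    """
    if not rows:
        return []
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("All rows must have the same length")
    if threshold is None:
        threshold = len(rows)
    keep = [column for column in range(length)
            if sum(row[column] == gap for row in rows) < threshold]
    return ["".join(row[column] for column in keep) for row in rows]


alignment = ["AC-GT-",
             "A--GT-",
             "AC-G--"]
all_gap_removed = drop_gap_columns(alignment)
any_gap_removed = drop_gap_columns(alignment, threshold=1)
```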

Peter

From eric.talevich at gmail.com  Thu Jun  3 15:10:51 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 3 Jun 2010 15:10:51 -0400
Subject: [Biopython-dev] Fixup branch for Bio.PDB
Message-ID: <AANLkTikl98AiYImXrrYZBrWW8tKu1ZV5LM5jnLNhdBEX@mail.gmail.com>

Hi all,

I've poked around Bugzilla, taken patches for some outstanding bugs, and
applied them to a branch on GitHub:
http://github.com/etal/biopython/tree/pdbfixes
http://github.com/etal/biopython/commits/pdbfixes

I'd like to encourage people to test this branch with their own code, and if
it all still works (or nobody's interested in testing this branch), I'll
push it to the Biopython trunk so it gets tested more. Time frame: if this
branch lingers too long, there's a high chance it will cause conflicts for
João (our GSoC student) the next time he merges. How about a week?

The branch has patches for bugs 2820, 2948, 2879, 2950 and 2951:
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
http://bugzilla.open-bio.org/show_bug.cgi?id=2948
http://bugzilla.open-bio.org/show_bug.cgi?id=2879
http://bugzilla.open-bio.org/show_bug.cgi?id=2950
http://bugzilla.open-bio.org/show_bug.cgi?id=2951

Thanks,
Eric


From biopython at maubp.freeserve.co.uk  Fri Jun  4 04:44:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 4 Jun 2010 09:44:19 +0100
Subject: [Biopython-dev] Fixup branch for Bio.PDB
In-Reply-To: <AANLkTikl98AiYImXrrYZBrWW8tKu1ZV5LM5jnLNhdBEX@mail.gmail.com>
References: <AANLkTikl98AiYImXrrYZBrWW8tKu1ZV5LM5jnLNhdBEX@mail.gmail.com>
Message-ID: <AANLkTin3Yr8UX4PmksYXSPXD-0OB_D_H-mJ4d4lWSOP2@mail.gmail.com>

On Thu, Jun 3, 2010 at 8:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> Hi all,
>
> I've poked around Bugzilla, taken patches for some outstanding bugs, and
> applied them to a branch on GitHub:
> http://github.com/etal/biopython/tree/pdbfixes
> http://github.com/etal/biopython/commits/pdbfixes
>
> I'd like to encourage people to test this branch with their own code, and if
> it all still works (or nobody's interested in testing this branch), I'll
> push it to the Biopython trunk so it gets tested more. Time frame: if this
> branch lingers too long, there's a high chance it will cause conflicts for
> João (our GSoC student) the next time he merges. How about a week?
>
> The branch has patches for bugs 2820, 2948, 2879, 2950 and 2951:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2820
> http://bugzilla.open-bio.org/show_bug.cgi?id=2948
> http://bugzilla.open-bio.org/show_bug.cgi?id=2879
> http://bugzilla.open-bio.org/show_bug.cgi?id=2950
> http://bugzilla.open-bio.org/show_bug.cgi?id=2951
>
> Thanks,
> Eric

That sounds like a good plan.

Peter


From mjldehoon at yahoo.com  Fri Jun  4 11:55:27 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 4 Jun 2010 08:55:27 -0700 (PDT)
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID: <933074.46322.qm@web62405.mail.re1.yahoo.com>

Michael, Peter, Sebastian, Laurent, Jose, and others,

Thanks for your comments. It looks like there are lots of things to discuss, so let's start with the easiest ones.

About converting a record to a string (point 5): I agree that using __str__ is probably not the best choice, so let's use __format__ instead, or add a "write" method. The added advantage of these is that we can print out a record in different formats (xml, text, table) by specifying the requested format as an argument.

For point 3), maybe my wording was confusing; actually what I had in mind is the case where a given Blast program can produce different output formats (xml, text, table, etc.). This was inspired by this bug report:
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
In my mind, the different output formats are just different intermediates, but in essence they are the same and should therefore be stored in the same class. So, if I run blastp, save the result as XML, and parse it, I'd expect the same class as when I run blastp and save and parse the output in table format. Just in the latter case, some information may be missing if it is not available in the output in table format. Does that sound acceptable?
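
To make this concrete, here is a minimal sketch of a dictionary-based
record that supports __format__, written for Python 3. The class name
BlastRecord, the keys ("query", "hits", "id", "evalue") and the format
names ("text", "table") are all assumptions for illustration, not an
existing Biopython API.

```python
class BlastRecord(dict):
    """Hypothetical record: a dict that knows how to format itself."""

    def __format__(self, format_spec):
        query = self.get("query", "?")
        hits = self.get("hits", [])
        if format_spec in ("", "text"):
            lines = ["Query: %s" % query]
            for hit in hits:
                lines.append("  %s  e-value=%g" % (hit["id"], hit["evalue"]))
            return "\n".join(lines)
        if format_spec == "table":
            # One tab-separated line per hit.
            return "\n".join("%s\t%s\t%g" % (query, hit["id"], hit["evalue"])
                             for hit in hits)
        raise ValueError("Unknown format %r" % format_spec)


# The same record class would be filled in by any of the parsers;
# a parser for a sparser format would simply leave some keys out.
record = BlastRecord(query="query1",
                     hits=[{"id": "subject1", "evalue": 1e-30}])
as_table = format(record, "table")
as_text = format(record, "text")
```

Since the record is a dictionary, record.keys() shows what a given
parser was able to fill in.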

--Michiel.

 
--- On Fri, 5/28/10, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> From: Michiel de Hoon <mjldehoon at yahoo.com>
> Subject: [Biopython-dev] Blast parsers and records
> To: biopython-dev at biopython.org
> Date: Friday, May 28, 2010, 11:23 PM
> Hi everybody,
> 
> With Biopython 1.54 out (thanks Peter!), and NCBI
> encouraging people to use its new Blast+ suite of Blast programs,
> maybe this is a good time to tackle some older bugs related
> to Blast output parsing in Biopython:
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176
> (inconsistencies in the output of different Blast parsers)
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=2929
> (inconsistencies between Psi-blast parsers)
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=2319
> (parsing Blast table output)
> 
> and more generally think about the design of the Blast
> record class and Blast parsing. In my opinion, these are the
> major issues:
> 
> 1) Blast parsers are located in several modules
> (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone,
> Bio.Blast.ParseBlastTable). I think we should have one
> read() function and one parse() function under Bio.Blast,
> with arguments specifying which format the Blast output is
> in.
> 
> 2) Blast records produced by any of the parsers should be
> consistent with each other. As XML output by blast and
> psi-blast follow the same DTD, we should be able to
> represent both by a single Record class.
> 
> 3) Different parsers should store information in this
> Record class in the same way.
> 
> 4) The current Blast record stores its information in
> attributes. If you use Bio.Entrez to parse Blast XML output
> (Biopython 1.54 contains the necessary DTDs to do so), the
> information is stored in dictionaries. This has some
> advantages. For example, it allows you to use record.keys()
> to find out what the record contains. Ideally, I think that
> a Blast Record class should inherit from a dictionary.
> 
> 5) We should be able to print a Blast record object to
> generate output that is close to the plain-text output
> generated by blast. This would allow us to generate and
> store Blast output as XML, and to convert the output to
> plain-text to make it more human-readable.
> 
> 6) The current Blast record inherits from
> Bio.Blast.Record.Header, Bio.Blast.Record.DatabaseReport,
> and Bio.Blast.Record.Parameters. I don't see the rationale
> for this inheritance, and I think we should remove it.
> 
> Any comments or suggestions (in particular about my proposal
> to have a Blast Record class that inherits from a
> dictionary)? Btw, to avoid breaking scripts, I propose that
> any changes to the Blast record and parser are implemented
> separately from the existing parsers and record, leaving
> those untouched.
> 
> --Michiel.
> 
> 


From biopython at maubp.freeserve.co.uk  Sat Jun  5 10:49:39 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 5 Jun 2010 15:49:39 +0100
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris
Message-ID: <AANLkTimJSOqEgUHokfs3P-6MNS5yKxl4_CNB5f8-X0AR@mail.gmail.com>

Hi all,

Are any Biopython folk planning to be at the EuroSciPy
conference in Paris this year (July 2010)? They are still
finalising the Scientific track, but the list of tutorials is
quite interesting already:

http://www.euroscipy.org/conference/euroscipy2010

Peter

From biopython at maubp.freeserve.co.uk  Mon Jun  7 05:35:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 10:35:15 +0100
Subject: [Biopython-dev] Working directly on the main git repository
Message-ID: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>

Hi all,

I thought I'd write down some notes about how I've been using git recently.
This may be of interest to any of the other core developers (those of us
with read-write access to the main repository), and I might get some good
tips from any discussion. The key point is that I have read+write access
to two repositories on github (the official repository AND my own fork),
so there are different advantages/disadvantages about which I choose
to work with directly as my main repository.

Our official repository has just a single stable master branch, and I
often need to work directly with this (e.g. committing small bug fixes
or adding more documentation). So if I set up a clone of the official
repository, I can work on the master branch very easily.

Now, when working on a branch for new features, I could just do this
locally, and when they are ready, merge them direct to the master.
However, this means others cannot look at my work (and I find it a
problem when working on multiple machines).

Alternatively, I could push the branches to the public "master" repository.
This would be the simplest option BUT the high visibility gives any such
experimental branch disproportionate status. I think this would be
a good idea for important (multi-person) efforts, like Python 3 work.

Instead, I have a github repository of my own (what github calls a
fork), and I push branches there.

http://github.com/biopython/biopython - the official branch(es)
http://github.com/peterjc/biopython - my branches

How does this work in practice? Like this - I clone the master
and add a reference to my repository (and I do the same when I
want to grab a branch from another developer):

git clone git at github.com:biopython/biopython.git
cd biopython
git remote add peterjc git at github.com:peterjc/biopython.git
git fetch peterjc

Then make a new local branch as usual, and when ready to share
it publicly, I push it to *my* repository on github:

git branch new-work
git checkout new-work
git commit ...
git push peterjc new-work

This would then appear as a new-work branch on my github page.
Then if I (or someone else) want to access these branches later
(e.g. from another machine), we can just check out a tracked remote
branch. For example,

git clone git at github.com:biopython/biopython.git
cd biopython
git remote add peterjc git at github.com:peterjc/biopython.git
git fetch peterjc
git checkout -t peterjc/seqio-imgt

This then looks like a normal branch (called just "seqio-imgt" in
this example), but git knows it is linked to the remote branch on
the "peterjc" repository (not the origin which is the "official"
repository).

I'd have to check, but I guess that if the original git clone is done
with git://github.com/biopython/biopython.git instead (read only
access) the same procedure could be used by non core devs.
However, I'm not sure this is clearer for them. I think the current
procedure (on our wiki) where you add a remote reference to
the "upstream" official repository works better in this case.

Comments?

Peter

Useful links from Google searches:
http://www.gitready.com/intermediate/2009/01/09/checkout-remote-tracked-branch.html
http://www.gitready.com/beginner/2009/03/09/remote-tracking-branches.html

From biopython at maubp.freeserve.co.uk  Mon Jun  7 09:40:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 14:40:54 +0100
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris
In-Reply-To: <AANLkTimJSOqEgUHokfs3P-6MNS5yKxl4_CNB5f8-X0AR@mail.gmail.com>
References: <AANLkTimJSOqEgUHokfs3P-6MNS5yKxl4_CNB5f8-X0AR@mail.gmail.com>
Message-ID: <AANLkTilToD0eVKZ8ZryPhJhyqVi-leNZkD8mnKcivDL2@mail.gmail.com>

On Sat, Jun 5, 2010 at 3:49 PM, Peter wrote:
> Hi all,
>
> Are any Biopython folk planning to be at the EuroSciPy
> conference in Paris this year (July 2010)? They are still
> finalising the Scientific track, but the list of tutorials is
> quite interesting already:
>
> http://www.euroscipy.org/conference/euroscipy2010
>
> Peter

Hi all,

The track list for the EuroSciPy 2010 Scientific track has
now been announced, and I'm delighted that I will be able
to present a talk on Biopython (likely 4pm Saturday 10 July).
While I hope there will be some other Biopython users there,
this is a nice opportunity to meet the broader scientific python
community. There are still places at the moment if you want
to attend:

http://www.euroscipy.org/conference/euroscipy2010

Unfortunately I will not be attending BOSC or ISMB this
year. However Brad Chapman will be there to present the
annual "Biopython Project Update" talk (as well as helping
to organise this year's BOSC and the associated CodeFest
event preceding it). I'd love to have been there too, but I'm
sure everyone attending will have a great time. Again,
registration is still open:

http://www.open-bio.org/wiki/BOSC_2010
http://www.open-bio.org/wiki/Codefest_2010

Regards,

Peter

P.S. Those of you in North America might also be
interested in the main SciPy conference in Austin, Texas
(28 June to 3 July 2010):

http://conference.scipy.org/scipy2010/

From biopython at maubp.freeserve.co.uk  Mon Jun  7 09:50:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 14:50:06 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <933074.46322.qm@web62405.mail.re1.yahoo.com>
References: <901919.44402.qm@web62402.mail.re1.yahoo.com>
	<933074.46322.qm@web62405.mail.re1.yahoo.com>
Message-ID: <AANLkTikenpBKpMWK7pPSk6RYhCW5mvb5LXjxpFilopvD@mail.gmail.com>

On Fri, Jun 4, 2010 at 4:55 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Michael, Peter, Sebastian, Laurent, Jose, and others,
>
> Thanks for your comments. It looks like there are lots of things to discuss,
> so let's start with the easiest ones.
>
> About converting a record to a string (point 5): I agree that using __str__ is
> probably not the best choice, so let's use __format__ instead, or add a "write"
> method. The added advantage of these is that we can print out a record in
> different formats (xml, text, table) by specifying the requested format as an argument.

The __format__ or format method sounds like a great idea (following other
bits of Biopython).
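
As a rough illustration of the __format__ idea, here is a minimal sketch.
The class name, its attributes and the format names ("text", "table") are
all made up for this example and are not an agreed Biopython API:

```python
class BlastRecord(object):
    """Hypothetical record class, used only to illustrate __format__."""

    def __init__(self, query, hits):
        self.query = query
        self.hits = hits  # list of (subject_id, e_value) tuples

    def __format__(self, format_spec):
        # An empty spec is what format(record) passes in by default.
        if format_spec in ("", "text"):
            lines = ["Query: %s" % self.query]
            lines.extend("  %s  E=%g" % hit for hit in self.hits)
            return "\n".join(lines)
        if format_spec == "table":
            return "\n".join("%s\t%s\t%g" % (self.query, subject, e_value)
                             for subject, e_value in self.hits)
        raise ValueError("Unknown format %r" % format_spec)


record = BlastRecord("query1", [("subjectA", 1e-30), ("subjectB", 0.002)])
print(format(record, "table"))
print("{0:text}".format(record))
```

The same object can then be rendered in several ways by passing the
requested format name, without overloading __str__.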

> For point 3), maybe my wording was confusing; actually what I had in mind
> is the case where a given Blast program can produce different output formats
> (xml, text, table, etc.). This was inspired by this bug report:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176
> In my mind, the different output formats are just different intermediates, but
> in essence they are the same and should therefore be stored in the same
> class. So, if I run blastp, save the result as XML, and parse it, I'd expect the
> same class as when I run blastp and save and parse the output in table format.
> Just in the latter case, some information may be missing if it is not available in
> the output in table format. Does that sound acceptable?

I agree that records from all the different BLAST output formats should be
represented by a common base class - but not necessarily the same class.
For example, the default plain text and XML formats include the pairwise
alignments, but the tabular output does not. To me having a sub-class which
stores the pairwise alignments seems natural here.

Peter

From biopython at maubp.freeserve.co.uk  Mon Jun  7 13:45:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 18:45:57 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
Message-ID: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>

Hi all,

Thanks for the lively discussion on the main list,

http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html
...
http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html

I've spent the afternoon updating my old branch which uses SQLite
to store the record identifier to file offset mapping. Using the code
on this branch, Bio.SeqIO.index() supports a new optional argument
currently called "db" (other names I like include "cache", suggestions
welcome):

http://github.com/peterjc/biopython/tree/index-sqlite

The default (False) is not to use SQLite, but continue with an in
memory Python dictionary. As long as you have enough RAM
and don't plan to use the index at a later date, this will be fastest.

If set to True or a filename, then an SQLite index is used to hold
the offsets. This means very low RAM requirements, but is a lot
slower because the offsets are written to disk and the SQLite
index is updated as we go. I expect this part can be optimised
(e.g. try to build the index at the end, try committing in batches).
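
The two optimisations mentioned above (committing in batches, building the
index at the end) could look roughly like this. This is only a sketch using
the standard sqlite3 module; the table, column and function names are
assumptions for the example and not the schema used on the branch:

```python
import sqlite3


def build_offset_db(offsets, filename=":memory:", batch_size=100000):
    """Store (record_id, file_offset) pairs in SQLite (hypothetical schema)."""
    con = sqlite3.connect(filename)
    con.execute("CREATE TABLE offset_data "
                "(record_id TEXT, file_offset INTEGER)")
    batch = []
    for record_id, file_offset in offsets:
        batch.append((record_id, file_offset))
        if len(batch) >= batch_size:
            # One transaction per batch rather than one per record.
            con.executemany("INSERT INTO offset_data VALUES (?, ?)", batch)
            con.commit()
            batch = []
    if batch:
        con.executemany("INSERT INTO offset_data VALUES (?, ?)", batch)
        con.commit()
    # Build the index once, after all the rows are in place.
    con.execute("CREATE UNIQUE INDEX idx_record_id "
                "ON offset_data (record_id)")
    con.commit()
    return con


def get_offset(con, record_id):
    """Look up the file offset for one record identifier."""
    row = con.execute("SELECT file_offset FROM offset_data "
                      "WHERE record_id = ?", (record_id,)).fetchone()
    if row is None:
        raise KeyError(record_id)
    return row[0]


con = build_offset_db([("seq1", 0), ("seq2", 158), ("seq3", 312)],
                      batch_size=2)
print(get_offset(con, "seq2"))
```

Whether this is actually faster than updating the index as we go would need
to be measured on real files.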

I'm still testing this, but the core of the work is done I think.
Once we're happy with the public API, we can concentrate
on things like the SQLite schema, and optimising the code.

Peter

P.S. I know it will need a little work to fail gracefully on Python 2.4
when SQLite isn't installed.

From biopython at maubp.freeserve.co.uk  Mon Jun  7 14:23:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 19:23:05 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
Message-ID: <AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>

Peter wrote:
>...
>
> http://github.com/peterjc/biopython/tree/index-sqlite
>
> ... an SQLite index is used to hold
> the offsets. This means very low RAM requirements, but is a lot
> slower because the offsets are written to disk and the SQLite
> index is updated as we go. I expect this part can be optimised
> (e.g. try to build the index at the end, try committing in batches).

Having now tried using this on some files with tens of millions of
records, tuning how we use SQLite is going to be important.

Peter

From bioinformed at gmail.com  Mon Jun  7 17:10:42 2010
From: bioinformed at gmail.com (Kevin Jacobs <jacobs@bioinformed.com>)
Date: Mon, 7 Jun 2010 17:10:42 -0400
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
Message-ID: <AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>

On Mon, Jun 7, 2010 at 2:23 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Peter wrote:
> >...
> >
> > http://github.com/peterjc/biopython/tree/index-sqlite
> >
> > ... an SQLite index is used to hold
> > the offsets. This means very low RAM requirements, but is a lot
> > slower because the offsets are written to disk and the SQLite
> > index is updated as we go. I expect this part can be optimised
> > (e.g. try to build the index at the end, try committing in batches).
>
> Having now tried using this on some files with tens of millions of
> records, tuning how we use SQLite is going to be important.
>
>
Wouldn't a Berkeley database be much much faster for constructing simple key
to offset mappings?

-Kevin

From anaryin at gmail.com  Mon Jun  7 20:45:05 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 7 Jun 2010 19:45:05 -0500
Subject: [Biopython-dev] [GSOC] Report - Week 1
Message-ID: <AANLkTinnFWWM2ojjfBtBu2Qt4911cUZTJyI0J1hVYoji@mail.gmail.com>

Dear all,

Eric suggested that I write a weekly email wrapping up my progress, any
problems I encountered, new ideas, etc. So, here's week 1 :)

*Proposed Tasks:*
http://www.biopython.org/wiki/GSOC2010_Joao#Week_1_.5B31st_May_-_6th_June.5D
*Project's GitHub account:*
http://github.com/JoaoRodrigues/biopython/tree/GSOC2010

*Progress:*

*1. Renumbering Residues*

I wrote a small function in Structure.py
(http://github.com/JoaoRodrigues/biopython/blob/GSOC2010/Bio/PDB/Structure.py#L57)
that iterates over the residues in a chain and subtracts the original first
residue number. This keeps gaps intact. It worked on my machine for a set of
75 proteins I was working on. It also allows people to change the starting
residue number for whatever reason, the default being 1.

I had originally thought of having a SEQRES parsing function and using this
as a base for the new renumbering. However, most structures that lack
residues (gaps) still count them in the numbering. Since there is no parser
for SEQRES, I thought this to be the best option.

*Example*
...
s = p.get_structure('a', '2KSX.pdb')
s.renumber_residues()
s.renumber_residues(start=0)
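
The renumbering logic itself can be sketched independently of Bio.PDB. The
function below works on a plain list of residue numbers rather than on
residue objects, and its name and signature are only illustrative:

```python
def renumber(residue_numbers, start=1):
    """Shift residue numbers so the first becomes `start`, keeping gaps."""
    if not residue_numbers:
        return []
    shift = residue_numbers[0] - start
    return [number - shift for number in residue_numbers]


# Residues 27-29 and 33-34 are present; 30-32 are missing (a gap).
print(renumber([27, 28, 29, 33, 34]))           # [1, 2, 3, 7, 8]
print(renumber([27, 28, 29, 33, 34], start=0))  # [0, 1, 2, 6, 7]
```

Because every number is shifted by the same amount, the gap of three
missing residues is still visible after renumbering.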


*2. Disulphide bond search*

I originally proposed to use the NeighborSearch method, but I didn't know
that subtracting two atom objects gives their distance. I used this
instead.

I defined a threshold of 3 Å for an S-S bond, since the average length is
2.05 Å. I tried to find a paper or documentation from other software where
such a limit is already defined, but I didn't find any. Thus, I chose 3
because its results agreed with the SSBOND records. The user can provide a
threshold (integer or float) as an argument to make the search stricter or
broader.

The function first generates an iterator over all possible pairs of
cysteines in the protein. It then checks each pair and yields those whose
SG atoms are closer than the threshold. The result is also an iterator,
yielding tuples that each contain a pair of residue objects.

*Example*

...
s = p.get_structure('a', '2KSX.pdb')
[i for i in s.search_ss_bonds()]
[(<Residue CYS het=  resseq=1 icode= >, <Residue CYS het=  resseq=98 icode= >),
 (<Residue CYS het=  resseq=29 icode= >, <Residue CYS het=  resseq=138 icode= >),
 (<Residue CYS het=  resseq=12 icode= >, <Residue CYS het=  resseq=95 icode= >),
 (<Residue CYS het=  resseq=58 icode= >, <Residue CYS het=  resseq=66 icode= >),
 (<Residue CYS het=  resseq=180 icode= >, <Residue CYS het=  resseq=200 icode= >)]
len([i for i in s.search_ss_bonds(threshold=100)])
45
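
For readers without the branch at hand, the pairwise search described above
can be sketched in plain Python. This is not the code from the branch: the
coordinates are made up, and the input is assumed to be a dictionary mapping
residue numbers to SG atom coordinates instead of Bio.PDB residue objects:

```python
from itertools import combinations
from math import sqrt


def distance(a, b):
    """Euclidean distance between two (x, y, z) coordinates."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def search_ss_bonds(sg_coords, threshold=3.0):
    """Yield pairs of cysteine ids whose SG atoms lie within threshold."""
    for (id1, xyz1), (id2, xyz2) in combinations(sorted(sg_coords.items()), 2):
        if distance(xyz1, xyz2) < threshold:
            yield id1, id2


# Made-up SG coordinates keyed by residue number.
sg_coords = {
    1: (0.0, 0.0, 0.0),
    29: (10.0, 0.0, 0.0),
    98: (2.0, 0.0, 0.0),
    138: (10.0, 2.05, 0.0),
}
print(list(search_ss_bonds(sg_coords)))
```

With a very large threshold every pair is returned, so n cysteines give
n*(n-1)/2 pairs; 45 pairs would correspond to ten cysteines.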


*Problems:*

*3. Biological Unit*

I added code to parse_pdb_header to extract the REMARK 350 section. It
contains something like this (1IHM.pdb,
http://www.pdb.org/pdb/files/1IHM.pdb):

REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000
REMARK 350   BIOMT1   2  0.500000 -0.809017 -0.309017        0.00000
REMARK 350   BIOMT2   2  0.809017  0.309017  0.500000        0.00000
REMARK 350   BIOMT3   2 -0.309017 -0.500000  0.809017        0.00000
REMARK 350   BIOMT1   3 -0.309017 -0.500000 -0.809017        0.00000

I parse out the 4th column to identify each transformation. I store a 3x3
rotation matrix and the translation vector separately. It is then easy to
apply them to each atom record via the transform function.
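
A minimal sketch of this parsing step, written in plain Python rather than
taken from parse_pdb_header. The function names are hypothetical, and the
code assumes whitespace-separated fields with the three BIOMT rows of each
transformation appearing in order:

```python
def parse_biomt(lines):
    """Return {serial: (rotation, translation)} from REMARK 350 BIOMT lines."""
    transforms = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 8 or fields[:2] != ["REMARK", "350"]:
            continue
        if not fields[2].startswith("BIOMT"):
            continue
        serial = int(fields[3])  # 4th column identifies the transformation
        values = [float(value) for value in fields[4:]]
        rotation, translation = transforms.setdefault(serial, ([], []))
        rotation.append(values[:3])    # one row of the 3x3 matrix
        translation.append(values[3])  # matching translation component
    return transforms


def apply_transform(coord, rotation, translation):
    """Rotate then translate a single (x, y, z) coordinate."""
    return tuple(sum(r * c for r, c in zip(row, coord)) + t
                 for row, t in zip(rotation, translation))


lines = [
    "REMARK 350   BIOMT1   2  0.500000 -0.809017 -0.309017        0.00000",
    "REMARK 350   BIOMT2   2  0.809017  0.309017  0.500000        0.00000",
    "REMARK 350   BIOMT3   2 -0.309017 -0.500000  0.809017        0.00000",
]
rotation, translation = parse_biomt(lines)[2]
print(apply_transform((1.0, 0.0, 0.0), rotation, translation))
```

Real PDB files are fixed-column, so a robust parser would slice by column
position instead of splitting on whitespace.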

Now, the problem lies in what the output should be. We broke it down to two
main options:

a. Create a new structure object for each rotated/translated object, thus
making the final output a list of structures. This takes quite a while
actually. I tried this with a deepcopy method to copy each structure and it
took over 30 seconds on my machine for that PDB file above.

b. Add the new rotated objects as new chains in the original structure. This
is actually a good solution because it allows people to use other methods
(the SS search comes to mind) on quaternary structures. It also allows the
user to write a file with all the structures in their place using PDBIO
quite seamlessly. However, it might be complicated to deal with an excess of
chains, or if not all chains are supposed to be rotated (dunno if the case
actually exists).

My personal belief is that B is the way to go. Although it adulterates the
original structure with alien chains, it allows much greater flexibility. I
haven't tested it though.

----

Comments? :)

João [...] Rodrigues


From anaryin at gmail.com  Mon Jun  7 23:42:27 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 7 Jun 2010 22:42:27 -0500
Subject: [Biopython-dev] [GSOC] Report - Week 1
In-Reply-To: <AANLkTinnFWWM2ojjfBtBu2Qt4911cUZTJyI0J1hVYoji@mail.gmail.com>
References: <AANLkTinnFWWM2ojjfBtBu2Qt4911cUZTJyI0J1hVYoji@mail.gmail.com>
Message-ID: <AANLkTinyVRT9SfeFJDl3sfI7TTmtqjdRalFh3nbE5O3h@mail.gmail.com>

Just my own heads up and comment.

I thought of using MODEL records to hold the rotated structures. Citing the
PDB format guidelines:

> This record is used only when more than one model appears in an entry.
> *Generally, it is employed mainly for NMR structures.* The chemical
> connectivity should be the same for each model. ATOM, HETATM, ANISOU, and
> TER records for each model structure and are interspersed as needed
> between MODEL and ENDMDL records.

Since REMARK 350 seems to be an X-ray-exclusive feature and, conversely,
MODEL an NMR one, I believe this could also be a possible solution. I'm
adding the code I wrote to Git. There is a huge speed problem with that
deepcopy method. If someone has a faster/better alternative, it would be
great, as this takes around 2 seconds per matrix.

Best!

João [...] Rodrigues
@ http://www.biopython.org/wiki/User:Joaor




From thomas.hamelryck at gmail.com  Tue Jun  8 02:39:53 2010
From: thomas.hamelryck at gmail.com (Thomas Hamelryck)
Date: Tue, 8 Jun 2010 08:39:53 +0200
Subject: [Biopython-dev] [GSOC] Report - Week 1
In-Reply-To: <AANLkTinnFWWM2ojjfBtBu2Qt4911cUZTJyI0J1hVYoji@mail.gmail.com>
References: <AANLkTinnFWWM2ojjfBtBu2Qt4911cUZTJyI0J1hVYoji@mail.gmail.com>
Message-ID: <AANLkTikObrjMkhEUNgEeDozAK3i06aRjfi85D_1LA9kC@mail.gmail.com>

Hi all,

I think it's great that Bio.PDB is being updated.

Here are some remarks:

I haven't seen much discussion about the one key feature of Bio.PDB
that definitely needs to be improved: its speed. With the enormous
increase of the number of structures, extracting data using Bio.PDB is
too slow. Would be good to move some parts to C.

A second issue is nicely illustrated by the following code snippet:

> s = p.get_structure('a', '2KSX.pdb')
> [i for i in s.search_ss_bonds()]

I think this is NOT the way to do it. PDB files can contain anything:
RNA, DNA, sugars, small molecules... It is thus not a good idea to
directly associate protein-specific methods with the Structure class; it
will lead to a bloated Structure class and a lot of irrelevant methods
(e.g. search_ss_bonds is meaningless for a PDB file that contains RNA).

Currently, one creates Polypeptide objects from a Structure object
using a factory design pattern (via PPBuilder); the Polypeptide class
implements some protein specific methods. I believe that is a much
cleaner way to do it (though we need a Protein class that represents
collections of connected polypeptides). One can also make sure that
all such derived objects (Protein, NA, DNA,...) adhere to the same
interface by providing a suitable base class with shared functionality
- in that way, the whole thing is also extendible.

Something like:

s = p.get_structure('a', '2KSX.pdb')
pb = ProteinBuilder()
proteins = pb.build(s)
ssbridges = proteins.get_ss_bonds()

Here, "proteins" would represent a collection of polypeptide chains.
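
To make the shape of such a design a little more concrete, here is a toy
sketch. None of these classes exist in Bio.PDB; the names, the truncated
amino acid set and the list-of-lists "structure" are all assumptions made
for the example:

```python
class Molecule(object):
    """Shared base class for objects derived from a structure."""

    def __init__(self, chains):
        self.chains = chains

    def __len__(self):
        return len(self.chains)


class Protein(Molecule):
    """Collection of polypeptide chains with protein-only methods."""

    def count_cysteines(self):
        return sum(chain.count("CYS") for chain in self.chains)


class ProteinBuilder(object):
    """Factory: pick the protein chains out of a mixed structure."""

    # Truncated residue set, for brevity only.
    amino_acids = frozenset(["ALA", "CYS", "GLY", "SER"])

    def build(self, structure):
        return Protein([chain for chain in structure
                        if chain and all(res in self.amino_acids
                                         for res in chain)])


# Toy "structure": a list of chains, each a list of residue names.
structure = [["ALA", "CYS", "GLY"], ["A", "U", "G", "C"], ["CYS", "SER"]]
proteins = ProteinBuilder().build(structure)
print(len(proteins), proteins.count_cysteines())
```

The RNA chain never reaches the Protein object, so protein-specific methods
stay off the generic structure and a nucleic acid builder could follow the
same pattern.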

Cheers,

-Thomas

-- 
Thomas Hamelryck, Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://www.binf.ku.dk/research/structural_bioinformatics/

From lgautier at gmail.com  Tue Jun  8 03:00:10 2010
From: lgautier at gmail.com (Laurent)
Date: Tue, 08 Jun 2010 09:00:10 +0200
Subject: [Biopython-dev] Biopython-dev Digest, Vol 89, Issue 8
In-Reply-To: <mailman.2432.1275979197.3120.biopython-dev@lists.open-bio.org>
References: <mailman.2432.1275979197.3120.biopython-dev@lists.open-bio.org>
Message-ID: <4C0DEA7A.1020606@gmail.com>

On 08/06/10 08:39, biopython-dev-request at lists.open-bio.org wrote:
> On Mon, Jun 7, 2010 at 2:23 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
>> Peter wrote:
>>> ...
>>>
>>> http://github.com/peterjc/biopython/tree/index-sqlite
>>>
>>> ... an SQLite index is used to hold
>>> the offsets. This means very low RAM requirements, but is a lot
>>> slower because the offsets are written to disk and the SQLite
>>> index is updated as we go. I expect this part can be optimised
>>> (e.g. try to build the index at the end, try committing in batches).
>>
>> Having now tried using this on some files with tens of millions of
>> records, tuning how we use SQLite is going to be important.
>>
> Wouldn't a Berkeley database be much much faster for constructing simple key
> to offset mappings?
>
> -Kevin
>

Yes. If one is only looking for a key/value associative structure, the
NoSQL solutions will be faster (tokyocabinet seems to be one of the
fastest, up to 100x when compared to BerkeleyDB:
http://www.ioremap.net/node/235).

L.

From biopython at maubp.freeserve.co.uk  Tue Jun  8 05:35:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 10:35:15 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
Message-ID: <AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>

On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote:
> On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
>>
>> Having now tried using this on some files with tens of millions of
>> records, tuning how we use SQLite is going to be important.
>>
>>
> Wouldn't a Berkeley database be much much faster for constructing
> simple key to offset mappings?
>

Maybe - now that I've done the refactoring on Bio.SeqIO.index() to
allow two back ends (python dict or SQLite) trying a third (BDB) is
much easier. Did you know BDB was used in the old OBDA index
files? However, Python 2.6 deprecated bsddb (the Python Interface
to Berkeley DB library) and Python is pushing people to SQLite3
instead.

Peter

From krother at rubor.de  Tue Jun  8 05:59:43 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 8 Jun 2010 11:59:43 +0200
Subject: [Biopython-dev] Tested  Fixup branch for Bio.PDB
Message-ID: <df95eaa0e6f3c40d451630cb54332b3c-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWF9QSVlUUAw=-webmailer2@server04.webmailer.hosteurope.de>


Hi Eric,

I've checked out your pdbfixes branch and ran our 431 unit tests of
ModeRNA with it. The results did not change compared to the master Bio.PDB
branch, so for us everything is OK.

Details:
ModeRNA (http://www.genesilico.pl/moderna) engineers RNA 3D structures and
uses Bio.PDB for most of its operations: reading files,
adding/copying/manipulating residues/atoms, superimposing structures,
searching neighbors by KDTree, writing files.

Right, the tests most probably did not depend directly on the code you
changed, but as I understand it you wanted to make sure the branch didn't
break anything by accident.

Best Regards,
    Kristian


From bioinformed at gmail.com  Tue Jun  8 07:00:44 2010
From: bioinformed at gmail.com (Kevin Jacobs <jacobs@bioinformed.com>)
Date: Tue, 8 Jun 2010 07:00:44 -0400
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
Message-ID: <AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>

On Tue, Jun 8, 2010 at 5:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote:
> > On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
> >>
> >> Having now tried using this on some files with tens of millions of
> >> records, tuning how we use SQLite is going to be important.
> >>
> > Wouldn't a Berkeley database be much much faster for constructing
> > simple key to offset mappings?
>
> Maybe - now that I've done the refactoring on Bio.SeqIO.index() to
> allow two back ends (python dict or SQLite) trying a third (BDB) is
> much easier. Did you know BDB was used in the old OBDA index
> files? However, Python 2.6 deprecated bsddb (the Python Interface
> to Berkeley DB library) and Python is pushing people to SQLite3
> instead.
>
>
Hi Peter,

I am aware that SQLite is taking over the job of serving as the default
embedded database for Python and am in vigorous agreement with that trend.
I use SQLite for a wide range of tasks and am extremely happy with it for
most applications. Unfortunately, for pure key-value mapping tasks, I've
found SQLite to be 4-10x slower than a well-tuned BDB tree, even with
batched updates and using the most aggressive SQLite performance pragmas. My
results may not be typical, but I thought I'd raise the issue given the
magnitude of the performance difference.

Best regards,
-Kevin

From mjldehoon at yahoo.com  Tue Jun  8 08:19:28 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 8 Jun 2010 05:19:28 -0700 (PDT)
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <AANLkTikenpBKpMWK7pPSk6RYhCW5mvb5LXjxpFilopvD@mail.gmail.com>
Message-ID: <14055.47665.qm@web62401.mail.re1.yahoo.com>

--- On Mon, 6/7/10, Peter <biopython at maubp.freeserve.co.uk> wrote:
> I agree that records from all the different BLAST output
> formats should be represented by a common base class -
> but not necessarily the same class.
> For example, the default plain text and XML formats include
> the pairwise alignments, but the tabular output does not. To
> me having a sub-class which stores the pairwise alignments seems
> natural here.

Why do we need a sub-class? We don't do this in Bio.SeqIO, where GenBank files contain much more information than Fasta files, but both are represented by a SeqRecord.

Best,
--Michiel.


From biopython at maubp.freeserve.co.uk  Tue Jun  8 08:32:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 13:32:05 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <14055.47665.qm@web62401.mail.re1.yahoo.com>
References: <AANLkTikenpBKpMWK7pPSk6RYhCW5mvb5LXjxpFilopvD@mail.gmail.com>
	<14055.47665.qm@web62401.mail.re1.yahoo.com>
Message-ID: <AANLkTimVwjENsgiOgMw167HZ5IVkUAXg7aDtNB3xQd0K@mail.gmail.com>

On Tue, Jun 8, 2010 at 1:19 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> --- On Mon, 6/7/10, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> I agree that records from all the different BLAST output
>> formats should be represented by a common base class -
>> but not necessarily the same class.
>> For example, the default plain text and XML formats include
>> the pairwise alignments, but the tabular output does not. To
>> me having a sub-class which stores the pairwise alignments seems
>> natural here.
>
> Why do we need a sub-class? We don't do this in Bio.SeqIO,
> where GenBank files contain much more information than Fasta
> files, but both are represented by a SeqRecord.

OK, I guess you could have some properties which are left empty
(like the annotations dictionary or features list in a SeqRecord
from a FASTA file).
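
In code, the "one class, some properties left empty" approach might look
like the following sketch. The class and attribute names are hypothetical
and only meant to show the idea, not a proposed Bio.Blast API:

```python
class BlastHit(object):
    """Hypothetical single hit class shared by all BLAST output formats."""

    def __init__(self, subject_id, e_value, alignments=None):
        self.subject_id = subject_id
        self.e_value = e_value
        # Left empty when the source format (e.g. tabular) has no alignments.
        self.alignments = alignments if alignments is not None else []


from_xml = BlastHit("subjectA", 1e-30, alignments=[("MKV-LA", "MKVALA")])
from_table = BlastHit("subjectA", 1e-30)
print(len(from_xml.alignments), len(from_table.alignments))
```

Code that only needs identifiers and scores can then treat hits from either
format the same way, much as SeqRecord users do with FASTA and GenBank.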

Peter

From mjldehoon at yahoo.com  Tue Jun  8 09:44:01 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 8 Jun 2010 06:44:01 -0700 (PDT)
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <AANLkTimVwjENsgiOgMw167HZ5IVkUAXg7aDtNB3xQd0K@mail.gmail.com>
Message-ID: <756890.46421.qm@web62404.mail.re1.yahoo.com>

--- On Tue, 6/8/10, Peter <biopython at maubp.freeserve.co.uk> wrote:
> > Why do we need a sub-class? We don't do this in
> > Bio.SeqIO, where GenBank files contain much more
> > information than Fasta files, but both are
> > represented by a SeqRecord.
> 
> OK, I guess you could have some properties which are left
> empty
> (like the annotations dictionary or features list in a
> SeqRecord from a FASTA file).

I would prefer that, as it keeps things simple and consistent with other parts of Biopython. But let's see how it goes. Over the weekend I'll set up a rudimentary Blast parser and record so we can see what it would look like in practice.

--Michiel


From bpederse at gmail.com  Tue Jun  8 11:47:18 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Tue, 8 Jun 2010 08:47:18 -0700
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
Message-ID: <AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>

On Tue, Jun 8, 2010 at 4:00 AM, Kevin Jacobs <jacobs at bioinformed.com>
<bioinformed at gmail.com> wrote:
> On Tue, Jun 8, 2010 at 5:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
>> On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote:
>> > On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
>> >>
>> >> Having now tried using this on some files with tens of millions of
>> >> records, tuning how we use SQLite is going to be important.
>> >>
>> > Wouldn't a Berkeley database be much much faster for constructing
>> > simple key to offset mappings?
>>
>> Maybe - now that I've done the refactoring on Bio.SeqIO.index() to
>> allow two back ends (python dict or SQLite) trying a third (BDB) is
>> much easier. Did you know BDB was used in the old OBDA index
>> files? However, Python 2.6 deprecated bsddb (the Python Interface
>> to Berkeley DB library) and Python is pushing people to SQLite3
>> instead.
>>
>>
> Hi Peter,
>
> I am aware that SQLite is taking over the job of serving as the default
> embedded database for Python and am in vigorous agreement with that trend.
> I use SQLite for a wide range of tasks and am extremely happy with it for
> most applications. Unfortunately, for pure key-value mapping tasks, I've
> found SQLite to be 4-10x slower than a well-tuned BDB tree, even with
> batched updates and using the most aggressive SQLite performance pragmas. My
> results may not be typical, but I thought I'd raise the issue given the
> magnitude of the performance difference.
>
> Best regards,
> -Kevin
>

my results may not be typical either, but using an earlier version of
peter's sqlite biopython branch and comparing to screed
(http://github.com/acr/screed), and my file-index
(http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i
found that biopython's implementation is at most, a bit more than 2x
slower. and it does the fastq parsing much more rigorously.

also, i didn't see much difference between berkeleydb and
tokyocabinet--though the ctypes-based TC wrapper i was using has since
been streamlined.
here's what i saw for 15+ million records with this script:
http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py

/opt/src/methylcode/data/s_1_sequence.txt
benchmarking fastq file with 15646356 records (62585424 lines)
performing 500000 random queries

screed
------
create: 704.764
search: 51.717

biopython-sqlite
----------------
create: 727.868
search: 92.947

fileindex
---------
create: 294.356
search: 53.701


From biopython at maubp.freeserve.co.uk  Tue Jun  8 12:35:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 17:35:07 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
Message-ID: <AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>

On Tue, Jun 8, 2010 at 4:47 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>
> my results may not be typical either, but using an earlier version of
> peter's sqlite biopython branch and comparing to screed
> (http://github.com/acr/screed), and my file-index
> (http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i
> found that biopython's implementation is at most, a bit more than 2x
> slower. and it does the fastq parsing much more rigorously.
>
> also, i didn't see much difference between berkeleydb and
> tokyocabinet--though the ctypes-based TC wrapper i was using has since
> been streamlined.
> here's what i saw for 15+ million records with this script:
> http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py
>
> /opt/src/methylcode/data/s_1_sequence.txt
> benchmarking fastq file with 15646356 records (62585424 lines)
> performing 500000 random queries
>
> screed
> ------
> create: 704.764
> search: 51.717
>
> biopython-sqlite
> ----------------
> create: 727.868
> search: 92.947
>
> fileindex
> ---------
> create: 294.356
> search: 53.701

Are you using a recent version of screed (with SQLite internally)?

Which back end are your "fileindex" numbers for? BDB?

I'd say that the slow "search" from (the old branch of) Biopython is
down to our FASTQ parsing time, which includes lots of object
creation. The get_raw method can be useful here depending on
what you want to achieve:
http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/
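
To make the offset-lookup idea concrete, here is a minimal standard-library sketch of indexing a FASTQ file by byte offset and pulling back a raw record without parsing it into objects. It is not the Bio.SeqIO code: the helper names build_offset_index and fetch_raw are made up for illustration, and it assumes strict four-line FASTQ records (no line wrapping).

```python
import io


def build_offset_index(handle):
    """Map each record id to (offset, length) in bytes.

    Assumption of this sketch: every record is exactly four lines.
    """
    index = {}
    while True:
        start = handle.tell()
        title = handle.readline()
        if not title:
            break  # end of file
        if not title.startswith(b"@"):
            raise ValueError("expected '@' at offset %d" % start)
        for _ in range(3):  # skip sequence, '+' line and quality line
            handle.readline()
        key = title[1:].split(None, 1)[0].decode("ascii")
        index[key] = (start, handle.tell() - start)
    return index


def fetch_raw(handle, index, key):
    """Return the raw bytes of one record, with no parsing at all."""
    offset, length = index[key]
    handle.seek(offset)
    return handle.read(length)


data = b"@r1\nACGT\n+\nIIII\n@r2 desc\nGGCC\n+\n!!!!\n"
handle = io.BytesIO(data)
index = build_offset_index(handle)
raw = fetch_raw(handle, index, "r2")
```

A lookup here costs one seek and one read, so the time spent on a random query is dominated by whatever parsing is done afterwards.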

The version you tried didn't do anything clever with the SQLite
indexes, batched inserts etc. I'm hoping the current code will be
faster (although there is likely a penalty from having two switchable
back ends). Brent, could you re-run this benchmark with this code:
http://github.com/peterjc/biopython/tree/index-sqlite-batched

You'll need to change the Biopython call in your test script from
this (it was renamed before landing on the trunk):

fi = SeqIO.indexed_dict(f, idx, "fastq")

to this:

fi = SeqIO.index(f, idx, "fastq", db=True)

or give an explicit filename:

fi = SeqIO.index(f, idx, "fastq", db="/tmp/filename.idx")

where db is the new parameter for controlling where and if
the lookup table is stored on disk.

Peter

From anaryin at gmail.com  Tue Jun  8 13:10:48 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 8 Jun 2010 12:10:48 -0500
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com> 
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com> 
	<AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com>
Message-ID: <AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com>

Hello all,

I'm replying here to what Thomas wrote on the GSOC Report thread because it
seems a better place.

> PDB files can contain anything: RNA, DNA, sugars, small molecules... It is
> thus not a good idea to
> directly associate protein-specific methods to the structure class; it will
> lead to a bloated Structure class and a lot of irrelevant methods (ie.
> search_ss_bonds is meaningless for a PDB file that contains RNA).


Agree.

> Currently, one creates Polypeptide objects from a Structure object using a
> factory design pattern (via PPBuilder); the Polypeptide class implements
> some protein specific methods. I believe that is a much cleaner way to do it
> (though we need a Protein class that represents collections of connected
> polypeptides). One can also make sure that all such derived objects
> (Protein, NA, DNA,...) adhere to the same interface by providing a suitable
> base class with shared functionality - in that way, the whole thing is also
> extendible.
>

I think there has been already some discussion about this. My personal
opinion/suggestion is having a structure like:

Bio.PDB/
_______/Protein.py
_______/DNA.py
_______/RNA.py

that would translate to a usage of something like:

from Bio.PDB import Protein
structure = Protein('1ABC.pdb')
structure.search_ss_bonds()

but not

structure.calc_melting_temperature() (just an example)

Protein() would call PDBParser(). It could also include, to a certain
extent, an Alphabet-like feature to ensure residue names are OK (this ties in
with this proposal <http://www.biopython.org/wiki/GSOC2010_Joao#Residue_name_normalisation>).
I believe this is close to what you said: having a class that basically
abstracts what we do now (Bio.PDB.PDBParser) and allows for
molecule-specific methods. However, it also leads to some problems:
Protein/DNA complexes come to mind.
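
A rough sketch of how a shared base class with molecule-specific subclasses could look. Every name here (MoleculeBase, the constructor taking residue names, the method bodies) is hypothetical and only illustrates the shape of the proposal; none of it is existing Bio.PDB code, and a real Protein() would build itself via PDBParser rather than from a list of names.

```python
class MoleculeBase(object):
    """Hypothetical base class holding functionality shared by all molecules."""

    def __init__(self, residue_names):
        self.residue_names = list(residue_names)

    def __len__(self):
        return len(self.residue_names)


class Protein(MoleculeBase):
    """Protein-only methods live here, not on a generic Structure."""

    def search_ss_bonds(self):
        # Placeholder: real disulfide detection needs atom coordinates.
        # Here we only list the positions of cysteines as candidates.
        return [i for i, name in enumerate(self.residue_names) if name == "CYS"]


class DNA(MoleculeBase):
    """DNA-only methods live here."""

    def calc_melting_temperature(self):
        # Wallace rule of thumb for short oligos: 2*(A+T) + 4*(G+C).
        at = sum(1 for name in self.residue_names if name in ("DA", "DT"))
        gc = sum(1 for name in self.residue_names if name in ("DG", "DC"))
        return 2 * at + 4 * gc


protein = Protein(["CYS", "ALA", "CYS"])
dna = DNA(["DA", "DT", "DG", "DC"])
```

With this layout protein.search_ss_bonds() exists while protein.calc_melting_temperature() does not, which is the separation described above. A Protein/DNA complex still has no obvious home in this scheme.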

How does this sound? I think it goes with what Eric said in the first post
of this thread and what Thomas replied in the GSOC thread. We should also
change the PDB name to Struct to better reflect the purpose of the module.
All of the other additions like Bio.Struct.WWW would still apply. And I
don't see a major problem in breaking the existing code by adding this.

João


From tiagoantao at gmail.com  Tue Jun  8 15:12:00 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 8 Jun 2010 20:12:00 +0100
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
References: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
Message-ID: <AANLkTimsuefKtzbYOWGHpCfs0xfqMWfHdPVRn6UfHl8L@mail.gmail.com>

On Mon, Jun 7, 2010 at 10:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Comments?

Maybe put this on the wiki as doc for good practice?

From biopython at maubp.freeserve.co.uk  Tue Jun  8 15:41:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 20:41:03 +0100
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To: <AANLkTimsuefKtzbYOWGHpCfs0xfqMWfHdPVRn6UfHl8L@mail.gmail.com>
References: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
	<AANLkTimsuefKtzbYOWGHpCfs0xfqMWfHdPVRn6UfHl8L@mail.gmail.com>
Message-ID: <AANLkTin452_8Ka5Y_dLlggF6B3SnvxE6jWru5Hhnr-sQ@mail.gmail.com>

2010/6/8 Tiago Antão <tiagoantao at gmail.com>:
> On Mon, Jun 7, 2010 at 10:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Comments?
>
> Maybe put this on the wiki as doc for good practice?

So this does seem like a sensible approach (for those
of us with commit access to the main repository)?

We can add it to the git usage page then...
http://www.biopython.org/wiki/GitUsage

Peter


From eric.talevich at gmail.com  Tue Jun  8 17:45:42 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 8 Jun 2010 17:45:42 -0400
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
References: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
Message-ID: <AANLkTiluRrKJ9AhHHIVwUSm0zXqpyQqn-TIVQUZHkBBF@mail.gmail.com>

On Mon, Jun 7, 2010 at 5:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi all,
>
> I thought I'd write down some notes about how I've been using git recently.
> This may be of interest to any of the other core developers (those of us
> with read-write access to the main repository), and I might get some good
> tips from any discussion. The key point is that I have read+write access
> to two repositories on github (the official repository AND my own fork),
> so there are different advantages/disadvantages about which I choose
> to work with directly as my main repository.
>
> [...]
>
> Instead, I have a github repository of my own (what github calls a
> fork), and I push branches there.
>
> http://github.com/biopython/biopython - the official branch(es)
> http://github.com/peterjc/biopython - my branches
>
> How does this work in practice? Like this - I clone the master
> and add a reference to my repository (and I do the same when I
> want to grab a branch from another developer):
>
> git clone git at github.com:biopython/biopython.git
> cd biopython
> git remote add peterjc git at github.com:peterjc/biopython.git
> git fetch peterjc
>
> Then make a new local branch as usual, and when ready to share
> it publicly, I push it to *my* repository on github:
>
> git branch new-work
> git checkout new-work
> git commit ...
> git push peterjc new-work
>
> This would then appear as a new-work branch on my github page.
> Then if I (or someone else) want to access these branches later
> (e.g. from another machine), just check out a tracking branch for
> the remote branch. For example,
>
> git clone git at github.com:biopython/biopython.git
> cd biopython
> git remote add peterjc git at github.com:peterjc/biopython.git
> git fetch peterjc
> git checkout -t peterjc/seqio-imgt
>
> This then looks like a normal branch (called just "seqio-imgt" in
> this example), but git knows it is linked to the remote branch on
> the "peterjc" repository (not the origin which is the "official"
> repository).
>

This looks reasonable to me. I'd add that the procedure to delete a public
branch from your personal fork on GitHub is a little obscure:

git branch -a   # list local and remote branches
git branch -d new-work   # delete a local branch that's been merged already
git push peterjc :new-work  # delete the public branch from GitHub

This doesn't do what you'd expect:
git branch -d -r peterjc/new-work

That only removes your local reference to the public branch; the branch
is still visible on GitHub.

(It's kind of hard to find in the GitHub documentation.)


> I'd have to check, but I guess that if the original git clone is done
> with git://github.com/biopython/biopython.git instead (read only
> access) the same procedure could be used by non core devs.
> However, I'm not sure this is clearer for them. I think the current
> procedure (on our wiki) where you add a remote reference to
> the "upstream" official repository works better in this case.
>

I still have an "upstream" reference to the main repo. I wouldn't want to
accidentally push something foolish to the main repo with a stray "git
push"... better to have the safe thing happen by default.

If the initial clone was from biopython master, and you later create a
personal fork on GitHub, then it's not too hard to switch the references
around in your local repo to make the public fork your "origin".

-Eric

From bugzilla-daemon at portal.open-bio.org  Tue Jun  8 18:52:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Jun 2010 18:52:28 -0400
Subject: [Biopython-dev] [Bug 3096] New: PPBuilder build_peptides bugs
Message-ID: <bug-3096-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3096

           Summary: PPBuilder build_peptides bugs
           Product: Biopython
           Version: Not Applicable
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: skong at zymeworks.com


Given a chain of backbone-connected residues 'IXRGXTGL' that contains two
non-standard amino acids 'X', building peptides with a builder that accepts
only standard amino acids should return two peptides, 'RG' and 'TGL'. 'I'
should not be returned as a peptide since it is just one residue. Currently
Biopython returns 'IXGXGL', which shows two bugs:

1. A standard amino acid (R, then T) is skipped after each X, while X is kept
(X should be skipped instead, not R or T). Related to
http://bugzilla.open-bio.org/show_bug.cgi?id=2910 and
http://lists.open-bio.org/pipermail/biopython/2009-September/005532.html
2. One peptide is returned even though, after filtering, the two X residues
which connect 'I', 'RG' and 'TGL' are no longer present, so the fragment
'IRGTGL' cannot be considered a valid peptide without the two Xs connecting
its parts.

The above sequence 'IXRGXTGL' is taken from 1bfe and mutated. The 'mutation'
referred to here is simply renaming the residue to something that is not
standard and is therefore represented as 'X'.

Each solution proposed below is meant to fix the respective bug above:
1. Insert (not accept(prev) or not accept(next)) after the "if aa_only" check
at line 299 of Bio/PDB/Polypeptide.py
2. Insert pp=None when either of the residues compared is filtered, at line
300 of Bio/PDB/Polypeptide.py


Amino acid filtering bug in method build_peptides() of class _PPBuilder in
Bio/PDB/Polypeptide.py:

Original:
        for chain in chain_list:
            chain_it=iter(chain)
            prev=chain_it.next()
            pp=None
            for next in chain_it:
                if aa_only and not accept(prev):
                    prev=next
                    continue
                if is_connected(prev, next):
                    if pp is None:
                        pp=Polypeptide()
                        pp.append(prev)
                        pp_list.append(pp)
                    pp.append(next)
                else:
                    pp=None
                prev=next
        return pp_list


Fixed:

        for chain in chain_list:
            chain_it=iter(chain)
            prev=chain_it.next()
            pp=None
            for next in chain_it:
                if aa_only and (not accept(prev) or not accept(next)):
                    prev=next; pp=None
                    continue
                if is_connected(prev, next):
                    if pp is None:
                        pp=Polypeptide()
                        pp.append(prev)
                        pp_list.append(pp)
                    pp.append(next)
                else:
                    pp=None
                prev=next
        return pp_list
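
The behaviour of the two loops can be reproduced without Bio.PDB on a toy chain. The sketch below makes some simplifying assumptions that are not in the report: each residue is a single letter, all neighbours count as connected, 'X' is the only residue that accept() rejects, and the function names are made up for illustration.

```python
def accept(residue):
    """Toy stand-in for _accept(): only 'X' is non-standard."""
    return residue != "X"


def build_original(chain):
    """Mirror of the original loop: only prev is checked."""
    pp_list = []
    chain_it = iter(chain)
    prev = next(chain_it)
    pp = None
    for nxt in chain_it:
        if not accept(prev):
            prev = nxt  # nxt is never appended, so it is silently dropped
            continue
        if pp is None:
            pp = [prev]
            pp_list.append(pp)
        pp.append(nxt)
        prev = nxt
    return ["".join(pp) for pp in pp_list]


def build_fixed(chain):
    """Mirror of the proposed loop: check both residues, break the peptide."""
    pp_list = []
    chain_it = iter(chain)
    prev = next(chain_it)
    pp = None
    for nxt in chain_it:
        if not accept(prev) or not accept(nxt):
            prev = nxt
            pp = None  # a rejected residue ends the current peptide
            continue
        if pp is None:
            pp = [prev]
            pp_list.append(pp)
        pp.append(nxt)
        prev = nxt
    return ["".join(pp) for pp in pp_list]


original = build_original("IXRGXTGL")  # ['IXGXGL']
fixed = build_fixed("IXRGXTGL")        # ['RG', 'TGL']
```

On this toy input the original loop keeps both X residues and loses R and T, while the proposed loop yields the two expected fragments and drops the lone 'I'.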

Attached here is the code used to test the above case, with and without
mutations, and with and without standard amino acid filtering. The case without
mutation is just to show that the backbone atoms of the mutated version are
connected:

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder, is_aa

class StandardAABuilder(PPBuilder):
    """Polypeptide builder which accepts only standard amino acids."""
    def _accept(self, residue):
        return is_aa(residue, standard=True)

def extract_peptides(model):
    """Extract the peptides from a model with the default PPBuilder.
    Returns a list of peptide sequences as strings."""
    output = []
    for peptide in PPBuilder().build_peptides(model):
        seq = str(peptide.get_sequence())
        output.append(seq)
    return output

def extract_peptides_saa(model):
    """Extract the peptides from a model, standard amino acids only.
    Returns a list of peptide sequences as strings."""
    output = []
    for peptide in StandardAABuilder().build_peptides(model):
        seq = str(peptide.get_sequence())
        output.append(seq)
    return output

if __name__ == '__main__':

    # Original fragment: all residues are standard amino acids
    oripdb = open('chopped_pdb1bfe.ent')
    sto = PDBParser().get_structure('', oripdb)
    seqao = extract_peptides(sto)
    seqbo = extract_peptides_saa(sto)
    print('ori seq all')
    print(seqao)
    print('ori seq standard only')
    print(seqbo)

    # Mutated fragment: two residues renamed to non-standard names
    pdb = open('chopped_mutated_pdb1bfe.ent')
    st = PDBParser().get_structure('', pdb)
    seqa = extract_peptides(st)
    seqb = extract_peptides_saa(st)
    print('mut seq all')
    print(seqa)
    print('mut seq standard only')
    print(seqb)


Attached below are the two fragments of PDB files, pre and post mutated.

chopped_pdb1bfe.ent
ATOM     85  N   ILE A 316      37.386  71.217  31.070  1.00 36.97           N  
ATOM     86  CA  ILE A 316      38.311  71.290  29.949  1.00 33.71           C  
ATOM     87  C   ILE A 316      37.634  72.103  28.862  1.00 33.93           C  
ATOM     88  O   ILE A 316      36.415  72.216  28.839  1.00 36.46           O  
ATOM     89  CB  ILE A 316      38.651  69.876  29.404  1.00 35.79           C  
ATOM     90  CG1 ILE A 316      39.331  69.049  30.501  1.00 36.78           C  
ATOM     91  CG2 ILE A 316      39.572  69.979  28.187  1.00 37.71           C  
ATOM     92  CD1 ILE A 316      39.881  67.724  30.023  1.00 39.20           C  
ATOM     93  N   HIS A 317      38.425  72.679  27.969  1.00 35.61           N  
ATOM     94  CA  HIS A 317      37.880  73.473  26.881  1.00 37.92           C  
ATOM     95  C   HIS A 317      38.360  72.928  25.540  1.00 37.79           C  
ATOM     96  O   HIS A 317      39.463  73.240  25.094  1.00 37.44           O  
ATOM     97  CB  HIS A 317      38.303  74.930  27.052  1.00 35.19           C  
ATOM     98  CG  HIS A 317      37.888  75.519  28.363  1.00 35.76           C  
ATOM     99  ND1 HIS A 317      36.611  75.981  28.602  1.00 37.74           N  
ATOM    100  CD2 HIS A 317      38.575  75.701  29.516  1.00 37.59           C  
ATOM    101  CE1 HIS A 317      36.529  76.420  29.844  1.00 38.74           C  
ATOM    102  NE2 HIS A 317      37.706  76.262  30.421  1.00 36.76           N  
ATOM    103  N   ARG A 318      37.527  72.109  24.905  1.00 38.78           N  
ATOM    104  CA  ARG A 318      37.884  71.512  23.627  1.00 42.04           C  
ATOM    105  C   ARG A 318      38.469  72.559  22.699  1.00 45.14           C  
ATOM    106  O   ARG A 318      39.592  72.425  22.205  1.00 42.05           O  
ATOM    107  CB  ARG A 318      36.657  70.880  22.967  1.00 42.93           C  
ATOM    108  CG  ARG A 318      36.934  70.321  21.576  1.00 38.60           C  
ATOM    109  CD  ARG A 318      35.654  70.038  20.821  1.00 35.39           C  
ATOM    110  NE  ARG A 318      34.624  69.538  21.724  1.00 34.96           N  
ATOM    111  CZ  ARG A 318      34.539  68.278  22.141  1.00 31.51           C  
ATOM    112  NH1 ARG A 318      35.419  67.373  21.736  1.00 25.19           N  
ATOM    113  NH2 ARG A 318      33.579  67.929  22.983  1.00 29.10           N  
ATOM    114  N   GLY A 319      37.690  73.604  22.461  1.00 49.96           N  
ATOM    115  CA  GLY A 319      38.138  74.668  21.592  1.00 55.53           C  
ATOM    116  C   GLY A 319      38.459  74.219  20.180  1.00 58.85           C  
ATOM    117  O   GLY A 319      37.583  73.766  19.440  1.00 58.98           O  
ATOM    118  N   SER A 320      39.734  74.334  19.823  1.00 61.64           N  
ATOM    119  CA  SER A 320      40.219  73.992  18.493  1.00 63.16           C  
ATOM    120  C   SER A 320      40.212  72.517  18.110  1.00 65.27           C  
ATOM    121  O   SER A 320      39.558  72.127  17.145  1.00 65.12           O  
ATOM    122  CB  SER A 320      41.634  74.542  18.316  1.00 65.36           C  
ATOM    123  OG  SER A 320      42.124  74.255  17.019  1.00 72.05           O  
ATOM    124  N   THR A 321      40.955  71.702  18.853  1.00 67.43           N  
ATOM    125  CA  THR A 321      41.049  70.274  18.562  1.00 67.73           C  
ATOM    126  C   THR A 321      40.220  69.430  19.529  1.00 66.41           C  
ATOM    127  O   THR A 321      39.244  69.917  20.095  1.00 70.21           O  
ATOM    128  CB  THR A 321      42.517  69.810  18.620  1.00 70.22           C  
ATOM    129  OG1 THR A 321      42.613  68.453  18.169  1.00 77.03           O  
ATOM    130  CG2 THR A 321      43.049  69.915  20.045  1.00 72.07           C  
ATOM    131  N   GLY A 322      40.608  68.168  19.707  1.00 61.22           N  
ATOM    132  CA  GLY A 322      39.892  67.286  20.614  1.00 53.23           C  
ATOM    133  C   GLY A 322      40.037  67.705  22.065  1.00 48.00           C  
ATOM    134  O   GLY A 322      40.138  68.892  22.372  1.00 50.41           O  
ATOM    135  N   LEU A 323      40.044  66.734  22.968  1.00 41.92           N  
ATOM    136  CA  LEU A 323      40.190  67.033  24.385  1.00 35.58           C  
ATOM    137  C   LEU A 323      41.613  66.738  24.874  1.00 31.41           C  
ATOM    138  O   LEU A 323      41.932  66.921  26.046  1.00 30.47           O  
ATOM    139  CB  LEU A 323      39.160  66.240  25.191  1.00 35.76           C  
ATOM    140  CG  LEU A 323      37.716  66.576  24.802  1.00 39.50           C  
ATOM    141  CD1 LEU A 323      36.733  65.796  25.670  1.00 38.15           C  
ATOM    142  CD2 LEU A 323      37.493  68.074  24.955  1.00 38.58           C

chopped_mutated_pdb1bfe.ent
ATOM     85  N   ILE A 316      37.386  71.217  31.070  1.00 36.97           N  
ATOM     86  CA  ILE A 316      38.311  71.290  29.949  1.00 33.71           C  
ATOM     87  C   ILE A 316      37.634  72.103  28.862  1.00 33.93           C  
ATOM     88  O   ILE A 316      36.415  72.216  28.839  1.00 36.46           O  
ATOM     89  CB  ILE A 316      38.651  69.876  29.404  1.00 35.79           C  
ATOM     90  CG1 ILE A 316      39.331  69.049  30.501  1.00 36.78           C  
ATOM     91  CG2 ILE A 316      39.572  69.979  28.187  1.00 37.71           C  
ATOM     92  CD1 ILE A 316      39.881  67.724  30.023  1.00 39.20           C  
ATOM     93  N   HIE A 317      38.425  72.679  27.969  1.00 35.61           N  
ATOM     94  CA  HIE A 317      37.880  73.473  26.881  1.00 37.92           C  
ATOM     95  C   HIE A 317      38.360  72.928  25.540  1.00 37.79           C  
ATOM     96  O   HIE A 317      39.463  73.240  25.094  1.00 37.44           O  
ATOM     97  CB  HIE A 317      38.303  74.930  27.052  1.00 35.19           C  
ATOM     98  CG  HIE A 317      37.888  75.519  28.363  1.00 35.76           C  
ATOM     99  ND1 HIE A 317      36.611  75.981  28.602  1.00 37.74           N  
ATOM    100  CD2 HIE A 317      38.575  75.701  29.516  1.00 37.59           C  
ATOM    101  CE1 HIE A 317      36.529  76.420  29.844  1.00 38.74           C  
ATOM    102  NE2 HIE A 317      37.706  76.262  30.421  1.00 36.76           N
ATOM    103  N   ARG A 318      37.527  72.109  24.905  1.00 38.78           N  
ATOM    104  CA  ARG A 318      37.884  71.512  23.627  1.00 42.04           C  
ATOM    105  C   ARG A 318      38.469  72.559  22.699  1.00 45.14           C  
ATOM    106  O   ARG A 318      39.592  72.425  22.205  1.00 42.05           O  
ATOM    107  CB  ARG A 318      36.657  70.880  22.967  1.00 42.93           C  
ATOM    108  CG  ARG A 318      36.934  70.321  21.576  1.00 38.60           C  
ATOM    109  CD  ARG A 318      35.654  70.038  20.821  1.00 35.39           C  
ATOM    110  NE  ARG A 318      34.624  69.538  21.724  1.00 34.96           N  
ATOM    111  CZ  ARG A 318      34.539  68.278  22.141  1.00 31.51           C  
ATOM    112  NH1 ARG A 318      35.419  67.373  21.736  1.00 25.19           N  
ATOM    113  NH2 ARG A 318      33.579  67.929  22.983  1.00 29.10           N  
ATOM    114  N   GLY A 319      37.690  73.604  22.461  1.00 49.96           N  
ATOM    115  CA  GLY A 319      38.138  74.668  21.592  1.00 55.53           C  
ATOM    116  C   GLY A 319      38.459  74.219  20.180  1.00 58.85           C  
ATOM    117  O   GLY A 319      37.583  73.766  19.440  1.00 58.98           O  
ATOM    118  N   XQQ A 320      39.734  74.334  19.823  1.00 61.64           N  
ATOM    119  CA  XQQ A 320      40.219  73.992  18.493  1.00 63.16           C  
ATOM    120  C   XQQ A 320      40.212  72.517  18.110  1.00 65.27           C  
ATOM    121  O   XQQ A 320      39.558  72.127  17.145  1.00 65.12           O  
ATOM    122  CB  XQQ A 320      41.634  74.542  18.316  1.00 65.36           C  
ATOM    123  OG  XQQ A 320      42.124  74.255  17.019  1.00 72.05           O
ATOM    124  N   THR A 321      40.955  71.702  18.853  1.00 67.43           N  
ATOM    125  CA  THR A 321      41.049  70.274  18.562  1.00 67.73           C  
ATOM    126  C   THR A 321      40.220  69.430  19.529  1.00 66.41           C  
ATOM    127  O   THR A 321      39.244  69.917  20.095  1.00 70.21           O  
ATOM    128  CB  THR A 321      42.517  69.810  18.620  1.00 70.22           C  
ATOM    129  OG1 THR A 321      42.613  68.453  18.169  1.00 77.03           O  
ATOM    130  CG2 THR A 321      43.049  69.915  20.045  1.00 72.07           C  
ATOM    131  N   GLY A 322      40.608  68.168  19.707  1.00 61.22           N  
ATOM    132  CA  GLY A 322      39.892  67.286  20.614  1.00 53.23           C  
ATOM    133  C   GLY A 322      40.037  67.705  22.065  1.00 48.00           C  
ATOM    134  O   GLY A 322      40.138  68.892  22.372  1.00 50.41           O  
ATOM    135  N   LEU A 323      40.044  66.734  22.968  1.00 41.92           N  
ATOM    136  CA  LEU A 323      40.190  67.033  24.385  1.00 35.58           C  
ATOM    137  C   LEU A 323      41.613  66.738  24.874  1.00 31.41           C  
ATOM    138  O   LEU A 323      41.932  66.921  26.046  1.00 30.47           O  
ATOM    139  CB  LEU A 323      39.160  66.240  25.191  1.00 35.76           C  
ATOM    140  CG  LEU A 323      37.716  66.576  24.802  1.00 39.50           C  
ATOM    141  CD1 LEU A 323      36.733  65.796  25.670  1.00 38.15           C  
ATOM    142  CD2 LEU A 323      37.493  68.074  24.955  1.00 38.58           C


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bpederse at gmail.com  Wed Jun  9 00:33:12 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Tue, 8 Jun 2010 21:33:12 -0700
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
Message-ID: <AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>

On Tue, Jun 8, 2010 at 9:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Jun 8, 2010 at 4:47 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>>
>> my results may not be typical either, but using an earlier version of
>> peter's sqlite biopython branch and comparing to screed
>> (http://github.com/acr/screed), and my file-index
>> (http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i
>> found that biopython's implementation is at most, a bit more than 2x
>> slower. and it does the fastq parsing much more rigorously.
>>
>> also, i didn't see much difference between berkeleydb and
>> tokyocabinet--though the ctypes-based TC wrapper i was using has since
>> been streamlined.
>> here's what i saw for 15+ million records with this script:
>> http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py
>>
>> /opt/src/methylcode/data/s_1_sequence.txt
>> benchmarking fastq file with 15646356 records (62585424 lines)
>> performing 500000 random queries
>>
>> screed
>> ------
>> create: 704.764
>> search: 51.717
>>
>> biopython-sqlite
>> ----------------
>> create: 727.868
>> search: 92.947
>>
>> fileindex
>> ---------
>> create: 294.356
>> search: 53.701
>
> Are you using a recent version of screed (with SQLite internally)?
>
> Which back end are your "fileindex" numbers for? BDB?
>
> I'd say that the slow "search" from (the old branch of) Biopython is
> down to our FASTQ parsing time, which includes lots of object
> creation. The get_raw method can be useful here depending on
> what you want to achieve:
> http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/
>
> The version you tried didn't do anything clever with the SQLite
> indexes, batched inserts etc. I'm hoping the current code will be
> faster (although there is likely a penalty from having two switchable
> back ends). Brent, could you re-run this benchmark with this code:
> http://github.com/peterjc/biopython/tree/index-sqlite-batched
>
> You'll need to change the Biopython call in your test script from
> this (it was renamed before landing on the trunk):
>
> fi = SeqIO.indexed_dict(f, idx, "fastq")
>
> to this:
>
> fi = SeqIO.index(f, idx, "fastq", db=True)
>
> or give an explicit filename:
>
> fi = SeqIO.index(f, idx, "fastq", db="/tmp/filename.idx")
>
> where db is the new parameter for controlling where and if
> the lookup table is stored on disk.
>
> Peter
>

done. the previous times and the current were using py-tcdb not bsddb.
the author of tcdb made some improvements so it's faster this time,
and your SeqIO implementation is almost 2x as fast to load as the
previous one. that's a nice implementation. i didn't try get_raw.

these timings are with your latest version, and the version of
screed pulled from http://github.com/acr/screed master today.

/opt/src/methylcode/data/s_1_sequence.txt
benchmarking fastq file with 15646356 records (62585424 lines)
performing 500000 random queries

screed
------
create: 699.210
search: 51.043

biopython-sqlite
----------------
create: 386.647
search: 93.391

fileindex
---------
create: 184.088
search: 48.887
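
For reference, the create-time speed-ups between the two runs can be worked out from the numbers reported in this thread. This is plain arithmetic on those figures; the variable names are only illustrative.

```python
# Index creation times in seconds, copied from the two benchmark runs above.
first_run = {"screed": 704.764, "biopython-sqlite": 727.868, "fileindex": 294.356}
second_run = {"screed": 699.210, "biopython-sqlite": 386.647, "fileindex": 184.088}

# Ratio > 1 means the second run was faster to build its index.
create_speedup = {
    name: first_run[name] / second_run[name] for name in first_run
}

for name in sorted(create_speedup):
    print("%-17s create speed-up: %.2fx" % (name, create_speedup[name]))
```

The Biopython figure comes out at roughly 1.9x, consistent with "almost 2x" above, while screed is essentially unchanged.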

From bugzilla-daemon at portal.open-bio.org  Wed Jun  9 04:43:02 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Jun 2010 04:43:02 -0400
Subject: [Biopython-dev] [Bug 3096] PPBuilder build_peptides bugs
In-Reply-To: <bug-3096-42@http.bugzilla.open-bio.org/>
Message-ID: <201006090843.o598h2tx024780@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3096


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-09 04:43 EST -------
(In reply to comment #0)
> Given a chain of backbone connected residues 'IXRGXTGL' that contains two
> non-standard amino acids 'X' in between, building peptide with only standard
> amino acid builder should return two peptides 'RG' and 'TGL'. 'I' should not
> be returned as a peptide since it is just one residue. Currently biopython
> would return 'IXGXGL', with two bugs in between:

What is wrong with returning 'IXGXGL'? The PDB contains a peptide of six
linked residues doesn't it? It looks like Bio.PDB is doing something sensible.

P.S. You didn't fill in which version of Biopython you are using.



From biopython at maubp.freeserve.co.uk  Wed Jun  9 04:55:37 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Jun 2010 09:55:37 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
	<AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
Message-ID: <AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>

On Wed, Jun 9, 2010 at 5:33 AM, Brent Pedersen <bpederse at gmail.com> wrote:
>>
>> The version you tried didn't do anything clever with the SQLite
>> indexes, batched inserts etc. I'm hoping the current code will be
>> faster (although there is likely a penalty from having two switchable
>> back ends). Brent, could you re-run this benchmark with this code:
>> http://github.com/peterjc/biopython/tree/index-sqlite-batched
>> ...
>
> done.

Thank you Brent :)

> the previous times and the current were using py-tcdb not bsddb.
> the author of tcdb made some improvements so it's faster this time,

OK, so you are using Tokyo Cabinet to store the lookup table here
rather than BDB. Link, http://code.google.com/p/py-tcdb/

> and your SeqIO implementation is almost 2x as fast to load as the
> previous one. that's a nice implementation. i didn't try get_raw.

I've got some more re-factoring in mind which should help a little
more (but mainly to make the structure clearer).

> these timings are with your latest version, and the version of
> screed pulled from http://github.com/acr/screed master today.

Having had a quick look, they are using SQLite3 in much the
same way as I was initially. They create the index before loading
(rather than after loading) and they use a single insert per
offset (rather than using a batch in a transaction or the
executemany method). I'm pretty sure from my experiments
those changes would speed up screed's loading time a lot
(probably in line with the speed up I achieved).
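
A minimal sketch of the two tweaks described above (batched inserts via
executemany inside a transaction, and building the index only after
loading), using the standard library sqlite3 module. The table and column
names and the build_offset_index helper are made up for this example; this
is not the schema used by screed or by the index-sqlite-batched branch.

```python
import sqlite3


def build_offset_index(db_path, offsets, batch_size=10000):
    """Load (record_id, file_offset) pairs in batches, then index them."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE offsets (record_id TEXT, file_offset INTEGER)")
    insert_sql = "INSERT INTO offsets (record_id, file_offset) VALUES (?, ?)"
    batch = []
    for record_id, file_offset in offsets:
        batch.append((record_id, file_offset))
        if len(batch) >= batch_size:
            # One executemany call and one commit per batch, rather than
            # one insert (and one commit) per offset.
            con.executemany(insert_sql, batch)
            con.commit()
            batch = []
    if batch:
        con.executemany(insert_sql, batch)
        con.commit()
    # Create the index once, after all the rows have been loaded.
    con.execute("CREATE UNIQUE INDEX record_id_index ON offsets (record_id)")
    con.commit()
    return con


# Toy usage with an in-memory database and made-up offsets.
pairs = (("read%i" % i, i * 100) for i in range(25))
con = build_offset_index(":memory:", pairs, batch_size=10)
row = con.execute("SELECT file_offset FROM offsets WHERE record_id = ?",
                  ("read7",)).fetchone()
print(row[0])
```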

> /opt/src/methylcode/data/s_1_sequence.txt
> benchmarking fastq file with 15646356 records (62585424 lines)
> performing 500000 random queries
>
> screed
> ------
> create: 699.210
> search: 51.043
>
> biopython-sqlite
> ----------------
> create: 386.647
> search: 93.391
>
> fileindex
> ---------
> create: 184.088
> search: 48.887

That's got us looking more competitive. As noted above, I think
screed's loading time could be much reduced by tweaking how
they use SQLite3. I wonder what the breakdown for fileindex is
between calling Tokyo Cabinet and the fileindex code itself?
I guess we should try Tokyo Cabinet as the back end in Bio.SeqIO.index()
for comparison.

Peter

P.S. Could you measure the database file sizes on disk?

From thomas.hamelryck at gmail.com  Wed Jun  9 08:18:41 2010
From: thomas.hamelryck at gmail.com (Thomas Hamelryck)
Date: Wed, 9 Jun 2010 14:18:41 +0200
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com>
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com>
	<AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com>
	<AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com>
Message-ID: <AANLkTilC9Jqf4Kl0QOIYkGWl309F-kDpnbWHASyRE1T5@mail.gmail.com>

Hi,

On Tue, Jun 8, 2010 at 7:10 PM, João Rodrigues <anaryin at gmail.com> wrote:

>
> from Bio.PDB import Protein
> structure = Protein('1ABC.pdb')
> structure.search_ss_bonds()
>

Indeed, that would run into problems for complexes where proteins, RNA, DNA,
etc. occur in the same file. It makes much more sense to have a Structure
centred approach:

proteins=Protein(structure)
chains=proteins.get_chains()
chain_a=chains["A"]
polypeptides=chain_a.get_peptides()

rnas=RNA(structure)

etc.

-Thomas

-- 
Thomas Hamelryck, Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://wiki.binf.ku.dk/User:Thomas_Hamelryck
http://www.binf.ku.dk/research/structural_bioinformatics/


From lgautier at gmail.com  Wed Jun  9 08:28:20 2010
From: lgautier at gmail.com (Laurent)
Date: Wed, 09 Jun 2010 14:28:20 +0200
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <mailman.2432.1275979197.3120.biopython-dev@lists.open-bio.org>
References: <mailman.2432.1275979197.3120.biopython-dev@lists.open-bio.org>
Message-ID: <4C0F88E4.7070607@gmail.com>

What about having a class instance instead ? This would let one change 
the index storage system very easily.

For example, to use a dictionary:

Bio.SeqIO.index(keyval_map = dict() )

A minimal requirement for the instance 'keyval_map' passed would be to 
implement the methods __getitem__(self, key) and __setitem__(self, key, 
value), allowing the "duck typing" approach commonly found in Python.

An SQLite-based index would be a matter of having a class such as:

class KeyValSQLite(object):
   def __init__(self, filename):
       # create the database into file "filename"
       pass

   def __getitem__(self, key):
       """ return the value """
       # select whatever in something where key='<key>'...
       pass

   def __setitem__(self, key, value):
       # update...
       pass


Then this would be a call like:

Bio.SeqIO.index(keyval_map = KeyValSQLite("myindex.db"))


Now that you have the idea, getting a custom index based on BDB or 
anything should be a breeze...
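
To make the idea concrete, here is one way the skeleton above could be
filled in with the standard library sqlite3 module. It is only a sketch:
the table layout is invented, and build_index is a hypothetical stand-in
for whatever Bio.SeqIO.index() would do with the proposed keyval_map
argument (which is a suggestion, not an existing option).

```python
import sqlite3


class KeyValSQLite(object):
    """Store key -> integer value pairs in an SQLite table."""

    def __init__(self, filename):
        # Create (or reopen) the database in file "filename".
        self._con = sqlite3.connect(filename)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS keyval (k TEXT PRIMARY KEY, v INTEGER)")

    def __getitem__(self, key):
        row = self._con.execute(
            "SELECT v FROM keyval WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        self._con.execute(
            "INSERT OR REPLACE INTO keyval (k, v) VALUES (?, ?)", (key, value))
        # Committing on every assignment is simple but slow for big files.
        self._con.commit()


def build_index(records, keyval_map):
    """Hypothetical indexer: it only needs __setitem__ on keyval_map."""
    for record_id, offset in records:
        keyval_map[record_id] = offset
    return keyval_map


records = [("seq1", 0), ("seq2", 152), ("seq3", 340)]
in_memory = build_index(records, dict())
on_disk = build_index(records, KeyValSQLite(":memory:"))
print(in_memory["seq2"], on_disk["seq2"])
```

The same build_index call works with either back end, which is the point
of the duck typing approach.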


L.

On 08/06/10 08:39, biopython-dev-request at lists.open-bio.org wrote:
> Hi all,
>
> Thanks for the lively discussion on the main list,
>
> http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html
> ...
> http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html
>
> I've spent the afternoon updating my old branch which uses SQLite
> to store the record identifier to file offset mapping. Using the code
> on this branch, Bio.SeqIO.index() supports a new optional argument
> currently called "db" (other names I like including "cache", suggestions
> welcome):
>
> http://github.com/peterjc/biopython/tree/index-sqlite
>
> The default (False) is not to use SQLite, but continue with an in
> memory Python dictionary. As long as you have enough RAM
> and don't plan to use the index at a later date, this will be fastest.
>
> If set to True or a filename, then an SQLite index is used to hold
> the offsets. This means very low RAM requirements, but is a lot
> slower because the offsets are written to disk and the SQLite
> index is updated as we go. I expect this part can be optimised
> (e.g. try to build the index at the end, try committing in batches).
>
> I'm still testing this, but the core of the work is done I think.
> Once we're happy with the public API, we can concentrate
> on things like the SQLite schema, and optimising the code.
>
> Peter
>
> P.S. I know it will need a little work to fail gracefully on Python 2.4
> when SQLite isn't installed.
>


From biopython at maubp.freeserve.co.uk  Wed Jun  9 08:53:39 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Jun 2010 13:53:39 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <4C0F88E4.7070607@gmail.com>
References: <mailman.2432.1275979197.3120.biopython-dev@lists.open-bio.org>
	<4C0F88E4.7070607@gmail.com>
Message-ID: <AANLkTilZPP2928RbnzTl7c-CcyWlJ8QBThG1XOuiJ8ZX@mail.gmail.com>

On Wed, Jun 9, 2010 at 1:28 PM, Laurent <lgautier at gmail.com> wrote:
> What about having a class instance instead ? This would let one change the
> index storage system very easily.

That is essentially what the recent code on my branch is doing, but
the back end isn't being exposed to the public API (yet).

> Then this would be a call like:
>
> Bio.SeqIO.index(keyval_map = KeyValSQLite("myindex.db"))
>
>
> Now that you have the idea, getting a custom index based on BDB or
> anything should be a breeze...

Indeed. Most DB like back ends should offer a bulk loader we can
exploit via the dict's update method.
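
As a rough illustration of that point, a dict-like back end can expose an
update method that hands a whole batch to the database in one go. The
class below is a sketch with invented names (BulkSQLiteMap and its table),
not the code on the branch.

```python
import sqlite3


class BulkSQLiteMap(object):
    """Dict-like SQLite store whose update() loads many items at once."""

    def __init__(self, filename):
        self._con = sqlite3.connect(filename)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS keyval (k TEXT PRIMARY KEY, v INTEGER)")

    def update(self, items):
        """Accept a mapping or an iterable of (key, value) pairs."""
        if hasattr(items, "items"):
            items = items.items()
        # The connection as a context manager wraps the whole batch in a
        # single transaction and commits it at the end.
        with self._con:
            self._con.executemany(
                "INSERT OR REPLACE INTO keyval (k, v) VALUES (?, ?)", items)

    def __getitem__(self, key):
        row = self._con.execute(
            "SELECT v FROM keyval WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __len__(self):
        return self._con.execute("SELECT COUNT(*) FROM keyval").fetchone()[0]


offsets = BulkSQLiteMap(":memory:")
offsets.update({"seq1": 0, "seq2": 152})
offsets.update(("read%i" % i, i * 100) for i in range(1000))
print(len(offsets), offsets["read42"])
```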

Peter

From eric.talevich at gmail.com  Wed Jun  9 09:31:18 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 9 Jun 2010 09:31:18 -0400
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com> 
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com> 
	<AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com> 
	<AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com>
Message-ID: <AANLkTikldxobdl9u2B2NDlYe2qpe74H1exiZxmGMcEnY@mail.gmail.com>

On Tue, Jun 8, 2010 at 1:10 PM, João Rodrigues <anaryin at gmail.com> wrote:

> Hello all,
>
> I'm replying here to what Thomas wrote on the GSOC Report thread because it
> seems a better place.
>
> PDB files can contain anything RNA, DNA, sugars, small molecules... It is
>> thus not a good idea to
>> directly associate protein-specific methods to the structure class; it
>> will lead to a bloated Structure class and a lot of irrelevant methods (ie.
>> search_ss_bonds is meaningless for a PDB file that contains RNA).
>
>
> Agree.
>
> Currently, one creates Polypeptide objects from a Structure object using a
>> factory design pattern (via PPBuilder); the Polypeptide class implements
>> some protein specific methods. I believe that is a much cleaner way to do it
>> (though we need a Protein class that represents collections of connected
>> polypeptides). One can also make sure that all such derived objects
>> (Protein, NA, DNA,...) adhere to the same interface by providing a suitable
>> base class with shared functionality - in that way, the whole thing is also
>> extendible.
>>
>
> I think there has been already some discussion about this. My personal
> opinion/suggestion is having a structure like:
>
> Bio.PDB/
> _______/Protein.py
> _______/DNA.py
> _______/RNA.py
>
> that would translate to a usage of something like:
>
> from Bio.PDB import Protein
> structure = Protein('1ABC.pdb')
> structure.search_ss_bonds()
>
> but not
>
> structure.calc_melting_temperature() (just an example)
>

How about:

from Bio import Struct

# extract the protein from a bound TF structure
complex = Struct.read("3IKT.pdb")
prot = complex.as_protein()

# which is a wrapper for:
from Bio.Struct.Protein import Protein
# if Protein contains a Structure instance:
prot = Protein(complex)
# or, if Protein inherits from Structure:
prot = Protein.from_structure(complex)


The Bio.Struct.Protein module would mostly wrap Bio.PDB's protein-specific
functionality, and contain a class called Protein which you construct using
a Bio.PDB.Structure.Structure instance, in some way.

I think the convenience methods as_protein, as_dna and as_rna are acceptable
additions to the Structure class if that saves us from (a) polluting
Structure with protein- and RNA-specific methods, or (b) requiring a slew of
imports to reach any new functionality. You can add as_protein yourself and
leave the other methods for other brave souls to implement. (Bio.Struct.RNA
deserves its own directory, and I don't know of anyone working on a
structural DNA branch.)
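
A toy sketch of the "Protein contains a Structure instance" variant, to
show how small the convenience method would be. Everything here is a
stand-in: this Structure class just maps chain IDs to residue names and is
not Bio.PDB.Structure.Structure, and get_chains is an assumed method name.

```python
AMINO_ACIDS = frozenset([
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
])


class Structure(object):
    """Toy structure: a dict of chain ID -> list of residue names."""

    def __init__(self, chains):
        self.chains = chains

    def as_protein(self):
        # Convenience wrapper, so users need no extra import.
        return Protein(self)


class Protein(object):
    """Protein-specific view which wraps (contains) a Structure."""

    def __init__(self, structure):
        self.structure = structure

    def get_chains(self):
        """Return only the chains made up entirely of amino acids."""
        return dict(
            (chain_id, residues)
            for chain_id, residues in self.structure.chains.items()
            if residues and all(name in AMINO_ACIDS for name in residues))


# A made-up protein/DNA complex: chain A is protein, chain B is DNA.
tf_complex = Structure({"A": ["MET", "CYS", "GLY"], "B": ["DA", "DT", "DG"]})
prot = tf_complex.as_protein()
print(sorted(prot.get_chains()))
```

Protein-specific methods such as search_ss_bonds would then live on the
Protein class, leaving Structure itself free of them.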


> Protein() would call PDBParser(). It could also include, to a certain
> extent, an Alphabet-like feature to assure residue names are OK (this goes a
> bit with this proposal<http://www.biopython.org/wiki/GSOC2010_Joao#Residue_name_normalisation>).
> I believe this goes a bit into what you said. Having a class that basically
> abstracts what we do now (Bio.PDB.PDBParser) and allows for
> molecule-specific methods. However, it also leads to some problems:
> Protein/DNA complexes come to mind.
>
> How does this sound? I think it goes with what Eric said in the first post
> of this thread and what Thomas replied in the GSOC thread. We should also
> change the PDB name to Struct to better reflect the purpose of the module.
> All of the other additions like Bio.Struct.WWW would still apply. And I
> don't see a major problem in breaking the existing code by adding this.
>

To be clear, we don't need to rename anything -- Bio.Struct and Bio.PDB can
live in harmony for the foreseeable future.

Best,
Eric


From bpederse at gmail.com  Wed Jun  9 10:42:29 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Wed, 9 Jun 2010 07:42:29 -0700
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
	<AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
	<AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
Message-ID: <AANLkTimmo4rIJYBZ5zCV-fXpLkrqsUNs-VBvOOsLpk-a@mail.gmail.com>

On Wed, Jun 9, 2010 at 1:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 9, 2010 at 5:33 AM, Brent Pedersen <bpederse at gmail.com> wrote:
>>>
>>> The version you tried didn't do anything clever with the SQLite
>>> indexes, batched inserts etc. I'm hoping the current code will be
>>> faster (although there is likely a penalty from having two switchable
>>> back ends). Brent, could you re-run this benchmark with this code:
>>> http://github.com/peterjc/biopython/tree/index-sqlite-batched
>>> ...
>>
>> done.
>
> Thank you Brent :)
>
>> the previous times and the current were using py-tcdb not bsddb.
>> the author of tcdb made some improvements so it's faster this time,
>
> OK, so you are using Tokyo Cabinet to store the lookup table here
> rather than BDB. Link, http://code.google.com/p/py-tcdb/
>
>> and your SeqIO implementation is almost 2x as fast to load as the
>> previous one. that's a nice implementation. i didn't try get_raw.
>
> I've got some more re-factoring in mind which should help a little
> more (but mainly to make the structure clearer).
>
>> these timings are with your latest version, and the version of
>> screed pulled from http://github.com/acr/screed master today.
>
> Having had a quick look, they are using SQLite3 in much the
> same way as I was initially. They create the index before loading
> (rather than after loading) and they use a single insert per
> offset (rather than using a batch in a transaction or the
> executemany method). I'm pretty sure from my experiments
> those changes would speed up screed's loading time a lot
> (probably in line with the speed up I achieved).
>
>> /opt/src/methylcode/data/s_1_sequence.txt
>> benchmarking fastq file with 15646356 records (62585424 lines)
>> performing 500000 random queries
>>
>> screed
>> ------
>> create: 699.210
>> search: 51.043
>>
>> biopython-sqlite
>> ----------------
>> create: 386.647
>> search: 93.391
>>
>> fileindex
>> ---------
>> create: 184.088
>> search: 48.887
>
> That's got us looking more competitive. As noted above, I think
> screed's loading time could be much reduced by tweaking how
> they use SQLite3. I wonder what the breakdown for fileindex is
> between calling Tokyo Cabinet and the fileindex code itself?
> I guess we should try Tokyo Cabinet as the back end in Bio.SeqIO.index()
> for comparison.
>
> Peter
>
> P.S. Could you measure the database file sizes on disk?
>

for raw reads, screed, fileindex(tcdb), biopython respectively:
-rw-r--r-T 1 brentp users  3.3G 2009-11-17 13:32 /opt/src/methylcode/data/s_1_sequence.txt
-rw-r--r-- 1 brentp brentp 3.8G 2010-06-08 16:09 /opt/src/methylcode/data/s_1_sequence.txt_screed
-rw-r--r-- 1 brentp brentp 1.2G 2010-06-08 16:21 /opt/src/methylcode/data/s_1_sequence.txt.fidx
-rw-r--r-- 1 brentp brentp 1.5G 2010-06-08 21:15 /opt/src/methylcode/data/s_1_sequence.txt.bidx

that's not using any compression for the fileindex.
i think the overhead of the fileindex code + tcdb code is pretty low
now. i think there'd only be improvement
using a cython or c version of a TC wrapper--and even then, not much.

-brentp

From biopython at maubp.freeserve.co.uk  Wed Jun  9 10:55:23 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Jun 2010 15:55:23 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
	<AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
	<AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
Message-ID: <AANLkTinD2RBMlJF3AF_eOAahPN7Bn3hWyg--GhWCnk8y@mail.gmail.com>

On Wed, Jun 9, 2010 at 9:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Having had a quick look, they are using SQLite3 in much the
> same way as I was initially. They create the index before loading
> (rather than after loading) and they use a single insert per
> offset (rather than using a batch in a transaction or the
> executemany method). I'm pretty sure from my experiments
> those changes would speed up screed's loading time a lot
> (probably in line with the speed up I achieved).
>

Do you fancy trying this version of screed? It seems much
faster on medium sized FASTQ files:-

http://github.com/peterjc/screed/tree/sqlite-tweaks

I'm still running a few tests myself, but will pass this on to
the screed team unless I find some regressions.

Peter

From bpederse at gmail.com  Wed Jun  9 11:56:27 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Wed, 9 Jun 2010 08:56:27 -0700
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTinD2RBMlJF3AF_eOAahPN7Bn3hWyg--GhWCnk8y@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
	<AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
	<AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
	<AANLkTinD2RBMlJF3AF_eOAahPN7Bn3hWyg--GhWCnk8y@mail.gmail.com>
Message-ID: <AANLkTilokHoei_oEHVsFlq_2o3X1ZZdeH75G6oPo5XzU@mail.gmail.com>

On Wed, Jun 9, 2010 at 7:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 9, 2010 at 9:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>
>> Having had a quick look, they are using SQLite3 in much the
>> same way as I was initially. They create the index before loading
>> (rather than after loading) and they use a single insert per
>> offset (rather than using a batch in a transaction or the
>> executemany method). I'm pretty sure from my experiments
>> those changes would speed up screed's loading time a lot
>> (probably in line with the speed up I achieved).
>>
>
> Do you fancy trying this version of screed? It seems much
> faster on medium sized FASTQ files:-
>
> http://github.com/peterjc/screed/tree/sqlite-tweaks
>
> I'm still running a few tests myself, but will pass this on to
> the screed team unless I find some regressions.
>
> Peter
>

not too much difference.

screed
------
create: 666.381
search: 51.839

From biopython at maubp.freeserve.co.uk  Wed Jun  9 12:19:24 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Jun 2010 17:19:24 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTilokHoei_oEHVsFlq_2o3X1ZZdeH75G6oPo5XzU@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
	<AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
	<AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
	<AANLkTinD2RBMlJF3AF_eOAahPN7Bn3hWyg--GhWCnk8y@mail.gmail.com>
	<AANLkTilokHoei_oEHVsFlq_2o3X1ZZdeH75G6oPo5XzU@mail.gmail.com>
Message-ID: <AANLkTikYjpuhFAN5fjHXVtnvKO0AH-bYNVNv5GOx-W81@mail.gmail.com>

On Wed, Jun 9, 2010 at 4:56 PM, Brent Pedersen <bpederse at gmail.com> wrote:
> On Wed, Jun 9, 2010 at 7:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Wed, Jun 9, 2010 at 9:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>
>> Do you fancy trying this version of screed? It seems much
>> faster on medium sized FASTQ files:-
>>
>> http://github.com/peterjc/screed/tree/sqlite-tweaks
>>
>> I'm still running a few tests myself, but will pass this on to
>> the screed team unless I find some regressions.
>>
>> Peter
>>
>
> not too much difference.
>
> screed
> ------
> create: 666.381
> search: 51.839

Still noticeable, but not quite as much of a speed up as I was
seeing (but different example, different OS, etc). Anyway, I've
sent them a "pull request" and they can merge it if they like.

Peter

From rodrigo_faccioli at uol.com.br  Wed Jun  9 13:35:24 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Wed, 9 Jun 2010 14:35:24 -0300
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To: <AANLkTiluRrKJ9AhHHIVwUSm0zXqpyQqn-TIVQUZHkBBF@mail.gmail.com>
References: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
	<AANLkTiluRrKJ9AhHHIVwUSm0zXqpyQqn-TIVQUZHkBBF@mail.gmail.com>
Message-ID: <AANLkTimbkbFYM-1M9-EMO4xyRNDbAbdxEnt68NGUlmv_@mail.gmail.com>

Regarding your GitHub problem, you could try the command below after
removing your local branch.

git push git at github.com:<my_account>/<my_repository>.git :heads/<mybranch>

I found the command above in [1].

[1]
http://originblog.wordpress.com/2008/04/28/github-tips-removing-a-remote-branch/

Thanks in advance,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5


On Tue, Jun 8, 2010 at 6:45 PM, Eric Talevich <eric.talevich at gmail.com> wrote:

> On Mon, Jun 7, 2010 at 5:35 AM, Peter <biopython at maubp.freeserve.co.uk
> >wrote:
>
> > Hi all,
> >
> > I thought I'd write down some notes about how I've been using git
> recently.
> > This may be of interest to any of the other core developers (those of us
> > with read-write access to the main repository), and I might get some good
> > tips from any discussion. The key point is that I have read+write access
> > to two repositories on github (the official repository AND my own fork),
> > so there are different advantages/disadvantages about which I choose
> > to work with directly as my main repository.
> >
> > [...]
> >
> > Instead, I have a github repository of my own (what github calls a
> > fork), and I push branches there.
> >
> > http://github.com/biopython/biopython - the official branch(es)
> > http://github.com/peterjc/biopython - my branches
> >
> > How does this work in practice? Like this - I clone the master
> > and add a reference to my repository (and I do the same when I
> > want to grab a branch from another developer):
> >
> > git clone git at github.com:biopython/biopython.git
> > cd biopython
> > git remote add peterjc git at github.com:peterjc/biopython.git
> > git fetch peterjc
> >
> > Then make a new local branch as usual, and when ready to share
> > it publicly, I push it to *my* repository on github:
> >
> > git branch new-work
> > git checkout new-work
> > git commit ...
> > git push peterjc new-work
> >
> > This would then appear as a new-work branch on my github page.
> > Then if I (or someone else) wants to access these branches later
> > (e.g. from another machine) just use the checkout tracked remote
> > branch. For example,
> >
> > git clone git at github.com:biopython/biopython.git
> > cd biopython
> > git remote add peterjc git at github.com:peterjc/biopython.git
> > git fetch peterjc
> > git checkout -t peterjc/seqio-imgt
> >
> > This then looks like a normal branch (called just "seqio-imgt" in
> > this example), but git knows it is linked to the remote branch on
> > the "peterjc" repository (not the origin which is the "official"
> > repository).
> >
>
> This looks reasonable to me. I'd add that the procedure to delete a public
> branch from your personal fork on GitHub is a little obscure:
>
> git branch -a   # list local and remote branches
> git branch -d new-work   # delete a local branch that's been merged already
> git push peterjc :new-work  # delete the public branch from GitHub
>
> This doesn't do what you'd expect:
> git branch -d peterjc/new-work
>
> That only removes your local reference to the public branch; the branch
> is still visible on GitHub.
>
> (It's kind of hard to find in the GitHub documentation.)
>
>
> > I'd have to check, but I guess that if the original git clone is done
> > with git://github.com/biopython/biopython.git instead (read only
> > access) the same procedure could be used by non core devs.
> > However, I'm not sure this is clearer for them. I think the current
> > procedure (on our wiki) where you add a remote reference to
> > the "upstream" official repository works better in this case.
> >
>
> I still have an "upstream" reference to the main repo. I wouldn't want to
> accidentally push something foolish to the main repo with a stray "git
> push"... better to have the safe thing happen by default.
>
> If the initial clone was from biopython master, and you later create a
> personal fork on GitHub, then it's not too hard to switch the references
> around in your local repo to make the public fork your "origin".
>
> -Eric
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From eric.talevich at gmail.com  Wed Jun  9 19:56:35 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 9 Jun 2010 19:56:35 -0400
Subject: [Biopython-dev] Tested Fixup branch for Bio.PDB
In-Reply-To: <df95eaa0e6f3c40d451630cb54332b3c-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWF9QSVlUUAw=-webmailer2@server04.webmailer.hosteurope.de>
References: <df95eaa0e6f3c40d451630cb54332b3c-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWF9QSVlUUAw=-webmailer2@server04.webmailer.hosteurope.de>
Message-ID: <AANLkTinESGFm0Z2m7VAWYgtWXh9wXVYRViDKtPx4rKN6@mail.gmail.com>

On Tue, Jun 8, 2010 at 5:59 AM, Kristian Rother <krother at rubor.de> wrote:

>
> Hi Eric,
>
> I've checked out your pdbfixes branch and ran our 431 Unit Tests of
> ModeRNA with it. There were no changes to the master Bio.PDB branch -->
> for us everything OK.
>
> Details:
> ModeRNA (http://www.genesilico.pl/moderna) engineers RNA 3D structures and
> uses Bio.PDB for most of its operations: reading files,
> adding/copying/manipulating residues/atoms, superimposing structures,
> searching neighbors by KDTree, writing files.
>
> Right, the tests most probably did not depend directly on the code you
> changed, but as I understand it you wanted to make sure the branch didn't break
> anything by accident.
>

Thanks, Kristian! I didn't expect the patches to break anything, but it's
hard to be sure until someone else has tried it.

I've pushed the pdbfixes branch to Biopython's master branch on GitHub.

Cheers,
Eric

From biopython at maubp.freeserve.co.uk  Thu Jun 10 12:24:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 10 Jun 2010 17:24:20 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTimZShDqgJ__BO8sqvkJl7DBsLXS2iz-0ATW0saa@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
	<AANLkTilkILMvG5huqqZ2-rKMSQNqOAQGiirVO3rNBCwt@mail.gmail.com>
	<AANLkTimZShDqgJ__BO8sqvkJl7DBsLXS2iz-0ATW0saa@mail.gmail.com>
Message-ID: <AANLkTinpmWEFkXwHDQ9CeGft4UrzhUMvsJ-QDTRotwLk@mail.gmail.com>

On Wed, Jun 2, 2010 at 12:59 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> With that in mind, as I mentioned yesterday maybe we should just
> update the documentation to suggest using os.system() when you
> just need the return code and there is no stdin to worry about:
>

I've added a basic example to the tutorial now, but the potential
trouble is any output from the called tool will spew out at the
python prompt (if working at the terminal). This may or may not
be an issue. ClustalW for example is rather verbose.
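
If the chatter is a problem, one option is to capture the tool's output
with the subprocess module instead of letting it reach the prompt. A small
sketch follows; since ClustalW may not be installed, a tiny Python child
process stands in for the verbose tool, and in real use cmd would be the
command line built by the application wrapper.

```python
import subprocess
import sys

# Stand-in for a verbose external tool such as ClustalW.
cmd = [sys.executable, "-c", "print('verbose progress output')"]

# Pipe stdout and stderr so nothing is printed at the Python prompt.
child = subprocess.Popen(cmd,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         universal_newlines=True)
stdout_text, stderr_text = child.communicate()
return_code = child.returncode

if return_code != 0:
    print("Tool failed with return code %i" % return_code)
```

Here communicate() waits for the child to finish and returns its output as
strings, so the return code is still available afterwards.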

Peter

From bugzilla-daemon at portal.open-bio.org  Thu Jun 10 14:18:41 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Jun 2010 14:18:41 -0400
Subject: [Biopython-dev] [Bug 3098] New: GenBank/EMBL parser breaks for
	between features at origin
Message-ID: <bug-3098-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3098

           Summary: GenBank/EMBL parser breaks for between features at
                    origin
           Product: Biopython
           Version: 1.54
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


I was testing Bio.SeqIO with a GenBank file gbpln1.seq which includes:

LOCUS       AB042240              134545 bp    DNA     circular PLN 02-MAY-2006
...
     misc_feature    134545^1
                     /standard_name="JLA"
                     /note="Junction IRA-LSC"
ORIGIN 
...

This is a "between" feature of length zero at the origin of this circular
genome. Normally between positions "start^end" have end=start+1 (using one
based counting), so this wrap-around location is a special case which the
parser does not allow for.
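
The rule can be sketched as follows. This is an illustration only, not the
actual Bio.GenBank code: parse_between and its arguments are hypothetical.

```python
def parse_between(location, seq_length=None, circular=False):
    """Parse a one-based "start^end" between location.

    Normally end must equal start + 1. On a circular sequence the
    position between the last base and the first, "length^1", is
    accepted too.
    """
    start_text, end_text = location.split("^")
    start, end = int(start_text), int(end_text)
    if end == start + 1:
        return start, end
    if circular and seq_length is not None \
            and start == seq_length and end == 1:
        return start, end
    raise ValueError("Invalid between location %r" % location)


print(parse_between("122^123"))
print(parse_between("134545^1", seq_length=134545, circular=True))
```

Without the circular flag, "134545^1" would be rejected with a ValueError,
which is roughly the situation described in this report.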

The same applies to EMBL files as well, e.g.
http://www.ebi.ac.uk/cgi-bin/expasyfetch?AB042240



From bugzilla-daemon at portal.open-bio.org  Thu Jun 10 14:35:48 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Jun 2010 14:35:48 -0400
Subject: [Biopython-dev] [Bug 3098] GenBank/EMBL parser breaks for between
	features at origin
In-Reply-To: <bug-3098-42@http.bugzilla.open-bio.org/>
Message-ID: <201006101835.o5AIZm0b025094@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3098


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-10 14:35 EST -------
Fixed,
http://github.com/biopython/biopython/commit/80aa43e5434316d151bca5916442a3429b8724e2



From eric.talevich at gmail.com  Thu Jun 10 15:18:38 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 10 Jun 2010 15:18:38 -0400
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTimZShDqgJ__BO8sqvkJl7DBsLXS2iz-0ATW0saa@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com> 
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com> 
	<AANLkTilkILMvG5huqqZ2-rKMSQNqOAQGiirVO3rNBCwt@mail.gmail.com> 
	<AANLkTimZShDqgJ__BO8sqvkJl7DBsLXS2iz-0ATW0saa@mail.gmail.com>
Message-ID: <AANLkTimKXGT6Of22aw0b1_2EEVPdXSzwSLhMuM9d4le1@mail.gmail.com>

On Wed, Jun 2, 2010 at 7:59 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

>
> Even if the Python documentation seems to be discouraging it,
> using os.system() seems simple, robust, and cross platform. We
> could even update the tutorial now and post it online - it should
> make some people's lives a little easier.
>

The Python docs claim os.system(cmd) is equivalent to subprocess.call(cmd,
shell=True):
http://docs.python.org/library/subprocess.html#replacing-os-system

As I understood it, the reason for usually skipping the shell on Unix
systems was for additional security -- the called program sees the same
thing either way.

Should we use this as a "teachable moment" involving the subprocess module
in the tutorial?
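
For comparison, a small sketch of the two styles using only the standard
library. The running Python interpreter stands in for an external tool
here (my assumption, so the example does not depend on any particular
program being installed); a real application wrapper would supply its own
command line.

```python
import subprocess
import sys

# Stand-in for an external command: the current Python interpreter.
args = [sys.executable, "-c", "import sys; sys.exit(0)"]

# Without a shell: the argument list is handed to the program directly,
# so no shell quoting or escaping is needed.
return_code = subprocess.call(args)

# With a shell, as os.system() does: one command string, interpreted by
# the shell, so quoting is the caller's responsibility.
command = '"%s" -c "import sys; sys.exit(0)"' % sys.executable
shell_return_code = subprocess.call(command, shell=True)
```

Both calls return the exit status of the child process.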

-Eric

From anaryin at gmail.com  Thu Jun 10 19:45:02 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 10 Jun 2010 18:45:02 -0500
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTikldxobdl9u2B2NDlYe2qpe74H1exiZxmGMcEnY@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com> 
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com> 
	<AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com> 
	<AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com> 
	<AANLkTikldxobdl9u2B2NDlYe2qpe74H1exiZxmGMcEnY@mail.gmail.com>
Message-ID: <AANLkTinYFfirod6vG2PHEQ2asMz7IfVQFw9spuH_N4E9@mail.gmail.com>

Hello all,

I'm having some issues dealing with this :x

I created a module Bio.Struct that has the following contents:

__init__.py
Protein.py
WWW/

The __init__.py file has a read() method that calls PDBParser and returns a
Structure object. So far so good I think. Then I added a method to
Bio.PDB.Structure more or less like this:

    def as_protein(self):
        from Bio.Struct.Protein import Protein
        prot = Protein(self)
        return prot

so when you call it you get a new object. Protein is a class that inherits
from Structure and that has the search_ss_bonds function.

I can make the new object get all the methods from Structure AND from
Protein, but when I try to execute search_ss_bonds, it fails because
child_list, a Structure method, comes empty.. In fact, the whole SMCRA
object comes empty..

How do I effectively do the inheritance on the Protein class?

from Bio.PDB.Structure import Structure

class Protein(Structure):

    def __init__(self, protein):

        self = protein

This is what I last tried and doesn't work.. I've tried Structure.__init__,
and several other things but to no avail. I'm sure this is simple OOP but I
really can't understand that well how to do it ...

Care to give a hand to a friend in need? :)

Thanks in advance! By the way, I assume that if I got no comments on
anything else on the GSOC thread that I'm doing a perfect job :P Thanks for
that too :D

Best!

João [...] Rodrigues


From eric.talevich at gmail.com  Thu Jun 10 21:49:39 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 10 Jun 2010 21:49:39 -0400
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTinYFfirod6vG2PHEQ2asMz7IfVQFw9spuH_N4E9@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com> 
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com> 
	<AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com> 
	<AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com> 
	<AANLkTikldxobdl9u2B2NDlYe2qpe74H1exiZxmGMcEnY@mail.gmail.com> 
	<AANLkTinYFfirod6vG2PHEQ2asMz7IfVQFw9spuH_N4E9@mail.gmail.com>
Message-ID: <AANLkTilZh3xH7nsZuGf8HgV2uTzv8eM6Omm4YmSfcQ4Y@mail.gmail.com>

On Thu, Jun 10, 2010 at 7:45 PM, João Rodrigues <anaryin at gmail.com> wrote:

> Hello all,
>
> I'm having some issues dealing with this :x
>
> I created a module Bio.Struct that has the following contents:
>
> __init__.py
> Protein.py
> WWW/
>
> The __init__.py file has a read() method that calls PDBParser and returns a
> Structure object. So far so good I think. Then I added a method to
> Bio.PDB.Structure more or less like this:
>
>     def as_protein(self):
>
>         from Bio.Struct.Protein import Protein
>         prot = Protein(self)
>         return prot
>
> so when you call it you get a new object. Protein is a class that inherits
> from Structure and that has the search_ss_bonds function.
>
> I can make the new object get all the methods from Structure AND from
> Protein, but when I try to execute search_ss_bonds, it fails because
> child_list, a Structure method, comes empty.. In fact, the whole SMCRA
> object comes empty..
>
> How do I effectively do the inheritance on the Protein class?
>
> from Bio.PDB.Structure import Structure
>
> class Protein(Structure):
>
>     def __init__(self, protein):
>
>         self = protein
>
> This is what I last tried and doesn't work.. I've tried Structure.__init__,
> and several other things but to no avail. I'm sure this is simple OOP but I
> really can't understand that well how to do it ...
>
> Care to give a hand to a friend in need? :)
>
> Thanks in advance! By the way, I assume that if I got no comments on
> anything else on the GSOC thread that I'm doing a perfect job :P Thanks for
> that too :D
>
> Best!
>
> João [...] Rodrigues
>

Hi João,

You have it mostly correct, but you need to call the parent class's
constructor, too.

Here's the constructor for Structure:

    def __init__(self, id):
        self.level="S"
        Entity.__init__(self, id)

And here it is for Entity:

    def __init__(self, id):
        self.id=id
        self.full_id=None
        self.parent=None
        self.child_list=[]
        self.child_dict={}
        # Dictionary that keeps additional properties
        self.xtra={}

See the problem? Every subclass of Entity takes an "id" argument and sets
the other attributes separately.

In Bio.Phylo, I used another convention for converting an object of one type
to a sub-class of the original type, as you're doing here. Rather than
change the arguments to the constructor (which could have weird
side-effects), I added a class method in the target class:

@classmethod
def from_structure(cls, struct):
    # Instantiate a Protein with the structure's id
    # Assign the other attributes individually from struct


Then Structure.as_protein() becomes fairly simple. Alternatively, you could
skip implementing Protein.from_structure() and do the attribute reassignment
in as_protein(). Or, covering all the options, implement from_structure()
but not as_protein(), and let the user figure it out.
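
A minimal sketch of the class method approach. The Entity and Structure
classes below are stand-ins that mirror the constructors quoted above, so
the example runs without Biopython; the shallow copies of the containers
are my own choice for illustration, not something the existing code
prescribes.

```python
class Entity(object):
    """Stand-in mirroring the Entity constructor quoted above."""

    def __init__(self, id):
        self.id = id
        self.full_id = None
        self.parent = None
        self.child_list = []
        self.child_dict = {}
        self.xtra = {}


class Structure(Entity):
    """Stand-in for Bio.PDB.Structure.Structure."""

    def __init__(self, id):
        self.level = "S"
        Entity.__init__(self, id)

    def as_protein(self):
        # Thin convenience wrapper; the real work lives in Protein.
        return Protein.from_structure(self)


class Protein(Structure):
    """Structure subclass that keeps the parent's constructor signature."""

    @classmethod
    def from_structure(cls, struct):
        # Instantiate a Protein with the structure's id ...
        protein = cls(struct.id)
        # ... then assign the other attributes individually from struct.
        protein.full_id = struct.full_id
        protein.parent = struct.parent
        # New containers holding the same child objects (shallow copies).
        protein.child_list = list(struct.child_list)
        protein.child_dict = dict(struct.child_dict)
        protein.xtra = dict(struct.xtra)
        return protein
```

Usage would then be prot = struct.as_protein(), or equivalently
prot = Protein.from_structure(struct).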

Do you think it would also be useful if as_protein() or from_structure()
dropped any non-protein molecules during the conversion, and raised an error
if nothing's left? Or would that cause more problems than it solves?

Best,
Eric


From biopython at maubp.freeserve.co.uk  Mon Jun 14 10:44:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 14 Jun 2010 15:44:50 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
Message-ID: <AANLkTik6TlYQER9MOEf56NZi0bNLANe_uJK3qjSKlQVG@mail.gmail.com>

Hi all,

You may recall late last year I posted about adding a reverse
complement method to the SeqRecord, and addition support:
http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html

SeqRecord addition was included in Biopython 1.53, but
not the reverse_complement() method - which is something
I wanted to use again today to reverse complement an
annotated GenBank file and have all the SeqFeature
locations flipped for me. I've rescued my old code and
its unit tests and created a new branch for it:
http://github.com/peterjc/biopython/commits/seqrecord-rc

As I said at the end of last year, I think the general idea of
a SeqRecord reverse_complement() method is nice but the
details about handling the annotation are tricky. When we
discussed slicing and addition, it was agreed that we
should be cautious to avoid blindly transferring annotation
inappropriately. The code on this branch allows the user to
choose for each annotation type if it should be dropped
(False), kept (True) or set to a supplied new value. The
docstring has examples of how this works (which double
as doctests).

Jose - I've CC'd you since I know you wrote your own
SeqRecord subclass with a complement() method (but not
a reverse_complement() method) for Franklin. I'm curious
about this choice.

Cedar - I've CC'd you since you asked about this kind of
thing last year:
http://lists.open-bio.org/pipermail/biopython/2009-June/005307.html

Regards,

Peter

From biopython at maubp.freeserve.co.uk  Mon Jun 14 10:50:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 14 Jun 2010 15:50:31 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTinJE4eBPO8ydgrunBzWZqeN5Nw4nyuo8GEaRTzr@mail.gmail.com>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
Message-ID: <AANLkTikC_fuVP4kDMckW0EqIq0oZWwcHYy-TpbnM2hCl@mail.gmail.com>

On Mon, Jun 14, 2010 at 3:43 PM, Kristian Rother <krother at genesilico.pl> wrote:
>
>
> Hi Peter,
>
> just digesting BioPy mails from last week.
>
>>> Where should the str subclass for secondary structures that the parsers
>>> create go? Could it be Bio.Struct.RNA?
>>
>> You don't think plain strings in the SeqRecord's letter_annotation
>> dict would be enough?
>
> Not really - base pairing makes most normal string functions useless.
>
>
>> Assuming you do need something then
>> perhaps under Bio.Seq or Bio.SeqUtils might be worth considering
>> as alternatives to Bio.Struct.RNA.
>
> OK, I'll try that.
>
> Thanks,
>   Kristian
>
>

Hi Kristian,

Could you explain a little more about why plain strings wouldn't be
suitable here? What kind of things do you want to do with them?

Peter


From krother at rubor.de  Mon Jun 14 10:55:21 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 16:55:21 +0200
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
 enhancements
Message-ID: <1cf21a9224e1cd3ad4c8e2853d99100b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXwtdXg==-webmailer2@server03.webmailer.hosteurope.de>


Hi guys,

I'm fine with your ideas regarding different wrappers for
Bio.PDB.Structure objects discussed last week, in particular:

- creating Bio.Struct.RNA or Bio.PDB.RNA with a Structure instance.
- having a structure.as_rna() helper method as suggested by Eric (but this
is not a must).

I'd like to take what Joao does for proteins and add some basic equivalent
for RNA structures shortly after.

Best Regards,
    Kristian


Quoting Thomas Hamelryck <thomas.hamelryck at gmail.com>:

> Hi,
>
> On Tue, Jun 8, 2010 at 7:10 PM, João Rodrigues <anaryin at gmail.com> wrote:
>
>>
>> from Bio.PDB import Protein
>> structure = Protein('1ABC.pdb')
>> structure.search_ss_bonds()
>>
>
> Indeed, that would run into problems for complexes where proteins, RNA,
> DNA, etc. occur in the same file. It makes much more sense to have a
> Structure
> centred approach:
>
> proteins=Protein(structure)
> chains=proteins.get_chains()
> chain_a=chains["A"]
> polypeptides=chain_a.get_peptides()
>
> rnas=RNA(structure)
>
> etc.
>
> -Thomas


From krother at rubor.de  Mon Jun 14 11:01:48 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 17:01:48 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTikC_fuVP4kDMckW0EqIq0oZWwcHYy-TpbnM2hCl@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTinJE4eBPO8ydgrunBzWZqeN5Nw4nyuo8GEaRTzr@mail.gmail.com>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
	<AANLkTikC_fuVP4kDMckW0EqIq0oZWwcHYy-TpbnM2hCl@mail.gmail.com>
Message-ID: <beb3f08b05db7a4112966711a97b98e0-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXw9fVw==-webmailer2@server03.webmailer.hosteurope.de>


Hi,

much of what I do with RNA secondary structures strongly depends on
iterating base pairs, e.g.

>>> sec = Secstruc("(((...)).)")
>>> for bp in sec.basepairs():
...     print(bp)
(0, 9)
(1, 7)
(2, 6)

also:
>>> sec.get_helices()
>>> sec.get_bulges()
>>> sec.get_hairpins()
>>> sec.contains_pseudoknot()
.. and a couple of similar ones.


The reason why I'd prefer to have something more than a string as a sec
feature is that I wouldn't want to do all the time:

sec = Secstruc(my_seq['secondary_structure'])
sec.get_helices()

but

my_seq['secondary_structure'].get_helices()

instead.
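
To illustrate what such a class could look like, here is a minimal sketch
of a str subclass for dot-bracket notation. It is not the actual Secstruc
implementation: only basepairs() is shown, and the error handling and the
slicing override are assumptions made for the example.

```python
class Secstruc(str):
    """Hypothetical dot-bracket string with a base-pair helper."""

    def basepairs(self):
        """Return sorted (open_index, close_index) pairs, zero-based."""
        stack = []
        pairs = []
        for index, char in enumerate(self):
            if char == "(":
                stack.append(index)
            elif char == ")":
                if not stack:
                    raise ValueError("Unbalanced ')' at position %i" % index)
                pairs.append((stack.pop(), index))
        if stack:
            raise ValueError("Unbalanced '(' at position %i" % stack[-1])
        return sorted(pairs)

    def __getitem__(self, key):
        # Plain str slicing returns a str; keep the subclass for slices.
        result = str.__getitem__(self, key)
        if isinstance(key, slice):
            return self.__class__(result)
        return result
```

Because it subclasses str, an instance still has a length and can be
sliced like any other string. Note that a slice may cut through a base
pair, in which case basepairs() on the slice raises ValueError in this
sketch.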

Best Regards,
   Kristian




From krother at rubor.de  Mon Jun 14 11:13:19 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 17:13:19 +0200
Subject: [Biopython-dev] creating Protein(structure) object
Message-ID: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>


Hi Joao,

what you are describing is the classical Decorator Pattern (see
http://en.wikipedia.org/wiki/Decorator_pattern). In the books, they say
that the Decorator (Protein) must implement all methods of the decorated
object (Structure).
Of course, for a class as big as Bio.PDB.Structure, this sucks a lot. I
see two alternatives:

(1) override Protein.__getattr__(self, attr) to return self.struc.attr if
it exists. I tried this recently and it worked fine until the decorated
class used Python properties, when it started getting ugly again.

(2) have Protein inherit from Structure, and grab all the children from
the structure class, e.g.:

class Protein(Structure):
    def __init__(self, struc):
        """
        The given Structure instance becomes a Protein.
        """
        Structure.__init__(self, struc.id)
        for child in struc.child_list:
            # possibly check here whether it's a protein chain.
            # Entity.add() registers the child and sets its parent.
            self.add(child)
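
For completeness, a minimal sketch of alternative (1), delegation through
__getattr__. The Structure class here is a reduced stand-in so the example
runs without Biopython, and search_ss_bonds is only a placeholder; both
are assumptions for illustration.

```python
class Structure(object):
    """Stand-in for Bio.PDB.Structure.Structure with two attributes."""

    def __init__(self, id):
        self.id = id
        self.child_list = []


class Protein(object):
    """Wrapper that forwards unknown attribute lookups to the structure."""

    def __init__(self, struc):
        self.struc = struc

    def __getattr__(self, attr):
        # Only called when normal lookup on the Protein itself fails.
        if attr == "struc":
            # Avoid infinite recursion if struc has not been set yet.
            raise AttributeError(attr)
        return getattr(self.struc, attr)

    def search_ss_bonds(self):
        """Placeholder for the protein-specific method."""
        return []
```

With this, Protein(structure).child_list is the wrapped structure's own
list, so nothing is copied. Special methods such as __len__ or
__getitem__ are looked up on the class and are not forwarded this way,
which is one reason this approach gets awkward for larger classes.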


Any comments?
    Kristian




From biopython at maubp.freeserve.co.uk  Mon Jun 14 11:23:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 14 Jun 2010 16:23:25 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <beb3f08b05db7a4112966711a97b98e0-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXw9fVw==-webmailer2@server03.webmailer.hosteurope.de>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTinJE4eBPO8ydgrunBzWZqeN5Nw4nyuo8GEaRTzr@mail.gmail.com>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
	<AANLkTikC_fuVP4kDMckW0EqIq0oZWwcHYy-TpbnM2hCl@mail.gmail.com>
	<beb3f08b05db7a4112966711a97b98e0-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXw9fVw==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID: <AANLkTinkppj-wfBPlBtNwub8edWzks3TfmcwKzKDJml5@mail.gmail.com>

On Mon, Jun 14, 2010 at 4:01 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi,
>
> much of what I do with RNA secondary structures strongly depends on
> iterating base pairs, e.g..
>
>>>> sec = Secstruc("(((...)).)")
>>>> for bp in sec.basepairs():
>>>>    print bp
> (0, 9)
> (1, 7)
> (2, 6)
>
> also:
>>>> sec.get_helices()
>>>> sec.get_bulges()
>>>> sec.get_hairpins()
>>>> sec.contains_pseudoknot()
> .. and a couple of similar ones.
>
> The reason why I'd prefer to have something more than a string as a sec
> feature is that I wouldn't want to do all the time:
>
> sec = Secstruc(my_seq['secondary_structure'])
> sec.get_helices()
>
> but
>
> my_seq['secondary_structure'].get_helices()
>
> instead.
>
> Best Regards,
>   Kristian

That helped - thanks. Does your Secstruc object behave like a Python
sequence (string/list/tuple) in that it has a length and can be sliced (as
if acting on the string representation)? If so then it should be fine to
store in the SeqRecord's letter_annotation dictionary.

Peter


From krother at rubor.de  Mon Jun 14 11:41:05 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 17:41:05 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTinkppj-wfBPlBtNwub8edWzks3TfmcwKzKDJml5@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTinJE4eBPO8ydgrunBzWZqeN5Nw4nyuo8GEaRTzr@mail.gmail.com>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
	<AANLkTikC_fuVP4kDMckW0EqIq0oZWwcHYy-TpbnM2hCl@mail.gmail.com>
	<beb3f08b05db7a4112966711a97b98e0-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXw9fVw==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTinkppj-wfBPlBtNwub8edWzks3TfmcwKzKDJml5@mail.gmail.com>
Message-ID: <3e6714450418534d741476aa0b64b374-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1WWAhZWg==-webmailer2@server03.webmailer.hosteurope.de>


Hi Peter,

> That helped - thanks. Does your Secstruc object behave like a Python
> sequence (string/list/tuple) in that it has a length and can be sliced

Yes, it does.

> If so then it should be fine to
> store in the SeqRecord's letter_annotation dictionary.

Best,
  Kristian




From anaryin at gmail.com  Mon Jun 14 13:58:56 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 14 Jun 2010 12:58:56 -0500
Subject: [Biopython-dev] creating Protein(structure) object
In-Reply-To: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID: <AANLkTino0pTTYZZaZ9w35ig04AIa01YSRHcpunz91_DU@mail.gmail.com>

Hello Kristian,

The way I'm doing it as a workaround is:

class Protein(Structure):

    def __init__(self, protein):

        Structure.__init__(self, protein.id)

        self.full_id = protein.full_id
        self.child_list = protein.child_list
        self.child_dict = protein.child_dict
        self.parent = protein.parent
        self.xtra = protein.xtra

It works because every method I'm using deepcopies this anyway..

Adding the children one by one seems the correct way to go, but it won't
copy the headers... do we want that?

Thanks :)

J

From eric.talevich at gmail.com  Mon Jun 14 16:27:24 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 14 Jun 2010 16:27:24 -0400
Subject: [Biopython-dev] creating Protein(structure) object
In-Reply-To: <AANLkTino0pTTYZZaZ9w35ig04AIa01YSRHcpunz91_DU@mail.gmail.com>
References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTino0pTTYZZaZ9w35ig04AIa01YSRHcpunz91_DU@mail.gmail.com>
Message-ID: <AANLkTilwgdOfGyRWz57HA5aPUXzH_vfXh_xHATlejnUn@mail.gmail.com>

Hi guys,

Another convention with the Decorator pattern is to ensure that all of the
method arguments that existed in the original class are also present in the
decorated one. This includes the constructor. Decoration simply adds another
feature to whatever was already there.


João Rodrigues <anaryin at gmail.com> wrote:

> Hello Kristian,
>
> The way I'm doing it as a workaround is:
>
> class Protein(Structure):
>
>     def __init__(self, protein):
>
>         Structure.__init__(self, protein.id)
>
>         self.full_id = protein.full_id
>         self.child_list = protein.child_list
>         self.child_dict = protein.child_dict
>         self.parent = protein.parent
>         self.xtra = protein.xtra
>
>


The way the constructors of Structure and other Entity subclasses work is to
create a new object with the appropriate, empty attributes -- i.e. no
children. Other code then attaches children to the class.

To decorate a Structure with Protein-specific functionality, I would
consider:

1. The Entity constructor takes an ID, and creates empty containers for
child Entities. (Models, in this case.) So Protein.__init__ needs to start
like:

class Protein(Structure):
    def __init__(self, id):  # take any keyword arguments?
        Structure.__init__(self, id)
        # handle any keyword arguments here

2. We need to be able to convert an existing Structure to a new Protein.
That's new functionality, so it needs either a keyword argument in __init__,
or a separate method or function. If we add a keyword argument to __init__,
then the implementation is basically two completely different operations
depending on if a Structure was passed or not. Plus, there's still that 'id'
argument to deal with.

3. Instantiating a Protein directly would mean importing the
Bio.Struct.Protein module manually, in addition to "from Bio import Struct".
More to the point, Bio.Struct.Protein consists of lower-level functionality
that a casual Struct user shouldn't have to dig into, as long as
Structure.as_protein() exists. So there's no value in making
Protein.__init__ "do what I mean" at the expense of clarity in the code.
Better to make the code very obvious and explicit here, and focus on API
prettiness from a different angle.

4. The next most convenient place for Structure-to-Protein conversion is on
the Structure class. This presents a nice API that will be sufficient for
most users:

from Bio import Struct
prot = Struct.read('1ABC.pdb').as_protein()

But, going back to OOP principles, the Structure class shouldn't need to
know anything about the Protein class's internals -- though it's free to
call any public method and make things nicer for the user. So, finally, we
need a class method* on Protein that Structure.as_protein() can call.

Hence, Protein.from_structure().

[*] A class method can be called without first instantiating the class.
Since we're trying to construct a new object here, we need to be able to
call this Protein method before the Protein object exists. No worries, just
use the @classmethod decorator.


> It works because every method I'm using deepcopies this anyway..
>

If someone modifies the original Structure object after you've created a
Protein this way -- e.g. renumbering residues, or with their own function --
it will also modify the Protein object, since lists and dicts are shared. Is
this what you want?

If you're concerned about memory usage, you can also look at implementing
__deepcopy__.
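
A small demonstration of the sharing issue, using a reduced stand-in class
rather than the real Bio.PDB.Structure (an assumption so that the example
runs on its own):

```python
import copy


class Structure(object):
    """Stand-in with only the attributes needed for the demonstration."""

    def __init__(self, id):
        self.id = id
        self.child_list = []


original = Structure("1ABC")
original.child_list.append("model 0")

# Sharing the list object, as in the workaround quoted above.
shared = Structure(original.id)
shared.child_list = original.child_list

# Taking an independent deep copy instead.
independent = Structure(original.id)
independent.child_list = copy.deepcopy(original.child_list)

# A later change to the original ...
original.child_list.append("model 1")

# ... shows up in the shared list but not in the copy.
print(shared.child_list)       # ['model 0', 'model 1']
print(independent.child_list)  # ['model 0']
```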


> Adding the children one by one seems the correct way to go, but it won't
> copy the headers... do we want that?
>

Your code for copying the Structure's children looks right to me, except I
think it's best to be a little paranoid with Python lists and make deep
copies anyway. I suppose you could also copy any header info that's relevant
to proteins, using the same approach.

Best,
Eric


From anaryin at gmail.com  Mon Jun 14 23:06:03 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 14 Jun 2010 22:06:03 -0500
Subject: [Biopython-dev] creating Protein(structure) object
In-Reply-To: <AANLkTilwgdOfGyRWz57HA5aPUXzH_vfXh_xHATlejnUn@mail.gmail.com>
References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTino0pTTYZZaZ9w35ig04AIa01YSRHcpunz91_DU@mail.gmail.com> 
	<AANLkTilwgdOfGyRWz57HA5aPUXzH_vfXh_xHATlejnUn@mail.gmail.com>
Message-ID: <AANLkTil0k2HJEfXVc0_xUv39qKbQxk9oPAeiegO6aVVO@mail.gmail.com>

Ok, thanks for the long explanation!

I'll merge what you and Kristian said and come up with a better interface.
At the moment I call it like this:

# by the way, I added a trick to avoid the mandatory name of the structure
s = Struct.read("1abc.pdb")
p = s.as_protein()

Best

J

From jblanca at btc.upv.es  Tue Jun 15 01:55:45 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 15 Jun 2010 07:55:45 +0200
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <AANLkTik6TlYQER9MOEf56NZi0bNLANe_uJK3qjSKlQVG@mail.gmail.com>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
	<AANLkTik6TlYQER9MOEf56NZi0bNLANe_uJK3qjSKlQVG@mail.gmail.com>
Message-ID: <201006150755.45162.jblanca@btc.upv.es>

On Monday 14 June 2010 16:44:50 Peter wrote:
> Hi all,
>
> You may recall late last year I posted about adding a reverse
> complement method to the SeqRecord, and addition support:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html
> http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html
>
> SeqRecord addition was included in Biopython 1.53, but
> not the reverse_complement() method - which is something
> I wanted to use again today to reverse complement an
> annotated GenBank file and have all the SeqFeature
> locations flipped for me. I've rescued my old code and
> its unit tests and created a new branch for it:
> http://github.com/peterjc/biopython/commits/seqrecord-rc
>
> As I said at the end of last year, I think the general idea of
> a SeqRecord reverse_complement() method is nice but the
> details of handling the annotation are tricky. When we
> discussed slicing and addition, it was agreed that we
> should be cautious to avoid blindly transferring annotation
> inappropriately. The code on this branch allows the user to
> choose for each annotation type if it should be dropped
> (False), kept (True) or set to a supplied new value. The
> docstring has examples of how this works (which double
> as doctests).

Having a reverse_complement method would be useful for us. But it could be
quite tricky to reverse complement some features. For instance, we have SNP
features that include a reference nucleotide. We would have to complement that
nucleotide too.

Regards,


-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk  Tue Jun 15 05:08:14 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Jun 2010 10:08:14 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <201006150755.45162.jblanca@btc.upv.es>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
	<AANLkTik6TlYQER9MOEf56NZi0bNLANe_uJK3qjSKlQVG@mail.gmail.com>
	<201006150755.45162.jblanca@btc.upv.es>
Message-ID: <AANLkTikDBQHR2DJ2wKoBfMWsj8VifkgOeBkpdFfE5KVG@mail.gmail.com>

On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
>
> Having a reverse_complement method would be useful for us. But it could be
> quite tricky to reverse complement some features. For instance we have SNP
> features that include a reference nucleotide. We would have to complement that
> nucleotide too.
>

Could you give an example? I assume you are talking about the annotation
of the feature (i.e. the qualifiers dictionary of a SeqFeature object).

Peter

From jblanca at btc.upv.es  Tue Jun 15 05:23:27 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 15 Jun 2010 11:23:27 +0200
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <AANLkTikDBQHR2DJ2wKoBfMWsj8VifkgOeBkpdFfE5KVG@mail.gmail.com>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<201006150755.45162.jblanca@btc.upv.es>
	<AANLkTikDBQHR2DJ2wKoBfMWsj8VifkgOeBkpdFfE5KVG@mail.gmail.com>
Message-ID: <201006151123.27158.jblanca@btc.upv.es>

On Tuesday 15 June 2010 11:08:14 Peter wrote:
> On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> > Having a reverse_complement method would be useful for us. But it could
> > be quite tricky to reverse complement some features. For instance we have
> > SNP features that include a reference nucleotide. We would have to
> > complement that nucleotide too.
>
> Could you give an example? I assume you are talking about the annotation
> of the feature (i.e. the qualifiers dictionary of a SeqFeature object).

That is right; in some instances the qualifiers should be modified. For
instance, if we have an ORF with a qualifier 'forward':True, it should be
changed. I don't think this change can be done automatically.


-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk  Tue Jun 15 05:42:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Jun 2010 10:42:47 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <201006151123.27158.jblanca@btc.upv.es>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<201006150755.45162.jblanca@btc.upv.es>
	<AANLkTikDBQHR2DJ2wKoBfMWsj8VifkgOeBkpdFfE5KVG@mail.gmail.com>
	<201006151123.27158.jblanca@btc.upv.es>
Message-ID: <AANLkTili8DK7OJqeqtEgY6dqnGWUEYpOZhWYJC6dWPYi@mail.gmail.com>

On Tue, Jun 15, 2010 at 10:23 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> On Tuesday 15 June 2010 11:08:14 Peter wrote:
>> On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
>> > Having a reverse_complement method would be useful for us. But it could
>> > be quite tricky to reverse complement some features. For instance we have
>> > SNP features that include a reference nucleotide. We would have to
>> > complement that nucleotide too.
>>
>> Could you give an example? I assume you are talking about the annotation
>> of the feature (i.e. the qualifiers dictionary of a SeqFeature object).
>
> That is right; in some instances the qualifiers should be modified. For
> instance, if we have an ORF with a qualifier 'forward':True, it should be
> changed. I don't think this change can be done automatically.

Yes, that sort of thing would be very difficult to do automatically. We come
back to the question of what the default should be - blindly copy, or
just drop this information. I would say for most feature annotation (and
I am thinking about GenBank and EMBL style files here) there isn't
anything strand specific to worry about, so in general copying is fine.
Clearly this is not a safe assumption for SNP features.
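
As a rough illustration of the two cases discussed in this thread (not the
code on the seqrecord-rc branch): flipping a feature location is mechanical,
while a strand-specific qualifier needs a handler that knows what the
qualifier means. The function names and the 'reference' qualifier key below
are hypothetical.

```python
# Complement table for unambiguous DNA letters only (a simplification).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip_location(start, end, strand, seq_length):
    """Map a 0-based, end-exclusive location onto the reverse complement.

    The interval is mirrored around the sequence and the strand is negated.
    """
    return seq_length - end, seq_length - start, -strand


def flip_snp_qualifiers(qualifiers):
    """Hypothetical handler: complement the 'reference' base of a SNP.

    A generic reverse_complement cannot know that this value is strand
    specific, so something like this has to be supplied by the caller.
    """
    flipped = dict(qualifiers)  # leave the caller's dictionary untouched
    if "reference" in flipped:
        flipped["reference"] = COMPLEMENT[flipped["reference"]]
    return flipped


new_location = flip_location(2, 5, +1, seq_length=10)
new_qualifiers = flip_snp_qualifiers({"reference": "A", "note": "SNP"})
```

Qualifiers without strand-specific meaning (the 'note' above) can be copied
unchanged, which is the common case for GenBank and EMBL style features.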

Peter

From krother at rubor.de  Tue Jun 15 10:06:52 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 15 Jun 2010 16:06:52 +0200
Subject: [Biopython-dev] RNA Alphabet: request for comments
Message-ID: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de>


Hi,

I've committed a proof-of-concept implementation of how modified RNA bases
could be made compatible with Biopython Alphabets. Comments are very
welcome, especially because I had to change two lines in the Seq class to
make it work.

The code can be viewed on:
http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa
(on github: krother/biopython, branch rna_alphabet).

The two main classes are:
RNAAlphabetEntry(str) that contains different abbreviations for one base.
and
ModifiedRNAString(str) that behaves like a string except that it iterates
through RNAAlphabetEntry objects.

Thus, you can do:

>>> from Bio.Alphabet.ModifiedRNAAlphabet import modified_rna
>>> from Bio.Seq import Seq
>>> from Bio.RNA.ModifiedRNAString import ModifiedRNAString
>>>
>>> mod_seq = ModifiedRNAString('AA:"A')
>>> seq = Seq(mod_seq, modified_rna)
>>> for char in seq:
...     print(char)
adenosine
adenosine
2-O-methyladenosine
1-methyladenosine
adenosine

(see Unit test for details).

Best Regards,
    Kristian


From biopython at maubp.freeserve.co.uk  Tue Jun 15 10:46:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Jun 2010 15:46:10 +0100
Subject: [Biopython-dev] RNA Alphabet: request for comments
In-Reply-To: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de>
References: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de>
Message-ID: <AANLkTimG8g7OvFhTmyUL-rkQD5QAqp3XAK7QMRLL0Qbb@mail.gmail.com>

On Tue, Jun 15, 2010 at 3:06 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi,
>
> I've committed a proof-of-concept implementation of how modified RNA bases
> could be made compatible with Biopython Alphabets. Comments are very
> welcome, especially because I had to change two lines in the Seq class to
> make it work.
>
> The code can be viewed on:
> http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa
> (on github: krother/biopython, branch rna_alphabet).
>
> The two main classes are:
> RNAAlphabetEntry(str) that contains different abbreviations for one base.
> and
> ModifiedRNAString(str) that behaves like a string except that it iterates
> through RNAAlphabetEntry objects.
>

Why not create a Seq subclass instead of your class ModifiedRNAString(str)?
This would then implement suitable (reverse) complement etc.

I would also have __iter__ and __getitem__ for a single letter return an
instance of RNAAlphabetEntry (which would act like a single character string).
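
A minimal sketch of that suggestion, assuming nothing about the code on the
rna_alphabet branch: a plain class stands in for the Seq subclass so the
example is self-contained, and the long_name attribute and the ENTRIES table
are made up for illustration. The three names are the ones printed in the
example earlier in the thread.

```python
class RNAAlphabetEntry(str):
    """One base: behaves like its one-letter code, but carries a long name."""

    def __new__(cls, letter, long_name):
        entry = super().__new__(cls, letter)
        entry.long_name = long_name  # hypothetical attribute name
        return entry


# Illustrative table only; a real alphabet would hold every modification.
ENTRIES = {
    "A": RNAAlphabetEntry("A", "adenosine"),
    ":": RNAAlphabetEntry(":", "2-O-methyladenosine"),
    '"': RNAAlphabetEntry('"', "1-methyladenosine"),
}


class ModifiedRNASeqSketch:
    """Stand-in for a Seq subclass whose items are RNAAlphabetEntry objects."""

    def __init__(self, data):
        self._entries = [ENTRIES[letter] for letter in data]

    def __len__(self):
        return len(self._entries)

    def __iter__(self):
        return iter(self._entries)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slices give back a sequence, single indexes give one entry.
            return ModifiedRNASeqSketch("".join(self._entries[index]))
        return self._entries[index]


seq = ModifiedRNASeqSketch('AA:"A')
names = [entry.long_name for entry in seq]
```

Because each entry is still a str, code that only expects single-character
strings keeps working, while RNA-aware code can read the extra attribute.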

Peter

From bugzilla-daemon at portal.open-bio.org  Tue Jun 15 12:23:00 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 15 Jun 2010 12:23:00 -0400
Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord?
In-Reply-To: <bug-3060-42@http.bugzilla.open-bio.org/>
Message-ID: <201006151623.o5FGN0K6028619@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3060


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-15 12:22 EST -------
Patch applied to this branch:
http://github.com/peterjc/biopython/tree/seqrecord-rc


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From krother at rubor.de  Wed Jun 16 04:32:29 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 16 Jun 2010 10:32:29 +0200
Subject: [Biopython-dev] RNA Alphabet: request for comments
Message-ID: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>


Hi Peter,

> Why not create a Seq subclass instead of your class ModifiedRNAString(str)?

This turned out to be a lot simpler. Worked right away. New commit at:

http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70

more comments welcome.

Next steps from my side would be:

1) add all modifications to the Alphabet.
2) add some RNA-specific methods.
3) add more tests.
4) sync with latest master branch.
5) request code merge.

Best regards,
     Kristian



From biopython at maubp.freeserve.co.uk  Wed Jun 16 04:51:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Jun 2010 09:51:03 +0100
Subject: [Biopython-dev] RNA Alphabet: request for comments
In-Reply-To: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>
References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>
Message-ID: <AANLkTimATWDdddH5wmvD5i2BPRPvaJsb0qmqLVEDzfFe@mail.gmail.com>

On Wed, Jun 16, 2010 at 9:32 AM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
>> Why not create a Seq subclass instead of your class ModifiedRNAString(str)?
>
> This turned out to be a lot simpler. Worked right away. New commit at:
>
> http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70
>
> more comments welcome.

Why do you need the  _set_sequence method? Why not just put that
small piece of code inside the __init__ method?

> Next steps from my side would be:
>
> 1) add all modifications to the Alphabet.
> 2) add some RNA-specific methods.
> 3) add more tests.
> 4) sync with latest master branch.
> 5) request code merge.
>
> Best regards,
>     Kristian

If this works out we should look at doing a Protein 3-letter code version
for use with PDB sequences (I'm thinking about the modified amino acids).

Peter


From krother at rubor.de  Wed Jun 16 05:03:37 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 16 Jun 2010 11:03:37 +0200
Subject: [Biopython-dev] RNA Alphabet: request for comments
In-Reply-To: <AANLkTimATWDdddH5wmvD5i2BPRPvaJsb0qmqLVEDzfFe@mail.gmail.com>
References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>
	<AANLkTimATWDdddH5wmvD5i2BPRPvaJsb0qmqLVEDzfFe@mail.gmail.com>
Message-ID: <ba1c601c7f33e7f6ae3f22729e528388-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SUQheVw==-webmailer2@server02.webmailer.hosteurope.de>


Hi Peter,

> Why do you need the  _set_sequence method? Why not just put that
> small piece of code inside the __init__ method?

In _set_sequence there'll be a small parser taking care of modifications
where the one-letter abbreviations do not suffice. E.g. a sequence could
be

"CCC022UCCC"

(22U is a 5-hydroxyuridine).

--> being parsed into a list of RNAAlphabetEntries
['C','C','C','22U','C','C','C']

So the code will grow a little, but the basic idea stays the same.

If someone wants a one-letter representation, it could be "CCCxCCC", but
this is degenerate because 'x' is used for several modifications.
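
A sketch of such a parser, assuming (from the single example above) that "0"
introduces a three-character code. The function name and both keyword
parameters are hypothetical, and the real _set_sequence may well work
differently.

```python
def tokenize_modified_rna(text, escape="0", code_length=3):
    """Split a string into one-letter bases and escaped multi-letter codes.

    Assumption: the escape character is followed by exactly code_length
    characters naming one modified base.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == escape:
            code = text[i + 1:i + 1 + code_length]
            if len(code) < code_length:
                raise ValueError(
                    "Truncated code after %r at position %d" % (escape, i)
                )
            tokens.append(code)
            i += 1 + code_length  # skip the escape character and the code
        else:
            tokens.append(text[i])
            i += 1
    return tokens


tokens = tokenize_modified_rna("CCC022UCCC")
```

For the example string this gives the list shown above, with '22U' as a
single element between the two runs of 'C'.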

Best Regards,
   Kristian



From biopython at maubp.freeserve.co.uk  Wed Jun 16 05:41:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Jun 2010 10:41:35 +0100
Subject: [Biopython-dev] RNA Alphabet: request for comments
In-Reply-To: <ba1c601c7f33e7f6ae3f22729e528388-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SUQheVw==-webmailer2@server02.webmailer.hosteurope.de>
References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>
	<AANLkTimATWDdddH5wmvD5i2BPRPvaJsb0qmqLVEDzfFe@mail.gmail.com>
	<ba1c601c7f33e7f6ae3f22729e528388-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SUQheVw==-webmailer2@server02.webmailer.hosteurope.de>
Message-ID: <AANLkTimp5cvKxczZYPBM3n47CBTtltrNLDOUxgCzYfoq@mail.gmail.com>

On Wed, Jun 16, 2010 at 10:03 AM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
>> Why do you need the _set_sequence method? Why not just put that
>> small piece of code inside the __init__ method?
>
> In _set_sequence there'll be a small parser taking care of modifications
> where the one-letter abbreviations do not suffice. E.g. a sequence could
> be
>
> "CCC022UCCC"
>
> (22U is a 5-hydroxyuridine).
>
> --> being parsed into a list of RNAAlphabetEntries
> ['C','C','C','22U','C','C','C']
>
> So the code will grow a little, but the basic idea stays the same.
>
> If someone wants a one-letter representation, it could be "CCCxCCC", but
> this is degenerate because 'x' is used for several modifications.
>
> Best Regards,
>   Kristian

Thinking ahead, we are planning to make the Seq objects use string
comparison instead of object identity. When that happens, I would
suggest that your subclass implement the equality method so that, when
comparing against another instance of the modified RNA Seq, it compares
at the more detailed "22U" level, and otherwise, for compatibility,
compares at the single-letter level ("x", even though degenerate).
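
A small sketch of that two-level comparison, using a stand-alone class rather
than a real Seq subclass. The class name, the tokens attribute and the
one_letter() helper are hypothetical, as is the rule that every multi-letter
code collapses to "x".

```python
class ModifiedRNASketch:
    """Holds a list of tokens such as ['C', '22U', 'C']."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def one_letter(self):
        # Degenerate view: any multi-letter modification becomes 'x'.
        return "".join(t if len(t) == 1 else "x" for t in self.tokens)

    def __eq__(self, other):
        if isinstance(other, ModifiedRNASketch):
            # Both sides know about modifications: compare in full detail.
            return self.tokens == other.tokens
        # Anything else (a plain string here): compare one-letter forms.
        return self.one_letter() == str(other)

    def __ne__(self, other):
        return not self == other


a = ModifiedRNASketch(["C", "22U", "C"])
b = ModifiedRNASketch(["C", "22U", "C"])
c = ModifiedRNASketch(["C", "m1A", "C"])  # 'm1A' is a made-up code
```

One consequence worth noting: a and c both compare equal to the string "CxC"
but not to each other, so equality is no longer transitive across types.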

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Jun 16 08:43:07 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 16 Jun 2010 08:43:07 -0400
Subject: [Biopython-dev] [Bug 3100] New: Bio.PDB.ResidueDepth distance
	calculation error
Message-ID: <bug-3100-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3100

           Summary: Bio.PDB.ResidueDepth distance calculation error
           Product: Biopython
           Version: 1.54b
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: andres.colubri at gmail.com


ResidueDepth.py in Bio.PDB contains an error at line 100:

d2=sum(d*d, 1)

This uses the built-in sum() function, which adds the rows of d*d together,
starting from an initial value of 1, instead of summing along axis 1. It should
use numpy's sum instead:

d2=numpy.sum(d*d, 1)

To check the error, try the following code:

from Bio.PDB import PDBParser
from Bio.PDB.ResidueDepth import get_surface, min_dist
parser = PDBParser()
structure = parser.get_structure('test', '3M38.pdb')
surf = get_surface('3M38.pdb', PDB_TO_XYZR='./pdb_to_xyzr', MSMS='./msms')
print(min_dist(surf[10], surf))

3M38.pdb could be replaced by any other pdb file. The result of this
calculation printed to the console should be zero, since we are calculating the
minimum distance to the surface of a point belonging to the surface. But this
gives a value greater than zero.
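
The difference can also be seen without MSMS or a PDB file. The following is
a small sketch with a made-up two-point "surface"; the two function names are
hypothetical and only mimic the distance calculation described above.

```python
import numpy


def min_dist_builtin(coord, surface):
    """Distance calculation using the built-in sum(), as in the report."""
    d = surface - coord
    d2 = sum(d * d, 1)  # iterates over rows: 1 + row0 + row1 + ...
    return numpy.sqrt(min(d2))


def min_dist_numpy(coord, surface):
    """Same calculation with numpy.sum() along axis 1 (one value per point)."""
    d = surface - coord
    d2 = numpy.sum(d * d, 1)
    return numpy.sqrt(min(d2))


# Two surface points; the query point is the first of them.
surface = numpy.array([[1.0, 2.0, 3.0],
                       [4.0, 6.0, 3.0]])
buggy = min_dist_builtin(surface[0], surface)
correct = min_dist_numpy(surface[0], surface)
```

With the built-in sum() the result has one entry per coordinate axis rather
than one per surface point, and the starting value of 1 is added to each
entry, so the minimum cannot reach zero.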



From lueck at ipk-gatersleben.de  Wed Jun 16 09:18:00 2010
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Wed, 16 Jun 2010 15:18:00 +0200
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris
In-Reply-To: <AANLkTimJSOqEgUHokfs3P-6MNS5yKxl4_CNB5f8-X0AR@mail.gmail.com>
References: <AANLkTimJSOqEgUHokfs3P-6MNS5yKxl4_CNB5f8-X0AR@mail.gmail.com>
Message-ID: <001a01cb0d56$581dd610$1022a8c0@ipkgatersleben.de>

Hello!

Sorry for the late reply, but I just came back from my holidays.
I went to EuroSciPy 2009 and it was really great (I also gave a talk
in which Biopython was mentioned several times ;-). Since it was problematic
to go last time, I decided to skip it this year (in principle I would have to
come as a private individual). Unfortunately I only now hear that the Biopython
people will be there, and I would be very interested to meet you, since I use
Biopython a lot.


I will have to see what I can still do.
It would be great to meet!

Stefanie 

-----Original Message-----
From: biopython-dev-bounces at lists.open-bio.org
[mailto:biopython-dev-bounces at lists.open-bio.org] On Behalf Of Peter
Sent: Saturday, 5 June 2010 16:50
To: Biopython-Dev Mailing List
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris

Hi all,

Are any Biopython folk planning to be at the EuroSciPy
conference in Paris this year (July 2010)? They are still
finalising the Scientific track, but the list of tutorials is
quite interesting already:

http://www.euroscipy.org/conference/euroscipy2010

Peter
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev


From bugzilla-daemon at portal.open-bio.org  Fri Jun 18 09:19:02 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 09:19:02 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastaq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006181319.o5IDJ2Oj022977@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


cjfields at bioperl.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|bioperl-guts-l at bioperl.org  |biopython-dev at biopython.org


------- Comment #3 from cjfields at bioperl.org  2010-06-18 09:18 EST -------
(In reply to comment #2)
> (In reply to comment #1)
> > I'm making a wild guess that this is Biopython and not BioPerl.  
> 
> Yes, it's Biopython. Can you help me, please? Or can you give me a link where
> to find the answer for my problem? Thank you very much. 

Reassigning to the Biopython devs.  This should go to their list now, hopefully
you'll get a response.



From bugzilla-daemon at portal.open-bio.org  Fri Jun 18 09:45:37 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 09:45:37 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006181345.o5IDjbNB023730@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Error converting sff into   |Error converting sff into
                   |fastaq                      |fastq


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-18 09:45 EST -------
Thanks Chris.

Giorgio - Could you confirm which version of Biopython you are using?

To me the error message suggests the SFF file is corrupted (damaged). Is it
very large? Could you attach it to this bug (or email it to me personally) to
check?

Have you been able to process the SFF file with any other tools (e.g.
sff_extract which should work on Windows/Linux/Mac, or the Roche tools which
are Linux only)?

If you copied the SFF file over your network, or over the internet from your
sequencing center, perhaps there was an error there. Could you try
re-downloading the SFF file?

Regards,

Peter



From bugzilla-daemon at portal.open-bio.org  Fri Jun 18 11:03:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 11:03:45 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006181503.o5IF3j23025689@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


------- Comment #5 from gcasaburi at tiscali.it  2010-06-18 11:03 EST -------
(In reply to comment #4)
> Thanks Chris.
> Giorgio - Could you confirm which version of Biopython you are using?
> To me the error message suggests the SFF file is corrupted (damaged). Is it
> very large? Could you attach it to this bug (or email it to me personally) to
> check?
> Have you been able to process the SFF file with any other tools (e.g.
> sff_extract which should work on Windows/Linux/Mac, or the Roche tools which
> are Linux only)?
> If you copied the SFF file over your network, or over the internet from your
> sequencing center, perhaps there was an error there. Could you try
> re-downloading the SFF file?
> Regards,
> Peter
Thank you for the answer. I have the latest version of Biopython. The file is
1.12 GB, so I think it is difficult to attach. The file was copied
directly from the USB port of the 454 with a pen drive and is now on a normal
PC. With Biopython I have been able to open and read this SFF file, but at the
end of the reading the message (ValueError: ...) appears. When I try to convert
the file to FASTA the same message appears, blocking any work. So why can the
file be opened and read, with all its information (flow, length), but not
edited or converted? Thank you, I hope you can help us.
Greetings from Italy



From bugzilla-daemon at portal.open-bio.org  Fri Jun 18 11:28:01 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 11:28:01 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006181528.o5IFS1iY026418@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-18 11:28 EST -------
(In reply to comment #5)
> Thank you for the answer. I have the latest version of Biopython.

Good.

> The file is 1.12 GB, so I think it is difficult to attach.

Yes, too big to attach or email :(

> The file was copied directly from the USB port of the 454 with a
> pen drive and is now on a normal PC.

I would try copying it again using a different USB memory stick / pen drive.

> With Biopython I have been able to open and read this SFF file, but at the
> end of the reading the message (ValueError: ...) appears. When I try to
> convert the file to FASTA the same message appears, blocking any work. So why
> can the file be opened and read, with all its information (flow, length), but
> not edited or converted? Thank you, I hope you can help us.
> Greetings from Italy

It sounds like there is an error near the end of the file. You can open the
file and read lots of reads up until the error. If you use Bio.SeqIO.parse()
or Bio.SeqIO.convert() these will fail once you get to the bad read. Perhaps
the file is truncated (only partly copied)?
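
To find out how far into the file the problem is, one option is to count the
reads that come back before the exception. This is only a sketch: the helper
name is hypothetical, and it works on any iterator so that it can be tried
without an SFF file.

```python
def count_reads_before_error(records):
    """Consume an iterator of records until it ends or raises ValueError.

    Returns (number of records read, the ValueError or None). Records are
    counted rather than stored, so a large file does not fill memory.
    """
    count = 0
    try:
        for _record in records:
            count += 1
    except ValueError as err:
        return count, err
    return count, None


def fake_reads():
    """Stand-in for a parser that hits a damaged read after two good ones."""
    yield "read1"
    yield "read2"
    raise ValueError("bad read")


good_count, error = count_reads_before_error(fake_reads())
```

With Biopython installed, the same helper could be given the iterator from
Bio.SeqIO.parse(filename, "sff"). If the count is close to the expected
number of reads, that would point to damage near the end of the file.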

Peter



From bugzilla-daemon at portal.open-bio.org  Fri Jun 18 13:35:00 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 13:35:00 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006181735.o5IHZ0SW030183@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


------- Comment #7 from gcasaburi at tiscali.it  2010-06-18 13:35 EST -------
(In reply to comment #6)
I will try copying the file again onto another pen drive. Like you, I thought
the file may be corrupted at the end. I don't think it is truncated; in fact it
is an .sff that represents one region of the "ptp", but the same error appears
with another file, .sff2, that represents the second region of the "ptp"
(divided into two regions for the same run: 2 regions in total, one per sample,
two samples in total). So I don't know whether there is a command to fix the
value error.
Thank you
Giorgio



From bugzilla-daemon at portal.open-bio.org  Tue Jun 22 09:11:15 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 22 Jun 2010 09:11:15 -0400
Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord?
In-Reply-To: <bug-3060-42@http.bugzilla.open-bio.org/>
Message-ID: <201006221311.o5MDBF8o003119@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3060


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-22 09:11 EST -------
(In reply to comment #0)
> My motivating example is to take an ACE file loaded with SeqIO, remove the
> gaps, and output the contigs as FASTQ or QUAL files. This requires the
> per-letter-annotation to be sliced to match the ungapped sequence.
> 
> Likewise any features fully contained within ungapped regions should be
> retained and their co-ordinates shifted. I'm not sure if we should do anything
> about features spanning a gap - the simple option which I have implemented is
> they are lost. This is done via the existing SeqRecord slicing and addition
> code.

I've been trying to build SeqFeature objects for the reads in an ACE file,
http://github.com/peterjc/biopython/tree/ace-reads

In this case when I call the SeqRecord ungap method, many of my read features
are lost with the current implementation (because they included gaps). This
also showed the ungap code to be quite slow for features. I'm going to have
another look at this.

Peter



From bugzilla-daemon at portal.open-bio.org  Tue Jun 22 10:58:39 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 22 Jun 2010 10:58:39 -0400
Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord?
In-Reply-To: <bug-3060-42@http.bugzilla.open-bio.org/>
Message-ID: <201006221458.o5MEwd0I005797@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3060


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1482 is|0                           |1
           obsolete|                            |


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-22 10:58 EST -------
(From update of attachment 1482)
(In reply to comment #3)
> 
> I've been trying building SeqFeature objects for the reads in an ACE file,
> http://github.com/peterjc/biopython/tree/ace-reads
> 
> In this case when I call the SeqRecord ungap method, many of my read features
> are lost with the current implementation (because they included gaps). This
> also showed the ungap code to be quite slow for features. I'm going to have
> another look at this.

My new code handles SeqFeature ungapping so as to preserve all the features by
adjusting their end points. This is also much faster:

http://github.com/peterjc/biopython/tree/ungap2
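
For anyone following along, the idea of shifting feature end points when gaps
are removed can be sketched in plain Python as below. This is only an
illustration and not the code in the ungap2 branch; the function names and the
"-" gap character are assumptions made for the example.

```python
def ungap_coordinates(gapped_seq, gap="-"):
    """Map every gapped position (0..len) to its ungapped position."""
    mapping = []
    ungapped_count = 0
    for letter in gapped_seq:
        mapping.append(ungapped_count)
        if letter != gap:
            ungapped_count += 1
    # Extra entry so that end-exclusive coordinates can be looked up too
    mapping.append(ungapped_count)
    return mapping


def ungap_feature(start, end, mapping):
    """Shift a feature's start/end (Python slice style) to ungapped space."""
    return mapping[start], mapping[end]


seq = "AC--GT-A"
mapping = ungap_coordinates(seq)
# A feature covering "C--G" in the gapped sequence is the slice [1:5]
new_start, new_end = ungap_feature(1, 5, mapping)
ungapped = seq.replace("-", "")
```

With this approach a feature that spans a gap is kept and simply becomes
shorter, rather than being dropped.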



From anaryin at gmail.com  Tue Jun 22 15:25:17 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 22 Jun 2010 14:25:17 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
Message-ID: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>

Hello all,

I've been using some non-standard PDB files produced by programs that omit
the chemical element column in each ATOM line. I was looking at the
PDBParser code and element is dealt with like this:

        if element is None:
            import warnings
            from PDBExceptions import PDBConstructionWarning
            warnings.warn("Atom object (name=%s) without element" % name,
                          PDBConstructionWarning)
            element = "?"
            print name, "--> ?"
        elif len(element)>2 or element != element.upper() or \
                element != element.strip():
            raise ValueError(element)
        self.element=element


In my case, the element is not None but just an empty string - '' - which
slips past both of these tests and is stored as it is. This would be no
problem at all, but I've added a "mass" attribute to the Atom object defined
like this:

        self.mass = IUPACData.atom_weights[element]

I've added the ? to the atom_weights list as I thought it would deal with
the empty element cases.

I'd suggest adding to the first if statement a test to check if the element
string is empty and if so, treat it as None.

        if element is None or element == '':
            import warnings
            from PDBExceptions import PDBConstructionWarning
            warnings.warn("Atom object (name=%s) without element" % name,
                          PDBConstructionWarning)
            element = "?"
            print name, "--> ?"
        elif len(element)>2 or element != element.upper() or \
                element != element.strip():
            raise ValueError(element)
        self.element=element


What do you think?

Best!

João [...] Rodrigues
@ http://doeidoei.wordpress.org


From biopython at maubp.freeserve.co.uk  Wed Jun 23 05:11:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Jun 2010 10:11:06 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
Message-ID: <AANLkTinz_AEn08DU-61V1i5xGA6N5sxYDf-JUCJXmrNH@mail.gmail.com>

On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Hello all,
>
> I've been using some non-standard pdb files outputted by some programs and
> they miss the chemical element column in each ATOM line. I was looking at
> the PDBParser code and element is dealt with like this:
>
>        if element is None:
>            import warnings
>            from PDBExceptions import PDBConstructionWarning
>            warnings.warn("Atom object (name=%s) without element" % name,
>                          PDBConstructionWarning)
>            element = "?"
>            print name, "--> ?"
>        elif len(element)>2 or element != element.upper() or \
>                element != element.strip():
>            raise ValueError(element)
>        self.element=element
>
>
> In my case, the element line is not "None" but just an empty string - ' ' -
> which fails these tests and is then passed on.

That makes sense, since element=line[76:78].strip() will give an empty
string. A change as you suggest makes sense, but I think just using
"if element:" would be nicer.

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 23 06:28:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Jun 2010 11:28:22 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
Message-ID: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>

Hi all,

From some unit test output posted by Manabu Ishii via Twitter I
think the test suite is having problems checking for external tools
on non-English operating systems (e.g. Debian in Japanese):
http://d.hatena.ne.jp/manabou/20100619
http://twitter.com/manabou

I've tried to update a few to do a better job (test_Muscle_tool.py,
test_Clustalw_tool.py and test_Emboss.py), but what I really need
is someone to run the test suite on a non English system - ideally
without all these command line tools installed. The tests should
notice when the tool is missing, and be skipped without errors.
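
One way to make such a check independent of the system language is to look
the executable up on the PATH rather than inspect any message text. A minimal
sketch, assuming Python 3.3 or later for shutil.which (so not something the
test suite of this era could use directly); the helper name tool_available is
made up for the example:

```python
import shutil


def tool_available(executable):
    """Return True if the named executable can be found on the PATH.

    This never looks at error messages, so the result does not depend on
    the language the operating system reports errors in.
    """
    return shutil.which(executable) is not None


# Hypothetical use in a test module: decide whether to skip before running
tools = ["muscle", "clustalw", "water"]
missing = [name for name in tools if not tool_available(name)]
```

A test could then be skipped whenever its tool appears in the missing list.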

Could anyone with a non-English OS try running the latest code
from git (or even the latest release) to see if you get similar
problems?

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org  Wed Jun 23 09:21:25 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Jun 2010 09:21:25 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006231321.o5NDLPm0017094@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-23 09:21 EST -------
Hi Giorgio,

Did copying the file again help?

In addition to trying to read the SFF files with other tools (like sff_extract
or the Roche sffinfo) as suggested, I have some additional things you could
try.

Firstly try this private function to see how many reads there should be:

filename = r"C:\Users\Giorgio Casaburi\Desktop\sff\GIK1EHM01.sff"
from Bio import SeqIO
print SeqIO.SffIO._sff_file_header(open(filename, "rb"))[3]

Then compare this to the number of reads you could extract up until the error.

Secondly, see if the index can be loaded or not:

filename = r"C:\Users\Giorgio Casaburi\Desktop\sff\GIK1EHM01.sff"
from Bio import SeqIO
d = SeqIO.index(filename, "sff")
print len(d)

If it is just one or two bad reads, this may allow you to jump to specific
records (and so avoid getting stuck on the bad ones).
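
As a rough sketch of that idea, the helper below walks any dictionary-like
index and sets aside the keys whose records cannot be loaded. The function
name is hypothetical, and it assumes a bad read shows up as an exception when
the record is fetched, which may not be how this particular file fails.

```python
def split_readable_records(index):
    """Return (records, bad_keys) for a dictionary-like record index."""
    records = []
    bad_keys = []
    for key in index:
        try:
            records.append(index[key])
        except Exception:
            # Remember the problem read and carry on with the others
            bad_keys.append(key)
    return records, bad_keys


# Intended use (needs Biopython and the SFF file, so left as comments):
#   from Bio import SeqIO
#   d = SeqIO.index(filename, "sff")
#   records, bad_keys = split_readable_records(d)
```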

Peter



From anaryin at gmail.com  Wed Jun 23 12:52:47 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Wed, 23 Jun 2010 11:52:47 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTinz_AEn08DU-61V1i5xGA6N5sxYDf-JUCJXmrNH@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com> 
	<AANLkTinz_AEn08DU-61V1i5xGA6N5sxYDf-JUCJXmrNH@mail.gmail.com>
Message-ID: <AANLkTin8KM8BYJc9sr-KX1o1l7zJ5A901Tv0QvEvd0nt@mail.gmail.com>

Ok, I've changed it in my local branch to if not element since that covers
both None and empty strings.
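
For reference, the check then looks roughly like the standalone sketch below.
It is not the actual Atom code: the function name and the warning class are
stand-ins (the real code uses PDBConstructionWarning), chosen so the example
runs without Biopython.

```python
import warnings


class MissingElementWarning(UserWarning):
    """Stand-in for PDBConstructionWarning in this sketch."""


def normalise_element(element, name):
    """Return a usable element string, or "?" if none was given."""
    if not element:
        # Covers both None and the empty string from blank columns
        warnings.warn("Atom object (name=%s) without element" % name,
                      MissingElementWarning)
        return "?"
    if (len(element) > 2 or element != element.upper()
            or element != element.strip()):
        raise ValueError(element)
    return element


# A line with blank element columns gives an empty string, not None
blank_element = "ATOM".ljust(80)[76:78].strip()
```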

Best,

João [...] Rodrigues
@ http://doeidoei.wordpress.org




From biopython at maubp.freeserve.co.uk  Thu Jun 24 04:26:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Jun 2010 09:26:50 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTin8KM8BYJc9sr-KX1o1l7zJ5A901Tv0QvEvd0nt@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
	<AANLkTinz_AEn08DU-61V1i5xGA6N5sxYDf-JUCJXmrNH@mail.gmail.com>
	<AANLkTin8KM8BYJc9sr-KX1o1l7zJ5A901Tv0QvEvd0nt@mail.gmail.com>
Message-ID: <AANLkTilOh1wWfJZexI47ohbrnharVbXFvMIUyB4L9YAW@mail.gmail.com>

On Wed, Jun 23, 2010 at 5:52 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Ok, I've changed it in my local branch to if not element since that covers
> both None and empty strings.
>
> Best,
>
> João [...] Rodrigues
> @ http://doeidoei.wordpress.org

If you've done that little change as a single commit, then I can use
git cherry-pick to apply it to the master branch. But first you need to
push this work to github.com

Peter


From biopython at maubp.freeserve.co.uk  Thu Jun 24 04:32:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Jun 2010 09:32:46 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
Message-ID: <AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com>

On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Hello all,
>
> I've been using some non-standard pdb files outputted by some programs and
> they miss the chemical element column in each ATOM line. ... This would be no
> problem at all, but I've added a "mass" attribute to the Atom object defined like this:
>
>        self.mass = IUPACData.atom_weights[element]
>
> I've added the ? to the atom_weights list as I thought it would deal with
> the empty element cases.

I wonder if using None or NAN would be better than zero here? Or just an
exception. This is difficult for me to say without a better idea of what you
will be using the atomic weights for.

On a separate point, if you have an old fashioned PDB file without the element
column, you can probably work out the element anyway. For example CA in
a normal amino acid residue means the alpha carbon, so the element is
carbon (although in a HETATM there is a possibility it is calcium, I think).
So I think it would be possible to infer the element in many cases (but not
all). However, this is going to be a reasonable amount of work to write and
test. How common is this kind of PDB file for the work you are doing - do
many modelling packages omit the element?

Have you contacted the program authors to request they include the
element column in future?

Peter


From anaryin at gmail.com  Thu Jun 24 12:36:36 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 24 Jun 2010 11:36:36 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com> 
	<AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com>
Message-ID: <AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com>

>
> I wonder if using None or NAN would be better than zero here? Or just an
> exception. This is difficult for me to say without a better idea of what
> you
> will be using the atomic weights for.
>

Right now I'm just using them for the center of mass calculation.


>
> On a separate point, if you have an old fashioned PDB file without the
> element
> column, you can probably work out the element anyway. For example CA in
> a normal amino acids residue means the alpha carbon, so the element is
> carbon (although in a HETATM there is a possibility it is Calcium I think).
> So I think it would be possible to infer the element in many cases (but not
> all). However, this is going to be a reasonable amount of work to write and
> test.


For non-HETATMs it's possible from the first letter of the atom name (or it
is H if the first letter is a digit). For HETATMs, names match elements
IIRC.

Do you think it's worth a try? It shouldn't be hard to write and the cases
where it would fail would be sporadic.


> How common are this kind of PDB file for the work you are doing - do
> many modelling packages omit the element?


> Have you contacted the program authors to request they include the
> element column in future?
>

Well... several packages do this, especially web servers. Contacting their
authors wouldn't bring many favourable answers IMO.


I've committed it here:
http://github.com/JoaoRodrigues/biopython/commit/29f48e8f97870530520884fa6b8c9b70d87ba8bc

I commented out the self.mass part since we're still working on it.

Best,

J

From biopython at maubp.freeserve.co.uk  Thu Jun 24 12:54:41 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Jun 2010 17:54:41 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
	<AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com>
	<AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com>
Message-ID: <AANLkTikhslYyt6mn8Gtd_WrWmNsEpYkuCSD8cuEtXXv9@mail.gmail.com>

On Thu, Jun 24, 2010 at 5:36 PM, João Rodrigues <anaryin at gmail.com> wrote:
>>
>> I wonder if using None or NAN would be better than zero here? Or just an
>> exception. This is difficult for me to say without a better idea of what
>> you will be using the atomic weights for.
>>
>
> Right now I'm just using them for the center of mass calculation.
>

Well if you don't know an atom's mass, you can't calculate the real
center of mass. Maybe this should throw an exception?

>> On a separate point, if you have an old fashioned PDB file without the
>> element column, you can probably work out the element anyway. ...
>
> From non HETATMs its possible from the first letter of the atom name (or it
> is H if the first letter is a digit). For HETATMs, names match elements
> IIRC.
>
> Do you think it's worth the try? It shouldn't be hard to write and the cases
> where it would fail would be sporadic.

Eric - what do you think?

>> How common are this kind of PDB file for the work you are doing - do
>> many modelling packages omit the element?
>
>
>> Have you contacted the program authors to request they include the
>> element column in future?
>>
>
> Well... several packages make this, specially webservers.. Contacting them
> authors wouldn't bring those many favourable answers IMO.

I'd ask politely anyway ;)

> I've commited it here:
> http://github.com/JoaoRodrigues/biopython/commit/29f48e8f97870530520884fa6b8c9b70d87ba8bc
>
> I commented out the self.mass part since we're still working on it.

I've cherry-picked that for the trunk - could you test the master branch
please (just to make sure this worked as you expected)?

Thanks,

Peter


From eric.talevich at gmail.com  Thu Jun 24 14:05:11 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 24 Jun 2010 14:05:11 -0400
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTikhslYyt6mn8Gtd_WrWmNsEpYkuCSD8cuEtXXv9@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com> 
	<AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com> 
	<AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com> 
	<AANLkTikhslYyt6mn8Gtd_WrWmNsEpYkuCSD8cuEtXXv9@mail.gmail.com>
Message-ID: <AANLkTimvyJSycfL2707i-DyizcE5xcJ950duGV8MqiSt@mail.gmail.com>

On Thu, Jun 24, 2010 at 12:54 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Thu, Jun 24, 2010 at 5:36 PM, João Rodrigues <anaryin at gmail.com> wrote:
> >>
> >> I wonder if using None or NAN would be better than zero here? Or just an
> >> exception. This is difficult for me to say without a better idea of what
> >> you will be using the atomic weights for.
> >>
> >
> > Right now I'm just using them for the center of mass calculation.
> >
>
> Well if you don't know an atom's mass, you can't calculate the real
> center of mass. Maybe this should throw an exception?
>

And the center of mass calculation was for coarse-graining structures,
right? What would be most useful there?

(a) Give unknown atoms a weight of 0.0, so CoM essentially disregards them
(b) Give unknown atoms a weight of None, and have CoM check for this and
disregard those atoms (similar effect) -- preferably issuing a warning
(c) Like (b), but CoM raises an exception
(d) Give CoM a keyword argument for how to treat this (e.g.
strict=True/False), so coarse-graining can be permissive but direct use of
CoM can raise an exception if desired. (However, if warnings are used then
the warnings module already lets you convert specific warnings into
exceptions.)
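
The point about the warnings module can be sketched as follows. The warning
class and both function names are invented for this example; only the
warnings calls themselves are standard library.

```python
import warnings


class MissingMassWarning(UserWarning):
    """Hypothetical warning category for atoms without a known mass."""


def total_mass(masses):
    """Sum the known masses, warning about (and skipping) unknown ones."""
    total = 0.0
    for mass in masses:
        if mass is None:
            warnings.warn("Atom without a known mass skipped",
                          MissingMassWarning)
            continue
        total += mass
    return total


def total_mass_strict(masses):
    """Same calculation, but a missing mass becomes an exception."""
    with warnings.catch_warnings():
        # Turn only this warning category into an error, and only here
        warnings.simplefilter("error", MissingMassWarning)
        return total_mass(masses)
```

So option (b) plus a filter gives callers the behaviour of option (c) without
a separate strict argument.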


 >> On a separate point, if you have an old fashioned PDB file without the
> >> element column, you can probably work out the element anyway. ...
> >
> > From non HETATMs its possible from the first letter of the atom name (or
> it
> > is H if the first letter is a digit). For HETATMs, names match elements
> > IIRC.
> >
> > Do you think it's worth the try? It shouldn't be hard to write and the
> cases
> > where it would fail would be sporadic.
>
> Eric - what do you think?
>

Sounds useful to me. Where would it fail, and how should failures be
treated? Unrecognized atom names, and then issue a warning and leave the
element attribute blank? (See options above...)

Cheers,
Eric


From anaryin at gmail.com  Thu Jun 24 14:25:45 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 24 Jun 2010 13:25:45 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTimvyJSycfL2707i-DyizcE5xcJ950duGV8MqiSt@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com> 
	<AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com> 
	<AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com> 
	<AANLkTikhslYyt6mn8Gtd_WrWmNsEpYkuCSD8cuEtXXv9@mail.gmail.com> 
	<AANLkTimvyJSycfL2707i-DyizcE5xcJ950duGV8MqiSt@mail.gmail.com>
Message-ID: <AANLkTin26e0xe-fpkJ-uBUVRT3zuyh_vDYoXquK2VOL6@mail.gmail.com>

>
> And the center of mass calculation was for coarse-graining structures,
> right? What would be most useful there?
>

> (a) Give unknown atoms a weight of 0.0, so CoM essentially disregards them
>

CoM takes the number of atoms into account, so 0.0 will not work anyway.


>  (b) Give unknown atoms a weight of None, and have CoM check for this and
> disregard those atoms (similar effect) -- preferably issuing a warning
>

I'd prefer this: exclude those atoms from the calculation. But then this
might affect the location of the center of mass.


> (c) Like (b), but CoM raises an exception
> (d) Give CoM a keyword argument for how to treat this (e.g.
> strict=True/False), so course-graining can be permissive but direct use of
> CoM can raise an exception if desired. (However, if warnings are used then
> the warnings module already lets you convert specific warnings into
> exceptions.)
>

My suggestion: CoM can be either geometric or mass-weighted. The first
assumes equal mass for every atom, the second does not. If a mass is
missing, the CoM would default to geometric and issue a warning.
Having a flag in CoM can also be valuable but I guess this would be
redundant with the warning/exception (permissive/strict) in the Atom class.
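
A minimal sketch of that fallback, assuming coordinates are plain (x, y, z)
tuples and an unknown mass is stored as None. The function name and signature
are only for illustration and are not the code in my branch.

```python
import warnings


def center_of_mass(coords, masses):
    """Mass-weighted center, or the geometric center if a mass is unknown."""
    if any(mass is None for mass in masses):
        warnings.warn("Unknown atomic mass, using the geometric center")
        # Equal weights turn the weighted mean into a plain average
        masses = [1.0] * len(coords)
    total = float(sum(masses))
    return tuple(
        sum(mass * xyz[axis] for mass, xyz in zip(masses, coords)) / total
        for axis in range(3)
    )


coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
weighted = center_of_mass(coords, [1.0, 3.0])
```

Here the heavier second atom pulls the weighted center towards x = 2.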


>
>
>  >> On a separate point, if you have an old fashioned PDB file without the
>> >> element column, you can probably work out the element anyway. ...
>> >
>> > From non HETATMs its possible from the first letter of the atom name (or
>> it
>> > is H if the first letter is a digit). For HETATMs, names match elements
>> > IIRC.
>> >
>> > Do you think it's worth the try? It shouldn't be hard to write and the
>> cases
>> > where it would fail would be sporadic.
>>
>> Eric - what do you think?
>>
>
> Sounds useful to me. Where would it fail, and how should failures be
> treated? Unrecognized atom names, and then issue a warning and leave the
> element attribute blank? (See options above...)
>

I'd implement it in the Atom class. Instead of having this check (lines
75-76):

        elif len(element)>2 or element != element.upper() or \
                element != element.strip():
            raise ValueError(element)

there would be a check against IUPACData.atom_weights.keys(). If the element
is not found, then it would try to check the atom name and issue a warning.
If this fails, an exception is thrown.

Sounds good?

Best!

J

From anaryin at gmail.com  Thu Jun 24 16:25:23 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 24 Jun 2010 15:25:23 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTin26e0xe-fpkJ-uBUVRT3zuyh_vDYoXquK2VOL6@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com> 
	<AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com> 
	<AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com> 
	<AANLkTikhslYyt6mn8Gtd_WrWmNsEpYkuCSD8cuEtXXv9@mail.gmail.com> 
	<AANLkTimvyJSycfL2707i-DyizcE5xcJ950duGV8MqiSt@mail.gmail.com> 
	<AANLkTin26e0xe-fpkJ-uBUVRT3zuyh_vDYoXquK2VOL6@mail.gmail.com>
Message-ID: <AANLkTinQ9E9wekA7ZttIGavZDDIjACJhJ18QaqB2Ra83@mail.gmail.com>

Ok, I was looking at the element assignment and there's a slight problem. I
thought I could easily tell whether the atom comes from an ATOM or a HETATM
record, but since the "parenting" of the Atom is only done *after* the Atom is
created, there is no way (as is) of knowing where it comes from. Therefore, I
thought of the following workaround: *hetero_flag* is already defined when the
Atom is created, so it could be passed to the Atom as another of its arguments.

It would then be a conditional like this inside the Atom class:

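if not element or element not in IUPACData.atom_weights:
    # Element missing or unrecognised: try to recover it from the atom name
    if hetatm:
        # HETATM atom names are assumed to match the element symbol
        if atom.name in IUPACData.atom_weights:
            element = atom.name
        else:
            element = "?"
    else:  # Not HETATM
        # First character, or the second if the name starts with a digit
        t_element = atom.name[0] if not atom.name[0].isdigit() else atom.name[1]
        if t_element in IUPACData.atom_weights:
            element = t_element
        else:
            element = "?"
# Otherwise the element was given and is in IUPACData, so it is kept as it is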

The advantage is that either if you don't give an element or if it fails the
IUPACData check, it will try to recover it from the atom name.

It also makes it possible to throw an exception when the element is not
found. Or a warning since for now, only the CoM function uses it and it has
a failsafe against it (defaults to geometrical).
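
As a self-contained version of the same rule that can be run without
Biopython, here is a sketch using a small hand-written set of upper case
element symbols in place of IUPACData. The set, the function name and its
arguments are all assumptions for the example. It also skips any number of
leading digits rather than just one.

```python
# Hypothetical stand-in for the element table, upper case as in PDB files
KNOWN_ELEMENTS = {"H", "C", "N", "O", "S", "P", "CA", "FE", "ZN", "MG"}


def guess_element(name, hetero, known=KNOWN_ELEMENTS):
    """Guess an element from a PDB atom name, or return "?"."""
    name = name.strip().upper()
    if hetero:
        # HETATM: the whole atom name is taken as the element symbol
        candidate = name
    else:
        # ATOM: first letter, after dropping any leading digits
        candidate = name.lstrip("0123456789")[:1]
    return candidate if candidate in known else "?"
```

Under this rule "CA" gives carbon in an ATOM record and calcium in a HETATM
record, while a ligand atom named "O1" in a HETATM record comes back as "?",
which is one of the cases where the guess fails.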

Opinions?

João


From bugzilla-daemon at portal.open-bio.org  Fri Jun 25 07:49:35 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Jun 2010 07:49:35 -0400
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing,
	in particular location parsing
In-Reply-To: <bug-2738-42@http.bugzilla.open-bio.org/>
Message-ID: <201006251149.o5PBnZpA007121@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1327 is|0                           |1
           obsolete|                            |



From bugzilla-daemon at portal.open-bio.org  Fri Jun 25 07:51:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Jun 2010 07:51:16 -0400
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing,
	in particular location parsing
In-Reply-To: <bug-2738-42@http.bugzilla.open-bio.org/>
Message-ID: <201006251151.o5PBpGE9007286@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1329 is|0                           |1
           obsolete|                            |


------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-25 07:51 EST -------
(From update of attachment 1329)
I've got a branch using regular expressions which seems to cover all the
location strings I've found in testing. It is at least twice the speed of the
old parser.

http://github.com/peterjc/biopython/tree/location-parsing2



From biopython at maubp.freeserve.co.uk  Fri Jun 25 11:21:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Jun 2010 16:21:46 +0100
Subject: [Biopython-dev] Re-written GenBank/EMBL feature location parsing
Message-ID: <AANLkTikgejWghQbe4LJnx82u7sCEi2A911O3BIg6JijW@mail.gmail.com>

Hi all,

I've been working on and off recently on rewriting the location
parsing for GenBank/EMBL features:
http://bugzilla.open-bio.org/show_bug.cgi?id=2738

I have a branch ready for public testing,
http://github.com/peterjc/biopython/commits/location-parsing2

The old code is still there (and indeed right now gets used as a fall
back with a warning if an unrecognised location is seen). I'd like to
label it (plus Bio.Parsers and Bio.Parsers.spark) as obsolete for the
next release, and then deprecate them in the subsequent release.

The old code takes each location string, parses it with SPARK and
generates a set of token objects for each element (see the code in
Bio.GenBank.LocationParser) and then turns that into SeqFeature
location and position objects. All this object creation is probably a
major reason why the old code is slow.

The new code takes each location string, and parses it with a mix
of regular expressions and simple Python code, and then builds
the SeqFeature location and position objects. On my tests this is
at least twice as fast, typically between three and four times faster.
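
To give a flavour of the approach, here is a much simplified sketch that
handles only plain "start..end" locations and their complement. It is not the
code in the branch, which has to cope with many more variants; the function
name and the returned tuple are assumptions for the example.

```python
import re

_RANGE = r"(\d+)\.\.(\d+)"
_SIMPLE_RE = re.compile(r"^%s$" % _RANGE)
_COMPLEMENT_RE = re.compile(r"^complement\(%s\)$" % _RANGE)


def parse_simple_location(text):
    """Return (start, end, strand) using Python style zero-based start."""
    strand = 1
    match = _SIMPLE_RE.match(text)
    if match is None:
        match = _COMPLEMENT_RE.match(text)
        strand = -1
    if match is None:
        raise ValueError("Unsupported location: %r" % text)
    start, end = int(match.group(1)), int(match.group(2))
    # GenBank counts from one and includes the end position
    return start - 1, end, strand
```

Anything the regular expressions do not recognise raises an exception here,
whereas the branch falls back to the old parser with a warning.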

The intention is this parser change will result in no functional
changes at all.

As part of this work I have been extending the feature unit tests,
and have also run some more extensive additional tests locally
(GenBank files for plants, viruses, environmental samples etc).
I'm reasonably sure this covers all the location variants... but
with GenBank and EMBL files you can never be sure ;)

Would anyone like to volunteer to test the new branch before
I merge it to the trunk? I'm also interested in comments on the
code itself. Note I have tried to avoid any refactoring until the
old code is actually deprecated.

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org  Fri Jun 25 13:46:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Jun 2010 13:46:14 -0400
Subject: [Biopython-dev] [Bug 3103] New: Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
Message-ID: <bug-3103-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103

           Summary: Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in
                    Tests/PhyloXML
           Product: Biopython
           Version: 1.54
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: minor
          Priority: P5
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: vimalkumarvelayudhan at gmail.com


I created an RPM recently for Biopython version 1.54 and got this error from
rpmlint

python-biopython.i586: W: unable-to-read-zip /usr/share/python-biopython/Tests/PhyloXML/ncbi_taxonomy_mollusca.xml.zip: Bad magic number for central directory

This appears for both the .tar.gz and the .zip version. I could do a manual
unzip of the file though.



From bugzilla-daemon at portal.open-bio.org  Sun Jun 27 11:31:11 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 27 Jun 2010 11:31:11 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To: <bug-3103-42@http.bugzilla.open-bio.org/>
Message-ID: <201006271531.o5RFVBTP001043@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103


------- Comment #1 from eric.talevich at gmail.com  2010-06-27 11:31 EST -------
Interesting. Where did you get this release of Biopython 1.54? From PyPI, or
GitHub?

I downloaded this file from phyloxml.org originally, and haven't changed it.
This file is used in the unit tests, and Python's zipfile library doesn't seem
to have any trouble opening it. The 'file' command on Ubuntu 10.04 identifies
it as:
"Zip archive data, at least v2.0 to extract"
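
For anyone wanting to repeat the check from Python, something along these
lines should do. It is a sketch using only the standard zipfile module; the
helper name is made up, and it accepts either a filename or an open binary
file.

```python
import io
import zipfile


def zip_is_readable(source):
    """Return True if source is a zip archive whose members all pass CRC."""
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        # testzip() returns the first bad member name, or None if all are OK
        return archive.testzip() is None


# Small in-memory archive so the example runs without any files on disk
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.xml", "<phyloxml/>")
in_memory_ok = zip_is_readable(buffer)
```

For the real file, pass the path to ncbi_taxonomy_mollusca.xml.zip instead of
the in-memory buffer.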

It's actually not a very important part of the unit tests anyway, so if it's
causing you trouble, I could give you a patch to remove this file from the unit
tests.

(If you're taking patches, there's a bug in Bio.Phylo's Nexus parsing that I'd
like to include a fix for, too. It's fixed in Biopython's trunk already, but
slipped past our release process for v.1.54.)



From bugzilla-daemon at portal.open-bio.org  Sun Jun 27 12:45:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 27 Jun 2010 12:45:28 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To: <bug-3103-42@http.bugzilla.open-bio.org/>
Message-ID: <201006271645.o5RGjSBd019564@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103


------- Comment #2 from vimalkumarvelayudhan at gmail.com  2010-06-27 12:45 EST -------
The archives were downloaded from 
http://biopython.org/DIST/biopython-1.54.tar.gz
http://biopython.org/DIST/biopython-1.54.zip

I could remove the zip file during the build process and can also patch the
Phylo.Nexus for the next release if you could forward it to me.





From biopython at maubp.freeserve.co.uk  Sun Jun 27 18:21:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 27 Jun 2010 23:21:43 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
Message-ID: <AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>

On Wed, Jun 23, 2010 at 11:28 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> From some unit test output posted by Manabu Ishii via Twitter I
> think the test suite is having problems checking for external tools
> on non-English operating systems (e.g. Debian in Japanese):
> http://d.hatena.ne.jp/manabou/20100619
> http://twitter.com/manabou
>
> I've tried to update a few to do a better job (test_Muscle_tool.py,
> test_Clustalw_tool.py and test_Emboss.py), but what I really need
> is someone to run the test suite on a non English system - ideally
> without all these command line tools installed. The tests should
> notice when the tool is missing, and be skipped without errors.
>
> Could anyone with a non-English OS try running the latest code
> from git (or even the latest release) to see if you get similar
> problems?

I've also included an idea from Manabu Ishii: setting the environment
variable LANG=C so that external tools give their default (US English)
output. This should work on Linux etc., and is probably harmless on Windows.
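
As a rough sketch of the idea (not the actual code in the test suite), a
tool can be launched with LANG forced to "C" like this. The helper names
c_locale_env and run_tool are made up for illustration.

```python
import os
import subprocess
import sys


def c_locale_env(base=None):
    """Return a copy of the environment with LANG forced to "C".

    Hypothetical helper. Note that LC_ALL or LC_MESSAGES, if set,
    take precedence over LANG and are left untouched here.
    """
    env = dict(os.environ if base is None else base)
    env["LANG"] = "C"
    return env


def run_tool(args):
    """Run an external command with LANG=C.

    Returns (return code, stdout text, stderr text).
    """
    child = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=c_locale_env(),
        universal_newlines=True,  # decode output as text
    )
    out, err = child.communicate()
    return child.returncode, out, err
```

A test could then look for the tool's English "not found" or usage message
in the returned text, whatever the user's own locale is.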

Again, testing would be most welcome (any non-English OS),

Thanks

Peter

From bugzilla-daemon at portal.open-bio.org  Mon Jun 28 08:23:25 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Jun 2010 08:23:25 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To: <bug-3103-42@http.bugzilla.open-bio.org/>
Message-ID: <201006281223.o5SCNPog015539@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103


------- Comment #3 from eric.talevich at gmail.com  2010-06-28 08:23 EST -------
Created an attachment (id=1517)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1517&action=view)
Patch to remove ncbi_taxonomy_mollusca.xml.zip from the Phylo unit test

This patch should fix the problem reported in Bug 3103. Created with git
format-patch.



From bugzilla-daemon at portal.open-bio.org  Mon Jun 28 08:25:20 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Jun 2010 08:25:20 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To: <bug-3103-42@http.bugzilla.open-bio.org/>
Message-ID: <201006281225.o5SCPKo9015639@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103


------- Comment #4 from eric.talevich at gmail.com  2010-06-28 08:25 EST -------
Created an attachment (id=1518)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1518&action=view)
Patch to fix a bug in NexusIO

This patch fixes another bug in NexusIO, in parsing the support values on
branches.



From k.okonechnikov at gmail.com  Mon Jun 28 13:55:30 2010
From: k.okonechnikov at gmail.com (Konstantin Okonechnikov)
Date: Tue, 29 Jun 2010 00:55:30 +0700
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
Message-ID: <AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>

Peter,
I have built and run the latest code from git on Russian Ubuntu 10.04.
The Entrez tests failed. The Muscle, Clustal and EMBOSS tests were skipped
successfully.
The tests were run from the build.py script, and I am not sure how to
generate a test report. Redirecting the script output to a file didn't help.




-- 
Best regards,
        Konstantin

From biopython at maubp.freeserve.co.uk  Tue Jun 29 05:57:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 29 Jun 2010 10:57:27 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
Message-ID: <AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>

On Mon, Jun 28, 2010 at 6:55 PM, Konstantin Okonechnikov
<k.okonechnikov at gmail.com> wrote:
> Peter,
> I have built and run the latest code from git on Russian Ubuntu 10.04.

Thank you,

> Entrez tests have failed.

That can happen due to network problems. I'd like to see the error though.

> Muscle, clustal and emboss tests have been skipped successfully.

Good :)

> The tests have been executed from build.py script and I am not sure how to
> generate test report. Redirecting the script output to file didn't help.

I normally just run "python setup.py test" from the source directory or
"python run_tests.py" from the Tests subdirectory at the terminal, and
copy and paste the interesting bits of the output.

If you want to capture the test output to a file, you should probably redirect
both stdout and stderr:

python run_tests.py &> output.txt
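
Note that "&>" is a bash extension; the portable shell spelling is
"> output.txt 2>&1". If the shell is the obstacle, the same thing can be
done from Python. This is only a sketch, and the helper name run_and_log
is made up for illustration.

```python
import subprocess
import sys


def run_and_log(args, log_path):
    """Run `args`, writing stdout and stderr together to `log_path`.

    Hypothetical helper; returns the command's exit status.
    """
    with open(log_path, "wb") as log:
        # stderr=subprocess.STDOUT merges the error stream into stdout,
        # which is the effect of "> file 2>&1" in the shell.
        return subprocess.call(args, stdout=log, stderr=subprocess.STDOUT)
```

From the Tests subdirectory this would be called as
run_and_log([sys.executable, "run_tests.py"], "output.txt").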

Regards,

Peter

From bugzilla-daemon at portal.open-bio.org  Tue Jun 29 15:08:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 29 Jun 2010 15:08:45 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To: <bug-3103-42@http.bugzilla.open-bio.org/>
Message-ID: <201006291908.o5TJ8j66032031@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103


vimalkumarvelayudhan at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #5 from vimalkumarvelayudhan at gmail.com  2010-06-29 15:08 EST -------
Thank you. RPMs have been packaged with the patches applied and can be found at
http://download.opensuse.org/repositories/science:/vlinux/



From k.okonechnikov at gmail.com  Tue Jun 29 23:27:20 2010
From: k.okonechnikov at gmail.com (Konstantin Okonechnikov)
Date: Wed, 30 Jun 2010 10:27:20 +0700
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
Message-ID: <AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>

Peter,
actually the problems with the Entrez tests are Unicode related.
I suspect the test failures are caused by the current working directory
path: it contains a non-English word, so it cannot be represented
as an ASCII string.
There are similar problems with the GenBank to BioSQL tests.
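
If that guess is right, one possible workaround is to decode any byte
string to Unicode text before binding it as an SQL parameter. The sketch
below is not a patch for BioSQL: the helper names to_text and store_name,
the table name, and the example path are all made up for illustration.

```python
import sqlite3
import sys


def to_text(value, encoding=None):
    """Decode a byte string to text; pass text through unchanged.

    Hypothetical helper. Falls back to the filesystem encoding, which
    is an assumption about where the bytes came from (e.g. a path).
    """
    if isinstance(value, bytes):
        return value.decode(encoding or sys.getfilesystemencoding())
    return value


def store_name(connection, name, encoding=None):
    """Insert `name` as text via a bound parameter and read it back."""
    connection.execute("CREATE TABLE IF NOT EXISTS db (name TEXT)")
    connection.execute(
        "INSERT INTO db (name) VALUES (?)", (to_text(name, encoding),)
    )
    row = connection.execute(
        "SELECT name FROM db ORDER BY rowid DESC LIMIT 1"
    ).fetchone()
    return row[0]
```

With an in-memory database, store_name(sqlite3.connect(":memory:"),
raw_path, "utf-8") should return the path as text rather than raising the
error shown in the attached log.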

Please see the attached error log.



-- 
Best regards,
        Konstantin
-------------- next part --------------
running test
test_Ace ... ok
test_AlignIO ... ok
test_AlignIO_convert ... ok
test_BioSQL ... FAIL
test_BioSQL_SeqIO ... /home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/Loader.py:797: UserWarning: order location operators are not fully supported
  % feature.location_operator)
/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/Loader.py:797: UserWarning: bond location operators are not fully supported
  % feature.location_operator)
ok
test_CAPS ... ok
test_Clustalw ... ok
test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you want to use Bio.Clustalw.
test_Cluster ... ok
test_CodonTable ... ok
test_CodonUsage ... ok
test_Compass ... ok
test_Crystal ... ok
test_Dialign_tool ... skipping. Install DIALIGN2-2 if you want to use the Bio.Align.Applications wrapper.
test_DocSQL ... skipping. Install MySQLdb if you want to use Bio.DocSQL.
test_Emboss ... skipping. Install EMBOSS if you want to use Bio.Emboss.
test_EmbossPhylipNew ... skipping. Install the Emboss package 'PhylipNew' if you want to use the Bio.Emboss.Applications wrappers for phylogenetic tools.
test_EmbossPrimer ... ok
test_Entrez ... FAIL
test_Enzyme ... ok
test_FSSP ... ok
test_Fasta ... ok
test_File ... ok
test_GACrossover ... ok
test_GAMutation ... ok
test_GAOrganism ... ok
test_GAQueens ... ok
test_GARepair ... ok
test_GASelection ... ok
test_GFF ... skipping. Environment is not configured for this test (not important if you do not plan to use Bio.GFF).
test_GFF2 ... skipping. Install MySQLdb if you want to use Bio.GFF.
test_GenBank ... ok
test_GenomeDiagram ... skipping. Install reportlab if you want to use Bio.Graphics.
test_GraphicsBitmaps ... skipping. Install ReportLab if you want to use Bio.Graphics.
test_GraphicsChromosome ... skipping. Install reportlab if you want to use Bio.Graphics.
test_GraphicsDistribution ... skipping. Install reportlab if you want to use Bio.Graphics.
test_GraphicsGeneral ... skipping. Install reportlab if you want to use Bio.Graphics.
test_HMMCasino ... ok
test_HMMGeneral ... ok
test_HotRand ... ok
test_IsoelectricPoint ... ok
test_KDTree ... ok
test_KEGG ... ok
test_KeyWList ... ok
test_Location ... ok
test_LocationParser ... ok
test_LogisticRegression ... ok
test_MEME ... ok
test_Mafft_tool ... skipping. Install MAFFT if you want to use the Bio.Align.Applications wrapper.
test_MarkovModel ... ok
test_Medline ... ok
test_Motif ... ok
test_Muscle_tool ... skipping. Install MUSCLE if you want to use the Bio.Align.Applications wrapper.
test_NCBIStandalone ... ok
test_NCBITextParser ... ok
test_NCBIXML ... ok
test_NCBI_BLAST_tools ... skipping. Install the NCBI BLAST+ command line tools if you want to use the Bio.Blast.Applications wrapper.
test_NCBI_qblast ... ok
test_NNExclusiveOr ... ok
test_NNGene ... ok
test_NNGeneral ... ok
test_Nexus ... ok
test_PDB ... ok
test_ParserSupport ... ok
test_Pathway ... ok
test_Phd ... ok
test_Phylo ... ok
test_PhyloXML ... ok
test_Phylo_depend ... skipping. Install NetworkX if you want to use Bio.Phylo._utils.
test_PopGen_FDist ... skipping. Install FDist if you want to use Bio.PopGen.FDist.
test_PopGen_FDist_nodepend ... ok
test_PopGen_GenePop ... skipping. Install GenePop if you want to use Bio.PopGen.GenePop.
test_PopGen_GenePop_EasyController ... skipping. Install GenePop if you want to use Bio.PopGen.GenePop.
test_PopGen_GenePop_nodepend ... ok
test_PopGen_SimCoal ... skipping. Install SIMCOAL2 if you want to use Bio.PopGen.SimCoal.
test_PopGen_SimCoal_nodepend ... ok
test_Prank_tool ... skipping. Install PRANK if you want to use the Bio.Align.Applications wrapper.
test_Probcons_tool ... skipping. Install PROBCONS if you want to use the Bio.Align.Applications wrapper.
test_ProtParam ... ok
test_Restriction ... ok
test_SCOP_Astral ... ok
test_SCOP_Cla ... ok
test_SCOP_Des ... ok
test_SCOP_Dom ... ok
test_SCOP_Hie ... ok
test_SCOP_Raf ... ok
test_SCOP_Residues ... ok
test_SCOP_Scop ... ok
test_SVDSuperimposer ... ok
test_SeqIO ... ok
test_SeqIO_FastaIO ... ok
test_SeqIO_QualityIO ... ok
test_SeqIO_convert ... ok
test_SeqIO_features ... ok
test_SeqIO_index ... ok
test_SeqIO_online ... ok
test_SeqRecord ... ok
test_SeqUtils ... ok
test_Seq_objs ... ok
test_SubsMat ... ok
test_SwissProt ... ok
test_TCoffee_tool ... skipping. Install TCOFFEE if you want to use the Bio.Align.Applications wrapper.
test_UniGene ... ok
test_UniGene_obsolete ... ok
test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise.
test_align ... ok
test_geo ... ok
test_interpro ... ok
test_kNN ... ok
test_lowess ... ok
test_pairwise2 ... ok
test_prodoc ... ok
test_property_manager ... ok
test_prosite1 ... ok
test_prosite2 ... ok
test_prosite_patterns ... ok
test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise.
test_seq ... ok
test_translate ... ok
test_trie ... ok
test_triefind ... ok
Bio.Application docstring test ... ok
Bio.Seq docstring test ... ok
Bio.SeqFeature docstring test ... ok
Bio.SeqRecord docstring test ... ok
Bio.SeqIO docstring test ... ok
Bio.SeqIO.AceIO docstring test ... ok
Bio.SeqIO.PhdIO docstring test ... ok
Bio.SeqIO.QualityIO docstring test ... ok
Bio.SeqIO.SffIO docstring test ... ok
Bio.SeqUtils docstring test ... ok
Bio.Align docstring test ... ok
Bio.Align.Generic docstring test ... ok
Bio.AlignIO docstring test ... ok
Bio.AlignIO.StockholmIO docstring test ... ok
Bio.Blast.Applications docstring test ... ok
Bio.Clustalw docstring test ... ok
Bio.Emboss.Applications docstring test ... ok
Bio.KEGG.Compound docstring test ... ok
Bio.KEGG.Enzyme docstring test ... ok
Bio.Wise docstring test ... FAIL
Bio.Wise.psw docstring test ... ok
Bio.Motif docstring test ... ok
Bio.Statistics.lowess docstring test ... ok
======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, NC_000932.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 423, in test_NC_000932
    self.loop(os.path.join(os.getcwd(), "GenBank", "NC_000932.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, NC_005816.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 419, in test_NC_005816
    self.loop(os.path.join(os.getcwd(), "GenBank", "NC_005816.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, NT_019265.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 427, in test_NT_019265
    self.loop(os.path.join(os.getcwd(), "GenBank", "NT_019265.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, arab1.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 447, in test_arab1
    self.loop(os.path.join(os.getcwd(), "GenBank", "arab1.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, cor6_6.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 443, in test_cor6_6
    self.loop(os.path.join(os.getcwd(), "GenBank", "cor6_6.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, noref.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 435, in test_no_ref
    self.loop(os.path.join(os.getcwd(), "GenBank", "noref.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, one_of.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 439, in test_one_of
    self.loop(os.path.join(os.getcwd(), "GenBank", "one_of.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, protein_refseq2.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 431, in test_protein_refseq2
    self.loop(os.path.join(os.getcwd(), "GenBank", "protein_refseq2.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, NC_000932.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 496, in test_NC_000932
    self.trans(os.path.join(os.getcwd(), "GenBank", "NC_000932.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, NC_005816.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 492, in test_NC_005816
    self.trans(os.path.join(os.getcwd(), "GenBank", "NC_005816.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, NT_019265.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 500, in test_NT_019265
    self.trans(os.path.join(os.getcwd(), "GenBank", "NT_019265.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, arab1.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 520, in test_arab1
    self.trans(os.path.join(os.getcwd(), "GenBank", "arab1.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, cor6_6.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 516, in test_cor6_6
    self.trans(os.path.join(os.getcwd(), "GenBank", "cor6_6.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, noref.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 508, in test_no_ref
    self.trans(os.path.join(os.getcwd(), "GenBank", "noref.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, one_of.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 512, in test_one_of
    self.trans(os.path.join(os.getcwd(), "GenBank", "one_of.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, protein_refseq2.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 504, in test_protein_refseq2
    self.trans(os.path.join(os.getcwd(), "GenBank", "protein_refseq2.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
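
A note on the five BioSQL errors above: they all stop at the same cursor.execute() call, where the sqlite3 driver refuses a byte string containing non-ASCII bytes and recommends switching to Unicode strings. Below is a minimal sketch of that approach in modern Python 3 (not the Python 2.6 of this log). The helper name to_text, the cut-down biodatabase table and the sample name are assumptions for illustration, not BioSQL's actual code or schema.

```python
import sqlite3


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper: the encoding is an assumption, not something
    the log tells us.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# In-memory database with a simplified, illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE biodatabase (name TEXT, authority TEXT, description TEXT)"
)

# A database name that arrives as UTF-8 bytes (hypothetical sample value).
raw_name = "\u0442\u0435\u0441\u0442".encode("utf-8")

# Decode every parameter to text before binding it to the statement.
params = tuple(to_text(v) for v in (raw_name, None, None))
conn.execute(
    "INSERT INTO biodatabase (name, authority, description) VALUES (?, ?, ?)",
    params,
)

stored = conn.execute("SELECT name FROM biodatabase").fetchone()[0]
```

The point of the sketch is only that the decoding happens once, at the boundary, so the driver never sees raw 8-bit data.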

======================================================================
ERROR: Test parsing XML returned by EFetch, Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3451, in test_journals
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)
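
A note on this Entrez error and the ones that follow: "/home/okko/" is 11 characters long, so byte 0xd1 at position 11 is the first byte of the non-ASCII directory name in the build path. The failure at `path += '/' + b` is what Python 2 does when a byte-string path is joined with a Unicode file name and has to be decoded implicitly as ASCII. Below is a minimal sketch, in modern Python 3, of coercing both parts to text before joining. The function name join_dtd_path, the directory name and the DTD file name are assumptions for illustration; this is not the code in Bio/Entrez/Parser.py.

```python
import os


def join_dtd_path(dtd_dir, filename):
    """Join a directory and a file name after coercing both to text.

    os.fsdecode() accepts str or bytes and decodes bytes with the
    filesystem encoding, so mixed inputs no longer clash.
    """
    return os.path.join(os.fsdecode(dtd_dir), os.fsdecode(filename))


# Hypothetical stand-in for the build path in the log: the real
# directory name was lost in transit and appears only as "??????".
dtd_dir = "/home/okko/\u0442\u0435\u0441\u0442/DTDs".encode("utf-8")

# The file name arrives as text, as it would from the XML parser.
path = join_dtd_path(dtd_dir, "esearch.dtd")
```

The same effect could be had by encoding the file name to bytes instead; what matters is that both arguments to os.path.join() are of one type.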

======================================================================
ERROR: Test parsing XML returned by EFetch, Nucleotide database (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3893, in test_nucleotide1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, Protein database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 4045, in test_nucleotide2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, OMIM database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3607, in test_omim
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, PubMed database (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3034, in test_pubmed1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, PubMed database (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3237, in test_pubmed2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, Taxonomy database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3784, in test_taxonomy
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML output returned by EGQuery (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2706, in test_egquery1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML output returned by EGQuery (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2858, in test_egquery2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing database list returned by EInfo
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 26, in test_list
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing database info returned by EInfo
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 72, in test_pubmed
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing cancerchromosomes links returned by ELink
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2690, in test_cancerchromosomes
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing medline indexed articles returned by ELink
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 1965, in test_medline
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing Nucleotide to Protein links returned by ELink
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 1239, in test_nucleotide
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 934, in test_pubmed1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 1253, in test_pubmed2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed link returned by ELink (third test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2404, in test_pubmed3
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (fourth test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2431, in test_pubmed4
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (fifth test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2499, in test_pubmed5
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (sixth test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2669, in test_pubmed6
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EPost
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 535, in test_epost
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EPost with an invalid id (overflow tag)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 553, in test_invalid
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EPost with incorrect arguments
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 545, in test_wrong
    self.assertRaises(RuntimeError, Entrez.read, handle)
  File "/usr/lib/python2.6/unittest.py", line 336, in failUnlessRaises
    callableObj(*args, **kwargs)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from the Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 322, in test_journals
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch when no items were found
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 502, in test_notfound
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from the Nucleotide database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 444, in test_nucleotide
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from PubMed Central
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 366, in test_pmc
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from the Protein database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 479, in test_protein
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from PubMed (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 107, in test_pubmed1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from PubMed (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 136, in test_pubmed2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from PubMed (third test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 289, in test_pubmed3
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML output returned by ESpell
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3013, in test_espell
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 653, in test_journals
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the Nucleotide database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 766, in test_nucleotide
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the Protein database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 727, in test_protein
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from PubMed
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 576, in test_pubmed
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the Structure database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 805, in test_structure
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the Taxonomy database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 855, in test_taxonomy
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the UniSTS database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 895, in test_unists
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary with incorrect arguments
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 921, in test_wrong
    self.assertRaises(RuntimeError, Entrez.read, handle)
  File "/usr/lib/python2.6/unittest.py", line 336, in failUnlessRaises
    callableObj(*args, **kwargs)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
FAIL: Doctest: Bio.Wise._build_align_cmdline
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.6/doctest.py", line 2152, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for Bio.Wise._build_align_cmdline
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 23, in _build_align_cmdline

----------------------------------------------------------------------
File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 26, in Bio.Wise._build_align_cmdline
Failed example:
    _build_align_cmdline(["dnal"], ("seq1.fna", "seq2.fna"), "/tmp/output", kbyte=100000)
Expected:
    'dnal -kbyte 100000 seq1.fna seq2.fna > /tmp/output'
Got:
    'dnal -kbyte 100000 -quiet seq1.fna seq2.fna > /tmp/output'
----------------------------------------------------------------------
File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 28, in Bio.Wise._build_align_cmdline
Failed example:
    _build_align_cmdline(["psw"], ("seq1.faa", "seq2.faa"), "/tmp/output_aa")
Expected:
    'psw -kbyte 300000 seq1.faa seq2.faa > /tmp/output_aa'
Got:
    'psw -kbyte 300000 -quiet seq1.faa seq2.faa > /tmp/output_aa'


----------------------------------------------------------------------
Ran 144 tests in 192.676 seconds

FAILED (failures = 3)

From biopython at maubp.freeserve.co.uk  Wed Jun 30 06:19:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 11:19:19 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
Message-ID: <AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>

On Wed, Jun 30, 2010 at 4:27 AM, Konstantin Okonechnikov
<k.okonechnikov at gmail.com> wrote:
> Peter,
> actually the problems with Entrez tools are Unicode related.
> I suppose that the test failures are related to the current working dir
> path: it contains a non-English word, so it cannot be represented
> as an ASCII string.
> There are similar problems with the GenBank to SQL tests.
>
> Please, see the error-log attached.

Thank you for the error log. Yes, there do seem to be problems
with having the source code under a unicode path. Could you
try moving the folder from /home/okko/??????/biopython to
/home/okko/biopython and repeat the test? That would help
confirm this hypothesis.
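
For reference, the failure mode can be sketched in a few lines. This is
only an illustration under assumptions: the directory name below is made
up (the real one is not visible in the log), and the code is written for
Python 3, where the decode has to be requested explicitly. Under
Python 2, os.path.join() performs the same ASCII decode implicitly when
a byte string and a unicode string are joined.

```python
# Hypothetical path: UTF-8 bytes for a Cyrillic directory name under
# /home/okko/. The real directory name is unknown; this one is invented.
raw_path = b"/home/okko/\xd1\x82\xd0\xb5\xd1\x81\xd1\x82/biopython"


def ascii_decode_error(raw):
    """Return the UnicodeDecodeError raised by an ASCII decode, or None."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as err:
        return err
    return None


err = ascii_decode_error(raw_path)
# err.start is the offset of the first non-ASCII byte; "/home/okko/" is
# 11 bytes long, so the first byte of the directory name is at offset 11.
print(err)
```

Decoding the same bytes with the file system encoding instead of ASCII
(raw_path.decode("utf-8") here) succeeds.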

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 30 08:47:14 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 13:47:14 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
	<AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
Message-ID: <AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>

On Wed, Jun 30, 2010 at 11:19 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 30, 2010 at 4:27 AM, Konstantin Okonechnikov
> <k.okonechnikov at gmail.com> wrote:
>> Peter,
>> actually the problems with Entrez tools are Unicode related.
>> I suppose that the test failures are related to the current working dir
>> path: it contains a non-English word, so it cannot be represented
>> as an ASCII string.
>> There are similar problems with the GenBank to SQL tests.
>>
>> Please, see the error-log attached.
>
> Thank you for the error log. Yes, there do seem to be problems
> with having the source code under a unicode path. Could you
> try moving the folder from /home/okko/??????/biopython to
> /home/okko/biopython and repeat the test? That would help
> confirm this hypothesis.

I created a similar directory name on my (English) version of
Mac OS X, and get the same Entrez failure.

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 30 09:05:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 14:05:53 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
	<AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
	<AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>
Message-ID: <AANLkTinmc1pgGd3vdHh-zoeXJ4jCIQaP2RdlxhQyk7a-@mail.gmail.com>

On Wed, Jun 30, 2010 at 1:47 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> I created a similar directory name on my (English) version of
> Mac OS X, and get the same Entrez failure.
>

Hi Konstantin,

Could you retest using the latest code from github? I hope that now
test_Entrez.py will work for you.

Thanks,

Peter

From biopython at maubp.freeserve.co.uk  Wed Jun 30 09:31:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 14:31:58 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTinmc1pgGd3vdHh-zoeXJ4jCIQaP2RdlxhQyk7a-@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
	<AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
	<AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>
	<AANLkTinmc1pgGd3vdHh-zoeXJ4jCIQaP2RdlxhQyk7a-@mail.gmail.com>
Message-ID: <AANLkTimQz379gZrGxwwvpKOi_lspOn5dzKQFYbdUQfAF@mail.gmail.com>

On Wed, Jun 30, 2010 at 2:05 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 30, 2010 at 1:47 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>
>> I created a similar directory name on my (English) version of
>> Mac OS X, and get the same Entrez failure.
>>
>
> Hi Konstantin,
>
> Could you retest using the latest code from github? I hope that now
> test_Entrez.py will work for you.

The second update should fix test_BioSQL.py as well.

Peter

From biopython at maubp.freeserve.co.uk  Wed Jun 30 10:24:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 15:24:57 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTimXNStsgSo2zbBz3TGfWnxB_Dn-XZCraPMD4H6M@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
	<AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
	<AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>
	<AANLkTinmc1pgGd3vdHh-zoeXJ4jCIQaP2RdlxhQyk7a-@mail.gmail.com>
	<AANLkTimQz379gZrGxwwvpKOi_lspOn5dzKQFYbdUQfAF@mail.gmail.com>
	<AANLkTimXNStsgSo2zbBz3TGfWnxB_Dn-XZCraPMD4H6M@mail.gmail.com>
Message-ID: <AANLkTikVjsHaO3EZNJgB7wWZ6G3Yde-_AkQuxUNDNZQt@mail.gmail.com>

On Wed, Jun 30, 2010 at 2:59 PM, Konstantin Okonechnikov
<k.okonechnikov at gmail.com> wrote:
> The fixes work!
> Only one test fails, but it doesn't look related to non-English OS
> problems. I've attached the new test log.

Great :)

I hadn't done anything about the Bio.Wise docstring test failure yet,
but it isn't linked to the non-English OS at all. I'll start a new thread...

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Jun 30 11:22:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 30 Jun 2010 11:22:16 -0400
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing,
	in particular location parsing
In-Reply-To: <bug-2738-42@http.bugzilla.open-bio.org/>
Message-ID: <201006301522.o5UFMGvo028548@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-30 11:22 EST -------
I've merged my github branch into the master.

Marking as fixed.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk  Wed Jun 30 11:23:12 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 16:23:12 +0100
Subject: [Biopython-dev] Re-written GenBank/EMBL feature location parsing
In-Reply-To: <AANLkTikgejWghQbe4LJnx82u7sCEi2A911O3BIg6JijW@mail.gmail.com>
References: <AANLkTikgejWghQbe4LJnx82u7sCEi2A911O3BIg6JijW@mail.gmail.com>
Message-ID: <AANLkTimEfOqNcUA91D7hVZEdX0AaR5JjhIo0eWMvh1tV@mail.gmail.com>

On Fri, Jun 25, 2010 at 4:21 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I've been working on and off recently on rewriting the location
> parsing for GenBank/EMBL features:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2738
>
> I have a branch ready for public testing, ... Would anyone like
> to volunteer to test the new branch before I merge it to the trunk?

I've just merged it - testing and feedback still welcome of course.

Peter

From biopython at maubp.freeserve.co.uk  Wed Jun 30 10:38:59 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 15:38:59 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTikVjsHaO3EZNJgB7wWZ6G3Yde-_AkQuxUNDNZQt@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
	<AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
	<AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>
	<AANLkTinmc1pgGd3vdHh-zoeXJ4jCIQaP2RdlxhQyk7a-@mail.gmail.com>
	<AANLkTimQz379gZrGxwwvpKOi_lspOn5dzKQFYbdUQfAF@mail.gmail.com>
	<AANLkTimXNStsgSo2zbBz3TGfWnxB_Dn-XZCraPMD4H6M@mail.gmail.com>
	<AANLkTikVjsHaO3EZNJgB7wWZ6G3Yde-_AkQuxUNDNZQt@mail.gmail.com>
Message-ID: <AANLkTinxk2s14jnD1C9jAnqFvrZQZjdvicd7T2p027yf@mail.gmail.com>

On Wed, Jun 30, 2010 at 3:24 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 30, 2010 at 2:59 PM, Konstantin Okonechnikov
> <k.okonechnikov at gmail.com> wrote:
>> The fixes work!
>> Only one test fails, but it doesn't look related to non-English OS
>> problems. I've attached the new test log.
>
> Great :)
>
> I hadn't done anything about the Bio.Wise docstring test failure yet,
> but it isn't linked to the non-English OS at all. I'll start a new thread...
>

Solved. The doctest was working UNLESS the test output was
being sent to a file.
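
A rough sketch of that kind of behaviour, for anyone following along.
This is NOT the Bio.Wise code: the function name, the "quiet" argument
and the use of sys.stdout.isatty() are assumptions made for
illustration. It only shows how a command line builder that checks
whether output goes to a terminal returns a different string once the
output is redirected.

```python
import sys


def build_align_cmdline(cmdline, pair, output_filename, kbyte=300000, quiet=None):
    """Hypothetical stand-in for a command line builder (not Bio.Wise).

    If quiet is None, add -quiet whenever stdout is not a terminal,
    for example when the test output is redirected to a file.
    """
    if quiet is None:
        quiet = not sys.stdout.isatty()
    args = list(cmdline) + ["-kbyte", str(kbyte)]
    if quiet:
        args.append("-quiet")
    args.extend(pair)
    args.extend([">", output_filename])
    return " ".join(args)


# Passing quiet explicitly makes the result independent of the terminal.
at_terminal = build_align_cmdline(
    ["dnal"], ("seq1.fna", "seq2.fna"), "/tmp/output", kbyte=100000, quiet=False
)
redirected = build_align_cmdline(
    ["dnal"], ("seq1.fna", "seq2.fna"), "/tmp/output", kbyte=100000, quiet=True
)
```

A doctest for a function like this is only stable if it passes the flag
explicitly rather than relying on the terminal check.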

Peter


From eric.talevich at gmail.com  Tue Jun  1 03:44:11 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 31 May 2010 23:44:11 -0400
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com>
Message-ID: <AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com>

On Mon, May 31, 2010 at 11:53 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Mon, May 31, 2010 at 4:38 PM, Eric Talevich <eric.talevich at gmail.com>
> wrote:
> > Hi all,
> >
> > This summer our GSoC student João Rodrigues will be implementing a number
> > of enhancements to Biopython's structural biology modules. Since Bio.PDB
> > is one of the most widely used parts of Biopython, I'd like to find a way
> > to let João add major new features without breaking existing code and
> > documentation.
> >
> > There are a few issues I'd like to address:
> >
> > 1. The I/O conventions of parse/read/write/convert seem to work very well
> > in SeqIO, AlignIO, Phylo, and other Biopython sub-packages. Bio.PDB
> > supports I/O in several formats, but the API is lower-level and isn't
> > unified in the same way (yet).
>
> Currently Bio.PDB supports the plain text PDB format, and has partial
> support for mmCIF. It lacks support for the XML PDB format, PDBML -
> Protein Data Bank Markup Language.
>

Yeah, it would be good to implement that at some point. For now, I'd be
happy to be able to read and write PDB files with a single function call
each, and design the I/O wrapper for easy extension to mmCIF and PDBML.


> Under this proposed scheme, what would you see as the basic record type
> (analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO and
> Bio.Phylo)? It would be nice to say a protein chain, but there is the issue
> of multiple models (e.g. from NMR). I presume you'd go with the model as
> the basic unit (where each model may contain multiple chains).
>

I'd consider a structure to be the basic unit of I/O. If we're going to make
better use of header info, that's generally associated with the whole
structure and not individual models -- we'd have to duplicate the header
info in each Model object emitted, which would be weird.

Are there any formats that store more than one structure in a file? If not,
then there's probably no need for a parse() function in Bio.Struct.


> > from Bio.Struct import WHATIF, Jpred
> > # Servers each get their own module
>
> Hmm - perhaps we may need to have another level here, Bio.Struct.Servers
> or Bio.Struct.WWW or something. How many of these do you expect?
>

João's project plan includes Dali and WHATIF:
http://biopython.org/wiki/GSOC2010_Joao

These servers do different things so I wouldn't expect any similarity in the
code between them. There are lots of servers that we *could* support...
Aesthetically, a Servers or WWW subdirectory would match
Bio.Struct.Applications and make the whole package a little more
self-documenting.

Here's one more idea: Fetching a single PDB file from RCSB requires a
separate import and a couple of calls. Should we make this even easier by
mimicking the efetch function in Bio.Entrez, something like

>>> handle = Bio.PDB.fetch("1MOT")

or

>>> from Bio.Struct.WWW import RCSB
>>> handle = RCSB.fetch("1MOT", "pdb")

?
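
To make the second variant concrete, here is a minimal sketch of what
such a fetch() might do. Everything in it is an assumption for
illustration: the function names, the default format, and in particular
the download URL layout, which is not taken from any existing Biopython
module.

```python
from urllib.request import urlopen

# Assumed URL layout for RCSB downloads; treat it as a placeholder.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/%s.%s"


def build_url(pdb_id, fmt="pdb"):
    """Return the (assumed) download URL for a PDB identifier."""
    return RCSB_DOWNLOAD % (pdb_id.upper(), fmt.lower())


def fetch(pdb_id, fmt="pdb"):
    """Return a handle to the remote file, in the spirit of Bio.Entrez.efetch.

    Hypothetical helper; it needs network access when called.
    """
    return urlopen(build_url(pdb_id, fmt))


url = build_url("1mot")
```

Keeping the URL construction separate from the network call means the
identifier and format handling can be tested offline.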

-Eric


From biopython at maubp.freeserve.co.uk  Tue Jun  1 09:05:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 10:05:43 +0100
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com>
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com>
Message-ID: <AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com>

2010/6/1 Eric Talevich:
> On Mon, May 31, 2010 at 11:53 AM, Peter wrote:
>
>> Under this proposed scheme, what would you see as the basic record type
>> (analogous to a SeqRecord, alignment or tree in Bio.SeqIO, Bio.AlignIO
>> and Bio.Phylo)? It would be nice to say a protein chain, but there is the
>> issue of multiple models (e.g. from NMR). I presume you'd go with the
>> model as the basic unit (where each model may contain multiple chains).
>>
>
> I'd consider a structure to be the basic unit of I/O. If we're going to make
> better use of header info, that's generally associated with the whole
> structure and not individual models -- we'd have to duplicate the header
> info in each Model object emitted, which would be weird.
>
> Are there any formats that store more than one structure in a file? If not,
> then there's probably no need for a parse() function in Bio.Struct.

OK, yes - a whole structure as the unit would work, so we would
only need the read function (one file is one structure) and not the
parse function (no point in iterating over one thing).
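
As a rough sketch of what a read-only interface could look like: the
Structure class, the _read_pdb helper and the format registry below are
hypothetical stand-ins written for this email, not existing Bio.PDB or
Bio.Struct code. The toy reader only collects the HEADER text and the
raw ATOM/HETATM lines.

```python
import io


class Structure:
    """Hypothetical stand-in for a structure object with header info."""

    def __init__(self, header, atom_lines):
        self.header = header
        self.atom_lines = atom_lines


def _read_pdb(handle):
    """Toy PDB reader: keep the HEADER text and the raw atom records."""
    header = ""
    atom_lines = []
    for line in handle:
        if line.startswith("HEADER"):
            header = line[6:].strip()
        elif line.startswith(("ATOM", "HETATM")):
            atom_lines.append(line.rstrip("\n"))
    return Structure(header, atom_lines)


# One reader per format name; mmCIF or PDBML readers would be added here.
_READERS = {"pdb": _read_pdb}


def read(handle, fmt):
    """Return the single structure stored in the handle (one file, one structure)."""
    try:
        reader = _READERS[fmt.lower()]
    except KeyError:
        raise ValueError("Unknown format %r" % fmt)
    return reader(handle)


example = io.StringIO(
    "HEADER    EXAMPLE\n"
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N\n"
    "END\n"
)
structure = read(example, "pdb")
```

With a registry like this, supporting another format means adding one
entry to the dictionary, and there is nothing for a parse() function to
iterate over.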

>> > from Bio.Struct import WHATIF, Jpred
>> > # Servers each get their own module
>>
>> Hmm - perhaps we may need to have another level here, Bio.Struct.Servers
>> or Bio.Struct.WWW or something. How many of these do you expect?
>>
>
> João's project plan includes Dali and WHATIF:
> http://biopython.org/wiki/GSOC2010_Joao
>
> These servers do different things so I wouldn't expect any similarity in the
> code between them. There are lots of servers that we *could* support...
> Aesthetically, a Servers or WWW subdirectory would match
> Bio.Struct.Applications and make the whole package a little more
> self-documenting.

My thoughts exactly.

> Here's one more idea: Fetching a single PDB file from RCSB requires a
> separate import and a couple of calls. Should we make this even easier by
> mimicking the efetch function in Bio.Entrez, something like
>
>>>> handle = Bio.PDB.fetch("1MOT")
>
> or
>
>>>> from Bio.Struct.WWW import RCSB
>>>> handle = RCSB.fetch("1MOT", "pdb")
>
> ?
>

That seems nice.

Peter


From krother at rubor.de  Tue Jun  1 09:59:31 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 1 Jun 2010 11:59:31 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
Message-ID: <ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>

Hi,

Got some comments & questions.

> 2. PDB headers seem to have become better structured in recent years, in
> ... parse_pdb_header needs some attention as well.

I haven't looked into this code for years... I think it might be a little
messy.


> 3. Kristian asked on this list a while ago about the proper location for
> his new code that works with RNA structures. While RCSB's PDB contains
> some RNA structures, the RNA world doesn't revolve around it. Similarly,
> João needs a place to put code for structure prediction/validation
> servers, command-line wrappers, secondary structures, etc.
>
> I propose a new sub-package called Bio.Struct for these enhancements:
>
> from Bio.Struct import RNA
> # Would this work for you, Kristian?

Yes, it would be more descriptive than the originally proposed Bio.RNA. I
am just concerned whether I could keep the 2D structure-related modules in
the same package.

> Alternatively, we could do all of this within the PDB module -- so picture
> the above examples with "PDB" in place of "Struct". This raises the chance
> of naming collisions, though, and doesn't solve issue #3 above.

I like Bio.PDB.RNA less for the same reasons plus the 2D structure issue.


> We'll leave the existing PDB module layout alone, in general. I think it
> will be necessary to add a few more attributes to the
> Bio.PDB.Structure.Structure class, but we can do this without breaking
> compatibility.
>
> Comments?

What about the modules for constructing coordinates & Loop Closure
(currently available on my Github branch)? I placed them in Bio.PDB
because they are not limited to RNA and are conceptually similar to the
operations performed by Bio.PDB.NeighborSearch and Bio.PDB.SVDSuperimposer
- or would it be better to gather such things in some other package within
Bio.PDB.Struct?

Cheers,
     Kristian


From biopython at maubp.freeserve.co.uk  Tue Jun  1 11:42:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 12:42:53 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID: <AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>

2010/6/1 Kristian Rother <krother at rubor.de>:
>
>> 3. Kristian asked on this list a while ago about the proper location for
>> his new code that works with RNA structures. While RCSB's PDB
>> contains some RNA structures, the RNA world doesn't revolve around
>> it. Similarly, João needs a place to put code for structure prediction/
>> validation servers, command-line wrappers, secondary structures, etc.
>>
>> I propose a new sub-package called Bio.Struct for these enhancements:
>>
>> from Bio.Struct import RNA
>> # Would this work for you, Kristian?
>
> Yes, it would be more descriptive than the originally proposed Bio.RNA . I
> am just concerned whether I could keep the 2D structure-related modules
> in the same package.

I don't necessarily see a problem with Bio.Struct or Bio.Structure covering
both 2D and 3D structures. Does this 2D stuff include file parsers? That
would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is better.

Peter


From biopython at maubp.freeserve.co.uk  Tue Jun  1 13:10:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 14:10:05 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
Message-ID: <AANLkTikfxQ87Cpqx466oHhYZF8fn7_bJtriZ-ocQ_2O2@mail.gmail.com>

On Mon, May 31, 2010 at 3:50 PM, Peter wrote:
> Hi all,
>
> With the new command line wrappers and the tutorial pushing
> users towards using subprocess we've had more queries
> about how to use it. The subprocess module itself is rather
> scary I guess, and things could be made a lot easier.
>
> I think the most typical use cases are:
>
> (1) Run the command, return the error code (integer)
> (2) Run the command, return stdout, stderr and error code
>
> In theory the function subprocess.call() would take care
> of the first example, but there is a cross platform annoyance
> here with the shell parameter. Also, if you want the output
> too things get even more tricky. It hasn't helped that there
> are a few platform specific quirks/bugs in subprocess itself
> (the different behaviour of the shell option on Windows,
> bug http://bugs.python.org/issue1124861 in old Pythons,
> the risk of deadlocks with large output files, etc).

In fact I've often found using os.system() much easier than
subprocess for the first use case - running a command and
getting the return code. I wondered about adding an example
of this to the tutorial but didn't find time before the last release
(even if the Python documentation does try and encourage
using subprocess instead).

Peter


From chapmanb at 50mail.com  Tue Jun  1 13:23:55 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 1 Jun 2010 09:23:55 -0400
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
Message-ID: <20100601132355.GU1054@sobchak.mgh.harvard.edu>

Peter;

> With the new command line wrappers and the tutorial pushing
> users towards using subprocess we've had more queries
> about how to use it. The subprocess module itself is rather
> scary I guess, and things could be made a lot easier.
[...]
> We could instead make the wrapper objects callable (define
> the magic method __call__) to offer this kind of functionality.
> This seems quite elegant to me. 

This is a good idea, although I'm 50/50 on the __call__ idea.
Having a run() command or something similar might be more intuitive
than the more magical call, if the idea is to appeal to users who
find subprocess too problematic.

I'd suggest having an option to not capture stdout and stderr, which
would help users avoid those cases where a program spews a lot to
stdout and it's unwieldy to capture and stick it into a string.

Brad


From biopython at maubp.freeserve.co.uk  Tue Jun  1 13:48:30 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 14:48:30 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <1275332206.4c04066ed4ec5@webmail.upv.es>
References: <1275332206.4c04066ed4ec5@webmail.upv.es>
Message-ID: <AANLkTimdncn6t6mMTfW3o-1aijnhVemB9XEpJC6qHFbN@mail.gmail.com>

On Mon, May 31, 2010 at 7:56 PM, Blanca Postigo Jose Miguel
<jblanca at btc.upv.es> wrote:
> Quoting Michael Sandford <sandford at ufl.edu>:
>
>> I've got a few comments as well:
>> > 4) The current Blast record stores its information in attributes. If you
>> use Bio.Entrez to parse Blast XML output (Biopython 1.54 contains the
>> necessary DTDs to do so), the information is stored in dictionaries. This has
>> some advantages. For example, it allows you to use record.keys() to find out
>> what the record contains. Ideally, I think that a Blast Record class should
>> inherit from a dictionary.
>
> I've developed for my own use a dict structure that represents a blast result.
> This structure also can represent many other results, like exonerate, SSAHA or
> any other number of aligners. Having a common representation for all of them
> allows you to create common filters that work with the same interface. I don't
> know if it is very efficient, but it has proven to be very convenient for us.
> You can take a look at:
>
> http://github.com/JoseBlanca/franklin/blob/master/franklin/alignment_search_result.py
>
> Best regards,
>
> Jose Blanca

It has some similarities to what I was imagining for a BioPerl-SearchIO-like
module. I'm still not convinced that we should just be using (subclasses of)
dictionaries - I would rather have important core properties like the hit
co-ordinates held explicitly as properties or attributes (and always using
Python counting, not whatever a given file format uses, like one-based
locations in BLAST output).
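
A minimal sketch of what explicit attributes with Python counting could
look like. The class name HitSketch and its method names are hypothetical,
chosen only for illustration; this is not an existing Biopython API.

```python
class HitSketch(object):
    """Hypothetical hit record; names are illustrative, not a Biopython API."""

    def __init__(self, hit_id, start, end):
        self.id = hit_id
        self.start = start  # zero-based, inclusive
        self.end = end      # zero-based, exclusive (Python slice style)

    @classmethod
    def from_one_based(cls, hit_id, first, last):
        """Build a hit from one-based inclusive positions, as in BLAST output."""
        return cls(hit_id, first - 1, last)

    def __len__(self):
        return self.end - self.start


subject = "ACGTACGTAC"
# Bases 3 to 6 in one-based file terms become the slice [2:6].
hit = HitSketch.from_one_based("example", 3, 6)
matched = subject[hit.start:hit.end]
```

With the conversion done once in the parser, the coordinates can be used
directly for slicing, whatever convention the file format used.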

Peter


From krother at rubor.de  Tue Jun  1 14:11:51 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 1 Jun 2010 16:11:51 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
Message-ID: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>

Hi,

>>> from Bio.Struct import RNA
>>> # Would this work for you, Kristian?
>>
>> Yes, it would be more descriptive than the originally proposed Bio.RNA .
>> I
>> am just concerned whether I could keep the 2D structure-related modules
>> in the same package.
>
> I don't necessarily see a problem with Bio.Struct or Bio.Structure
> covering
> both 2D and 3D structures. Does this 2D stuff include file parsers? That
> would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is better.

Yes, currently, RNA contains 2D stuff. It would complicate Struct.read().
On the other hand, the 2D stuff is independent from the 3D modules - could
be split into two packages -- but I think keeping RNA is simpler.

Best Regards,
   Kristian


From biopython at maubp.freeserve.co.uk  Tue Jun  1 15:15:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 16:15:03 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <20100601132355.GU1054@sobchak.mgh.harvard.edu>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
Message-ID: <AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>

On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Peter;
>
>> With the new command line wrappers and the tutorial pushing
>> users towards using subprocess we've had more queries
>> about how to use it. The subprocess module itself is rather
>> scary I guess, and things could be made a lot easier.
> [...]
>> We could instead make the wrapper objects callable (define
>> the magic method __call__) to offer this kind of functionality.
>> This seems quite elegant to me.
>
> This is a good idea, although I'm 50/50 on the __call__ idea.
> Having a run() command or something similar might be more intuitive
> than the more magical call, if the idea is to appeal to users who
> find subprocess too problematic.

Fair point. We'd have to audit all the existing wrappers to make
sure we have some suitable names free (e.g run or execute).

> I'd suggest having an option to not capture stdout and stderr, which
> would help users avoid those cases where a program spews a lot to
> stdout and it's unwieldy to capture and stick it into a string.

We need to avoid any risk of deadlocks, so I guess the safe
implementation here would be to call subprocess with stdout and
stderr sent to dev null.
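
A minimal sketch of that idea, assuming a hypothetical helper called
run_quietly (not part of Biopython). The child's stdout and stderr go to
the null device, so no pipe is created that could fill up and block, and
only the return code comes back.

```python
import os
import subprocess
import sys


def run_quietly(args):
    """Run a command, discard its output, and return the exit code.

    Hypothetical helper for illustration; not an existing Biopython function.
    """
    # os.devnull is "/dev/null" on Unix and "nul" on Windows.
    with open(os.devnull, "w") as null:
        # No pipes are involved, so a chatty child cannot fill a buffer
        # and stall while the parent waits for it to finish.
        return subprocess.call(args, stdout=null, stderr=null)


# A child process that prints a lot and then exits with status 3.
noisy = [sys.executable, "-c",
         "import sys; print('x' * 100000); sys.exit(3)"]
return_code = run_quietly(noisy)
```

Passing the command as a list avoids the shell parameter, which is where
the Unix and Windows behaviour differs.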

Peter


From eric.talevich at gmail.com  Tue Jun  1 18:25:52 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 1 Jun 2010 14:25:52 -0400
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com> 
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID: <AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>

On Tue, Jun 1, 2010 at 10:11 AM, Kristian Rother <krother at rubor.de> wrote:

> Hi,
>
> >>> from Bio.Struct import RNA
> >>> # Would this work for you, Kristian?
> >>
> >> Yes, it would be more descriptive than the originally proposed Bio.RNA .
> >> I
> >> am just concerned whether I could keep the 2D structure-related modules
> >> in the same package.
> >
> > I don't necessarily see a problem with Bio.Struct or Bio.Structure
> > covering
> > both 2D and 3D structures. Does this 2D stuff include file parsers? That
> > would complicate plans for Bio.Struct.read() etc. Maybe Bio.RNA is
> better.
>
> Yes, currently, RNA contains 2D stuff. It would complicate Struct.read().
> On the other hand, the 2D stuff is independent from the 3D modules - could
> be split into two packages -- but I think keeping RNA is simpler.
>
> Best Regards,
>    Kristian
>
>
I could be totally wrong here, but I think it's useful to lay out some
assumptions and intuitions explicitly.

To me, secondary structure is not really a separate dimension in its own
right, the way tertiary structure corresponds to 3D space and primary
structure corresponds to a linear sequence. Instead, secondary structure has
meaning in 3D space, but is usually serialized as a linear sequence. That
is, we want to parse something that resembles a sequence, but be able to map
it onto a 3D structure. (More for proteins than for RNA, usually.)

(For non-RNA folk, here's an example of RNA secondary structure:
http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna
)

For instance, the output of DSSP and Jpred describes a protein's secondary
structure, but the input to DSSP is a 3D structure, while Jpred accepts a
protein sequence. The representation of secondary structure isn't distinct
from either of these. I'd want both of these available in Bio.Struct
(eventually).

This means that some interaction between Bio.Struct and SeqIO is necessary.
It would be neat if secondary structure regions were represented as
SeqFeature instances, and secondary-structure parsers returned some kind of
subclass of SeqRecord -- or a standard SeqRecord containing a special kind
of Seq.

The secondary-structure parsers for RNA and proteins should be separate,
too, since the annotated features are different. So the function
Bio.Struct.read() can apply exclusively to 3D structures. Would it be
reasonable for Bio.Struct.RNA.read() to apply exclusively to RNA secondary
structures -- assuming that anything that's not a secondary structure, 3D
structure, or nucleotide sequence is something special that belongs in its
own module?

As for protein secondary structure, it's usually associated with a sequence
or a structure, so maybe we could get by with storing that information in an
ordinary Structure or SeqRecord object without inventing a new subclass.
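
To make the "regions as features" idea concrete, here is a small sketch in
plain Python. It assumes a DSSP-like string with one code per residue and
"-" for unassigned positions; the function name secstruc_regions is made
up for this example. Each (start, end, code) tuple is the kind of
information a SeqFeature could carry.

```python
from itertools import groupby


def secstruc_regions(codes, blank="-"):
    """Return (start, end, code) runs; zero-based, end-exclusive positions."""
    regions = []
    position = 0
    for code, run in groupby(codes):
        length = len(list(run))
        if code != blank:
            regions.append((position, position + length, code))
        position += length
    return regions


# Two unassigned residues, a helix (H), a gap, then a strand (E).
regions = secstruc_regions("--HHHH--EEE-")
```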

Best,
Eric


From jblanca at btc.upv.es  Wed Jun  2 06:21:36 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Wed, 2 Jun 2010 08:21:36 +0200
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
Message-ID: <201006020821.36486.jblanca@btc.upv.es>

On Tuesday 01 June 2010 17:15:03 Peter wrote:
> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> > Peter;
> >
> >> With the new command line wrappers and the tutorial pushing
> >> users towards using subprocess we've had more queries
> >> about how to use it. The subprocess module itself is rather
> >> scary I guess, and things could be made a lot easier.

We had the same need. We solved it with a call function. You can take a look 
at:

http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_utils.py

Regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)


From krother at rubor.de  Wed Jun  2 08:17:01 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 2 Jun 2010 10:17:01 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
Message-ID: <efd9344002b1ace781f63182344f0859-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxTXg5dXg==-webmailer2@server06.webmailer.hosteurope.de>

Hi,

>> >>> from Bio.Struct import RNA
..
>> > I don't necessarily see a problem with Bio.Struct or Bio.Structure
>> > covering both 2D and 3D structures.


Eric, I agree with you - the secondary structure of RNA maps nicely to 3D
space. Generally, I think it is a little more common to work with 2D
structures in the absence of 3D information for RNA than it is for
proteins - 2D prediction of RNA is maybe simply a less nasty target.


Eric wrote:

> I could be totally wrong here, but I think it's useful to lay out some
> assumptions and intuitions explicitly.
>
> To me, secondary structure is not really a separate dimension in its own
> right, the way tertiary structure corresponds to 3D space and primary
> structure corresponds to a linear sequence. Instead, secondary structure
> has
> meaning in 3D space, but is usually serialized as a linear sequence. That
> is, we want to parse something that resembles a sequence, but be able to
> map
> it onto a 3D structure. (More for proteins than for RNA, usually.)
>
> (For non-RNA folk, here's an example of RNA secondary structure:
> http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna
> )
>
> For instance, the output of DSSP and Jpred describes a protein's secondary
> structure, but the input to DSSP is a 3D structure, while Jpred accepts a
> protein sequence. The representation of secondary structure isn't distinct
> from either of these. I'd want both of these available in Bio.Struct
> (eventually).
>
> This means that some interaction between Bio.Struct and SeqIO is
> necessary.
> It would be neat if secondary structure regions were represented as
> SeqFeature instances, and secondary-structure parsers returned some kind
> of
> subclass of SeqRecord -- or a standard SeqRecord containing a special kind
> of Seq.

So far the Secstruc parsers I've implemented just return
(sequence, secstruc) tuples. But putting this into a SeqRecord makes sense
- I understand this fits the Biopython architecture better.

Maybe instead of a Seq or SeqRecord subclass we could use the decorator
pattern (decorating a class, not the Python decorator function syntax).
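
A rough sketch of the decorator pattern in this sense, using a stand-in
record class so it runs without Biopython. PlainRecord and
SecstrucDecorator are hypothetical names for illustration only.

```python
class PlainRecord(object):
    """Hypothetical stand-in for a sequence record."""

    def __init__(self, seq, id):
        self.seq = seq
        self.id = id


class SecstrucDecorator(object):
    """Wrap a record and add a secstruc field without subclassing it."""

    def __init__(self, record, secstruc):
        if len(secstruc) != len(record.seq):
            raise ValueError("secstruc and sequence lengths differ")
        self._record = record
        self.secstruc = secstruc

    def __getattr__(self, name):
        # Called only when normal lookup fails on the wrapper itself,
        # so all other attributes are forwarded to the wrapped record.
        if name == "_record":
            raise AttributeError(name)
        return getattr(self._record, name)


rec = SecstrucDecorator(PlainRecord("GGGGAAAACCCC", "hairpin"),
                        "((((....))))")
```

Code that only reads rec.seq or rec.id keeps working, and code that knows
about secondary structure can also read rec.secstruc.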


A potential problem that I'd like to point out early is that we are
working with modified RNA nucleotides a lot (up to 20% of residues in
every tRNA). This would require extending the RNA Alphabet (which is
currently just "AGCU") - but I see this as remote from the Bio.XXXX.read()
thread.

> The secondary-structure parsers for RNA and proteins should be separate,
> too, since the annotated features are different. So the function
> Bio.Struct.read() can apply exclusively to 3D structures. Would it be
> reasonable for Bio.Struct.RNA.read() to apply exclusively to RNA secondary
> structures -- assuming that anything that's not a secondary structure, 3D
> structure, or nucleotide sequence is something special that belongs in its
> own module?

To summarize, we could use:

1) protein 3D structures:
   Bio.Struct.read() --> Bio.PDB.Structure

2) RNA 3D structures:
   Bio.Struct.read() --> Bio.PDB.Structure

3) RNA 2D structures:
   Bio.Struct.RNA.read() --> Bio.SeqRecord (extended/decorated by a
secstruc field)

4) protein 2D structures: uses special parser module??

5) plain sequences:
   Bio.read() --> Bio.SeqRecord


Eric, does this summarize your thoughts correctly?

This would work for me. Any comments from the others?

Best,
   Kristian


From biopython at maubp.freeserve.co.uk  Wed Jun  2 08:44:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 09:44:54 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <201006020821.36486.jblanca@btc.upv.es>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
	<201006020821.36486.jblanca@btc.upv.es>
Message-ID: <AANLkTilCHrgSCgkkqd0votf3qqIW97wzawk3pAC7ho7Z@mail.gmail.com>

On Wed, Jun 2, 2010 at 7:21 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> On Tuesday 01 June 2010 17:15:03 Peter wrote:
>> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> > Peter;
>> >
>> >> With the new command line wrappers and the tutorial pushing
>> >> users towards using subprocess we've had more queries
>> >> about how to use it. The subprocess module itself is rather
>> >> scary I guess, and things could be made a lot easier.
>
> We had the same need. We solved it with a call function. You can take
> a look at:
>
> http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_utils.py
>

It looks complicated (and I'm sure with good reason), but I'd guess
you've never tried this on Windows?

We used to have the Bio.Application.generic_run function for calling
a command - but making the command line wrapper callable or
having a method on the command line wrapper is much easier to
use (no extra import needed).

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun  2 09:23:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 10:23:15 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
Message-ID: <AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>

On Tue, Jun 1, 2010 at 7:25 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
> I could be totally wrong here, but I think it's useful to lay out some
> assumptions and intuitions explicitly.
>
> To me, secondary structure is not really a separate dimension in its own
> right, the way tertiary structure corresponds to 3D space and primary
> structure corresponds to a linear sequence. Instead, secondary structure has
> meaning in 3D space, but is usually serialized as a linear sequence. That
> is, we want to parse something that resembles a sequence, but be able to map
> it onto a 3D structure. (More for proteins than for RNA, usually.)
>
> (For non-RNA folk, here's an example of RNA secondary structure:
> http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna
> )
>
> For instance, the output of DSSP and Jpred describes a protein's secondary
> structure, but the input to DSSP is a 3D structure, while Jpred accepts a
> protein sequence. The representation of secondary structure isn't distinct
> from either of these. I'd want both of these available in Bio.Struct
> (eventually).
>
> This means that some interaction between Bio.Struct and SeqIO is necessary.
> It would be neat if secondary structure regions were represented as
> SeqFeature instances, and secondary-structure parsers returned some kind of
> subclass of SeqRecord -- or a standard SeqRecord containing a special kind
> of Seq.
>
> ...
>
> As for protein secondary structure, it's usually associated with a sequence
> or a structure, so maybe we could get by with storing that information in an
> ordinary Structure or SeqRecord object without inventing a new subclass.

Maybe all/most secondary structure parsers can just go into Bio.SeqIO (for
proteins, RNA and DNA alike). We can store a secondary structure string as
per-letter-annotation, or things like helix regions as SeqFeature objects.

Peter


From jblanca at btc.upv.es  Wed Jun  2 09:24:24 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Wed, 2 Jun 2010 11:24:24 +0200
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTilCHrgSCgkkqd0votf3qqIW97wzawk3pAC7ho7Z@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<201006020821.36486.jblanca@btc.upv.es>
	<AANLkTilCHrgSCgkkqd0votf3qqIW97wzawk3pAC7ho7Z@mail.gmail.com>
Message-ID: <201006021124.24499.jblanca@btc.upv.es>

On Wednesday 02 June 2010 10:44:54 Peter wrote:
> On Wed, Jun 2, 2010 at 7:21 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> > On Tuesday 01 June 2010 17:15:03 Peter wrote:
> >> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> >> > Peter;
> >> >
> >> >> With the new command line wrappers and the tutorial pushing
> >> >> users towards using subprocess we've had more queries
> >> >> about how to use it. The subprocess module itself is rather
> >> >> scary I guess, and things could be made a lot easier.
> >
> > We had the same need. We solved it with a call function. You can take
> > a look at:
> >
> > http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/cmd_utils.py
>
> It looks complicated (and I'm sure with good reason), but I'd guess
> you've never tried this on Windows?

Yes, it is somewhat complicated. We need some features like allowing
stdout to be a file or just a pipe (some programs have very long stdouts). We
have added everything we have required for our programs.

No, we haven't tested anything on Windows.

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)


From biopython at maubp.freeserve.co.uk  Wed Jun  2 09:25:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 10:25:47 +0100
Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements
Message-ID: <AANLkTinC97We1P65J3Hjfp1e5NTjrBRG_ILda1W2MWO3@mail.gmail.com>

On Wed, Jun 2, 2010 at 9:17 AM, Kristian Rother <krother at rubor.de> wrote:
>
> A potential problem that I'd like to point out early is that we are
> working with modified RNA nucleotides a lot (up to 20% of residues in
> every tRNA). This would require extending the RNA Alphabet (which now just
> is "AGCU") - but I see this as remote from the Bio.XXXX.read() thread.
>

What letters are you missing? There is a commented out ExtendedIUPACRNA
alphabet that may be relevant in Bio/Alphabet/IUPAC.py

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun  2 11:36:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 12:36:46 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
Message-ID: <AANLkTilkILMvG5huqqZ2-rKMSQNqOAQGiirVO3rNBCwt@mail.gmail.com>

On Tue, Jun 1, 2010 at 4:15 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> I'd suggest having an option to not capture stdout and stderr, which
>> would help users avoid those cases where a program spews a lot to
>> stdout and it's unwieldy to capture and stick it into a string.
>
> We need to avoid any risk of deadlocks, so I guess the safe
> implementation here would be to call subprocess with stdout and
> stderr sent to dev null.

How does this look? Tested on Mac and Windows:
http://github.com/peterjc/biopython/tree/app-exec2

Example usage without capturing the output:

    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    return_code = water_cmd()
    print "Return code: %i" % return_code

Example usage with stdout and stderr capture:

    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    stdout, stderr, return_code = water_cmd(capture=True)
    print "Return code: %i" % return_code
    print "Tool output:\n%s" % stdout

Note in this implementation it either returns an integer error level
(the default) or a tuple of stdout, stderr and the error level return
code. If we opt for adding methods rather than using __call__
these could be different methods instead.
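
For readers who do not want to open the branch, here is a stripped-down
sketch of the two return shapes. It is not the code on the app-exec2
branch; CommandSketch is a hypothetical stand-in for a wrapper object, and
discarding the output when capture is off is an assumption taken from the
dev null idea earlier in this thread.

```python
import os
import subprocess
import sys


class CommandSketch(object):
    """Hypothetical stand-in for a command line wrapper object."""

    def __init__(self, args):
        self.args = list(args)

    def __call__(self, capture=False):
        if not capture:
            # Assumed behaviour: discard output, return only the code.
            with open(os.devnull, "w") as null:
                return subprocess.call(self.args, stdout=null, stderr=null)
        child = subprocess.Popen(self.args,
                                 stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE,
                                 universal_newlines=True)
        # communicate() drains both pipes until the child exits, which
        # avoids the deadlock risk of reading them one after the other.
        stdout, stderr = child.communicate()
        return stdout, stderr, child.returncode


cmd = CommandSketch([sys.executable, "-c", "print('hello')"])
return_code = cmd()
stdout, stderr, captured_code = cmd(capture=True)
```

Splitting this into two methods would give each call a single return type,
at the cost of reserving two method names on every wrapper.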

Another potentially useful option would be to copy the
subprocess.check_call() function in Python 2.5+ which verifies
the return code (error level) is zero and raises an exception if not
(probably only sensible if not capturing the output?). Maybe this
could even be the default behaviour?
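
As a reminder of what subprocess.check_call() does with a non-zero return
code, here is a short standard-library example (nothing Biopython specific;
the child command is just a Python one-liner that exits with status 2):

```python
import subprocess
import sys

# A child process that fails with exit status 2.
failing = [sys.executable, "-c", "import sys; sys.exit(2)"]

try:
    subprocess.check_call(failing)
    failed_code = None
except subprocess.CalledProcessError as err:
    # The exception carries the return code of the failed command.
    failed_code = err.returncode
```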

[I would prefer to keep the interface as simple as possible though,
fewer options is better! KISS principle.]

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun  2 11:59:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 12:59:46 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTilkILMvG5huqqZ2-rKMSQNqOAQGiirVO3rNBCwt@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
	<AANLkTilkILMvG5huqqZ2-rKMSQNqOAQGiirVO3rNBCwt@mail.gmail.com>
Message-ID: <AANLkTimZShDqgJ__BO8sqvkJl7DBsLXS2iz-0ATW0saa@mail.gmail.com>

On Wed, Jun 2, 2010 at 12:36 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Jun 1, 2010 at 4:15 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>>> I'd suggest having an option to not capture stdout and stderr, which
>>> would help users avoid those cases where a program spews a lot to
>>> stdout and it's unwieldy to capture and stick it into a string.
>>
>> We need to avoid any risk of deadlocks, so I guess the safe
>> implementation here would be to call subprocess with stdout and
>> stderr sent to dev null.
>
> How does this look? Tested on Mac and Windows:
> http://github.com/peterjc/biopython/tree/app-exec2
>
> Example usage without capturing the output:
>
>     from Bio.Emboss.Applications import WaterCommandline
>     water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                  asequence="a.fasta", bsequence="b.fasta")
>     print "About to run:\n%s" % water_cmd
>     return_code = water_cmd()
>     print "Return code: %i" % return_code
>
> Example usage with stdout and stderr capture:
>
>     from Bio.Emboss.Applications import WaterCommandline
>     water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                  asequence="a.fasta", bsequence="b.fasta")
>     print "About to run:\n%s" % water_cmd
>     stdout, stderr, return_code = water_cmd(capture=True)
>     print "Return code: %i" % return_code
>     print "Tool output:\n%s" % stdout
>
> Note in this implementation it either returns an integer error level
> (the default) or a tuple of stdout, stderr and the error level return
> code. If we opt for adding methods rather than using __call__
> these could be different methods instead.
>
> Another potentially useful option would be to copy the
> subprocess.check_call() function in Python 2.5+ which verifies
> the return code (error level) is zero and raises an exception if not
> (probably only sensible if not capturing the output?). Maybe this
> could even be the default behaviour?
>
> [I would prefer to keep the interface as simple as possible though,
> fewer options is better! KISS principle.]

With that in mind, as I mentioned yesterday maybe we should just
update the documentation to suggest using os.system() when you
just need the return code and there is no stdin to worry about:

    import os
    from Bio.Emboss.Applications import WaterCommandline
    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
                                 asequence="a.fasta", bsequence="b.fasta")
    print "About to run:\n%s" % water_cmd
    return_code = os.system(str(water_cmd))
    print "Return code: %i" % return_code

Even if the Python documentation seems to be discouraging it,
using os.system() seems simple, robust, and cross platform. We
could even update the tutorial now and post it online - it should
make some people's lives a little easier.

[Note this is actually a silly example, I should be telling water to
output to a file, not stdout which is then ignored.]

Peter


From krother at rubor.de  Wed Jun  2 12:14:05 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 2 Jun 2010 14:14:05 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
Message-ID: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>


Hi Peter,

Bio.SeqIO would be a nice place for RNA 2D parsers. I can create a new
branch for that (on Git: krother/biopython).

Putting secondary structures like '((((....))))' for a hairpin into the
letter_annotations field makes sense. I think it would even work for
pseudoknotted RNA, which is hard to represent as a string; one possible
notation would be '(((..[[[....)))..]]]'.

Where should the str subclass for secondary structures that the parsers
create go? Could it be Bio.Struct.RNA?
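
To make the question concrete, here is a rough sketch of what such a str
subclass might look like. The class name SecondaryStructure and its pairs()
method are hypothetical (nothing like this exists in Biopython as far as this
thread shows), and the set of bracket types is an assumption. Each bracket
type is matched on its own stack, which is what lets the pseudoknot notation
work.

```python
class SecondaryStructure(str):
    """Dot-bracket string; hypothetical class, not part of Biopython."""

    # Each bracket type is matched independently, which allows pseudoknots.
    BRACKETS = {")": "(", "]": "[", "}": "{"}

    def pairs(self):
        """Return a sorted list of (i, j) index pairs for paired positions."""
        stacks = dict((opener, []) for opener in self.BRACKETS.values())
        found = []
        for index, char in enumerate(self):
            if char in stacks:
                stacks[char].append(index)
            elif char in self.BRACKETS:
                stack = stacks[self.BRACKETS[char]]
                if not stack:
                    raise ValueError("Unmatched %r at position %i" % (char, index))
                found.append((stack.pop(), index))
            elif char != ".":
                raise ValueError("Unexpected character %r" % char)
        for opener, stack in stacks.items():
            if stack:
                raise ValueError("Unmatched %r at position %i" % (opener, stack[-1]))
        return sorted(found)


hairpin = SecondaryStructure("((((....))))")
knot = SecondaryStructure("(((..[[[....)))..]]]")
print(hairpin.pairs())
print(knot.pairs())
```

Because it is still a str, an object like this could be stored wherever a
plain string is accepted.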

Best,
   Kristian


>> As for protein secondary structure, it's usually associated with a
>> sequence or a structure, so maybe we could get by with storing that
>> information in an ordinary Structure or SeqRecord object without
>> inventing a new subclass.
>
> Maybe all/most secondary structure parsers can just go into Bio.SeqIO (for
> both proteins, RNA and DNA). We can store a secondary structure string as
> per-letter-annotation, or things like helix regions as SeqFeature objects.
>
> Peter
>
>


From krother at rubor.de  Wed Jun  2 12:21:43 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 2 Jun 2010 14:21:43 +0200
Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements
In-Reply-To: <AANLkTinC97We1P65J3Hjfp1e5NTjrBRG_ILda1W2MWO3@mail.gmail.com>
References: <AANLkTinC97We1P65J3Hjfp1e5NTjrBRG_ILda1W2MWO3@mail.gmail.com>
Message-ID: <837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de>


Hi Peter,

I'm afraid the matter is more complicated. To date, we have 115 modified
RNA bases, which means in practice that you run out of nice ASCII
characters. Moreover, some people use one-letter symbols in RNA as
wildcards (R for purine, Y for pyrimidine). As a consequence, several sets
of abbreviations have been developed - see
http://modomics.genesilico.pl/modification_list to get an impression.

For our own purposes we've written a class covering the different
nomenclatures, but I think it's incompatible with Bio.Alphabet. I'd like
to change that.

Best Regards,
   Kristian


> On Wed, Jun 2, 2010 at 9:17 AM, Kristian Rother <krother at rubor.de> wrote:
>>
>> A potential problem that I'd like to point out early is that we are
>> working with modified RNA nucleotides a lot (up to 20% of residues in
>> every tRNA). This would require extending the RNA Alphabet (which is now
>> just "AGCU") - but I see this as remote from the Bio.XXXX.read() thread.
>>
> What letters are you missing? There is a commented-out ExtendedIUPACRNA
> alphabet in Bio/Alphabet/IUPAC.py that may be relevant.
>
> Peter
>
>


From biopython at maubp.freeserve.co.uk  Wed Jun  2 13:22:36 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 14:22:36 +0100
Subject: [Biopython-dev] RNA alphabets; was Bio.PDB enhancements
In-Reply-To: <837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de>
References: <AANLkTinC97We1P65J3Hjfp1e5NTjrBRG_ILda1W2MWO3@mail.gmail.com>
	<837b33ddc1456279e108d21c0d12d3fb-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWQtfXA==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID: <AANLkTikTZLYiEVOhrOe3gHxel5b_ijCCCF-HOb-X_tPT@mail.gmail.com>

On Wed, Jun 2, 2010 at 1:21 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
> I'm afraid the matter is more complicated. To date, we have 115 modified
> RNA bases, which means in practice that you run out of nice ASCII
> characters. Moreover, some people use one-letter symbols in RNA as
> wildcards (R for purine, Y for pyrimidine). As a consequence, several sets
> of abbreviations have been developed - see
> http://modomics.genesilico.pl/modification_list to get an impression.
>
> For our own purposes we've written a class covering the different
> nomenclatures, but I think it's incompatible with Bio.Alphabet. I'd like
> to change that.
>
> Best Regards,
>    Kristian

Hmm. I wonder if the HTML entities would work nicely in Python
(as unicode)? That way you could have an unambiguous string
representation where each letter is one character long.
I'm thinking a Seq subclass (with a special alphabet) might be
the way to go here, allowing access to the single character
entities by default but also the longer codes as well.
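
As a rough illustration of the idea (not a proposal for the actual API),
here is a sketch using plain unicode strings and a lookup table. The choice
of symbols and the long codes in the table are assumptions made up for this
example, not an agreed nomenclature, and the function name is hypothetical.

```python
# Illustrative table only: the symbols and long codes are assumptions chosen
# for this sketch, not an agreed nomenclature.
LONG_CODES = {
    u"A": u"A",
    u"C": u"C",
    u"G": u"G",
    u"U": u"U",
    u"\u03a8": u"PSU",  # Greek capital psi standing in for pseudouridine
    u"D": u"H2U",       # D standing in for dihydrouridine
}


def to_long_codes(sequence):
    """Translate a one-character-per-residue string into a list of long codes."""
    return [LONG_CODES[letter] for letter in sequence]


trna_fragment = u"GU\u03a8CDA"
print(len(trna_fragment))
print(to_long_codes(trna_fragment))
```

The point is only that each residue stays one character long, so slicing and
len() behave as usual, while the longer codes remain available on request.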

There are similarities with modified peptide sequences where
there are clear three letter codes, but not one letter codes.

Tricky.

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun  2 13:24:49 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Jun 2010 14:24:49 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID: <AANLkTinJE4eBPO8ydgrunBzWZqeN5Nw4nyuo8GEaRTzr@mail.gmail.com>

On Wed, Jun 2, 2010 at 1:14 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
> Bio.SeqIO would be a nice place for RNA 2D parsers. I can create a new
> branch for that (on Git: krother/biopython).
>
> Putting secondary structures like '((((....))))' for a hairpin into the
> letter_annotations field makes sense. I think it would even work for
> pseudoknotted RNA, which is hard to represent as a string; one possible
> notation would be '(((..[[[....)))..]]]'.
>
> Where should the str subclass for secondary structures that the parsers
> create go? Could it be Bio.Struct.RNA?
>
> Best,
>    Kristian

You don't think plain strings in the SeqRecord's letter_annotations
dict would be enough? Assuming you do need something more, then
perhaps Bio.Seq or Bio.SeqUtils might be worth considering
as alternatives to Bio.Struct.RNA.

Peter


From eric.talevich at gmail.com  Thu Jun  3 16:17:09 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 3 Jun 2010 12:17:09 -0400
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com> 
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com> 
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com> 
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
Message-ID: <AANLkTimUOe06zg8ksdoyLXtg6SkPZtTg-RdtAWSQ-cQi@mail.gmail.com>

On Wed, Jun 2, 2010 at 8:14 AM, Kristian Rother <krother at rubor.de> wrote:

>
> Putting secondary structures like '((((....))))' for a hairpin into the
> letter_annotations field makes sense. I think it would even work for
> pseudoknotted RNA, which is hard to represent as a string; one possible
> notation would be '(((..[[[....)))..]]]'.
>
>
Here's another format that was designed to represent pseudoknots:
http://www.uga.edu/RNA-Informatics/files/software/RNApasta.help.html#Format

I'm not sure how standardized or widely used it is, but the program
RNA-pasta works with it:
http://www.uga.edu/RNA-Informatics/?f=software&p=RNApasta

-Eric


From biopython at maubp.freeserve.co.uk  Thu Jun  3 16:43:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Jun 2010 17:43:47 +0100
Subject: [Biopython-dev] More SeqRecord methods
In-Reply-To: <AANLkTikXgZH_CYjV4J_Ii8DtPWwtsN35SJ-Hc_gpq3B1@mail.gmail.com>
References: <AANLkTikXgZH_CYjV4J_Ii8DtPWwtsN35SJ-Hc_gpq3B1@mail.gmail.com>
Message-ID: <AANLkTikDZm4AjPG_gbZhBTy0i11PRT_FtDg_cmrvoqI0@mail.gmail.com>

On Mon, May 31, 2010 at 3:53 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> What do people think of adding upper and lower methods to the SeqRecord?
> http://bugzilla.open-bio.org/show_bug.cgi?id=3054

I checked that in with an example in the tutorial.

> If that is well received, how about adding another Seq method to the
> SeqRecord, the newish ungap method?
> http://bugzilla.open-bio.org/show_bug.cgi?id=3060

This one I would like some feedback on first. I'm sure the implementation
could be made much more efficient too.

Peter


From bugzilla-daemon at portal.open-bio.org  Thu Jun  3 16:45:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Jun 2010 12:45:16 -0400
Subject: [Biopython-dev] [Bug 3054] Add upper and lower methods to the
	SeqRecord
In-Reply-To: <bug-3054-42@http.bugzilla.open-bio.org/>
Message-ID: <201006031645.o53GjGd9019264@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3054


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-03 12:45 EST -------
Checked in:
http://github.com/biopython/biopython/tree/f4f11a9c4e7aca10c33cfe93c78d4972a0d736f8

With an example in the tutorial too:
http://github.com/biopython/biopython/commit/3de8bbd423010eb0b480b8966041f7c6d8e9890d

Marking this as fixed. See also:
http://lists.open-bio.org/pipermail/biopython-dev/2010-May/007772.html
http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007801.html


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython at maubp.freeserve.co.uk  Thu Jun  3 17:24:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Jun 2010 18:24:43 +0100
Subject: [Biopython-dev] More SeqRecord methods
In-Reply-To: <AANLkTikDZm4AjPG_gbZhBTy0i11PRT_FtDg_cmrvoqI0@mail.gmail.com>
References: <AANLkTikXgZH_CYjV4J_Ii8DtPWwtsN35SJ-Hc_gpq3B1@mail.gmail.com>
	<AANLkTikDZm4AjPG_gbZhBTy0i11PRT_FtDg_cmrvoqI0@mail.gmail.com>
Message-ID: <AANLkTinvkmF3ZUfO7z7n0HXMxj6W4b_kG9as-s8zDMlq@mail.gmail.com>

On Thu, Jun 3, 2010 at 5:43 PM, Peter wrote:
> On Mon, May 31, 2010 at 3:53 PM, Peter wrote:
>
>> ..., how about adding another Seq method to the
>> SeqRecord, the newish ungap method?
>> http://bugzilla.open-bio.org/show_bug.cgi?id=3060
>
> This one I would like some feedback on first. I'm sure the
> implementation could be made much more efficient too.

Maybe I should mention that I also envisage a similar method for the
alignment object, to give a new alignment with any all-gap-columns
removed (perhaps with an optional argument to specify a threshold
for the number of gaps required - defaulting to only removing columns
which are all gaps).

Again, the simplest way to implement this is to re-use the new
alignment slicing and addition features - much as how I did it
for the proposed SeqRecord ungap method.
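
To show what I mean by slicing and addition, here is a minimal sketch that
uses a plain list of strings in place of the alignment object. The function
name, the threshold argument and its default are assumptions for
illustration only; the real method might look quite different.

```python
def remove_gap_columns(rows, threshold=None, gap="-"):
    """Return new rows without columns holding at least `threshold` gaps.

    rows is a list of equal-length strings. By default the threshold is the
    number of rows, so only all-gap columns are dropped.
    """
    if threshold is None:
        threshold = len(rows)
    length = len(rows[0])
    # Flag each column to keep, then collect runs of kept columns as slices.
    keep = [sum(row[i] == gap for row in rows) < threshold for i in range(length)]
    result = [""] * len(rows)
    start = None
    for i in range(length + 1):
        kept = i < length and keep[i]
        if kept and start is None:
            start = i
        elif not kept and start is not None:
            # Slice out the run and add it to what has been built so far.
            result = [done + row[start:i] for done, row in zip(result, rows)]
            start = None
    return result


alignment = ["AC--GT-A",
             "AC--G--A",
             "A---G--A"]
print(remove_gap_columns(alignment))
print(remove_gap_columns(alignment, threshold=2))
```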

Peter


From eric.talevich at gmail.com  Thu Jun  3 19:10:51 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 3 Jun 2010 15:10:51 -0400
Subject: [Biopython-dev] Fixup branch for Bio.PDB
Message-ID: <AANLkTikl98AiYImXrrYZBrWW8tKu1ZV5LM5jnLNhdBEX@mail.gmail.com>

Hi all,

I've poked around Bugzilla, taken patches for some outstanding bugs, and
applied them to a branch on GitHub:
http://github.com/etal/biopython/tree/pdbfixes
http://github.com/etal/biopython/commits/pdbfixes

I'd like to encourage people to test this branch with their own code, and if
it all still works (or nobody's interested in testing this branch), I'll
push it to the Biopython trunk so it gets tested more. Time frame: if this
branch lingers too long, there's a high chance it will cause conflicts for
João (our GSoC student) the next time he merges. How about a week?

The branch has patches for bugs 2820, 2948, 2879, 2950 and 2951:
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
http://bugzilla.open-bio.org/show_bug.cgi?id=2948
http://bugzilla.open-bio.org/show_bug.cgi?id=2879
http://bugzilla.open-bio.org/show_bug.cgi?id=2950
http://bugzilla.open-bio.org/show_bug.cgi?id=2951

Thanks,
Eric


From biopython at maubp.freeserve.co.uk  Fri Jun  4 08:44:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 4 Jun 2010 09:44:19 +0100
Subject: [Biopython-dev] Fixup branch for Bio.PDB
In-Reply-To: <AANLkTikl98AiYImXrrYZBrWW8tKu1ZV5LM5jnLNhdBEX@mail.gmail.com>
References: <AANLkTikl98AiYImXrrYZBrWW8tKu1ZV5LM5jnLNhdBEX@mail.gmail.com>
Message-ID: <AANLkTin3Yr8UX4PmksYXSPXD-0OB_D_H-mJ4d4lWSOP2@mail.gmail.com>

On Thu, Jun 3, 2010 at 8:10 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> Hi all,
>
> I've poked around Bugzilla, taken patches for some outstanding bugs, and
> applied them to a branch on GitHub:
> http://github.com/etal/biopython/tree/pdbfixes
> http://github.com/etal/biopython/commits/pdbfixes
>
> I'd like to encourage people to test this branch with their own code, and if
> it all still works (or nobody's interested in testing this branch), I'll
> push it to the Biopython trunk so it gets tested more. Time frame: if this
> branch lingers too long, there's a high chance it will cause conflicts for
> João (our GSoC student) the next time he merges. How about a week?
>
> The branch has patches for bugs 2820, 2948, 2879, 2950 and 2951:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2820
> http://bugzilla.open-bio.org/show_bug.cgi?id=2948
> http://bugzilla.open-bio.org/show_bug.cgi?id=2879
> http://bugzilla.open-bio.org/show_bug.cgi?id=2950
> http://bugzilla.open-bio.org/show_bug.cgi?id=2951
>
> Thanks,
> Eric

That sounds like a good plan.

Peter


From mjldehoon at yahoo.com  Fri Jun  4 15:55:27 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 4 Jun 2010 08:55:27 -0700 (PDT)
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <901919.44402.qm@web62402.mail.re1.yahoo.com>
Message-ID: <933074.46322.qm@web62405.mail.re1.yahoo.com>

Michael, Peter, Sebastian, Laurent, Jose, and others,

Thanks for your comments. It looks like there are lots of things to discuss, so let's start with the easiest ones.

About converting a record to a string (point 5): I agree that using __str__ is probably not the best choice, so let's use __format__ instead, or add a "write" method. The added advantage of these is that we can print out a record in different formats (xml, text, table) by specifying the requested format as an argument.
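
A minimal sketch of how the __format__ route could look, assuming a made-up
record class holding (hit id, e-value) pairs. The class name, its attributes,
the format names "text" and "table", and the output layouts are all
hypothetical and only meant to show the mechanism; they do not reproduce real
Blast output.

```python
class BlastRecord(object):
    """Hypothetical record holding (hit id, e-value) pairs for one query."""

    def __init__(self, query, hits):
        self.query = query
        self.hits = hits

    def __format__(self, format_spec):
        # An empty spec is what format(record) and "{}".format(record) pass in.
        if format_spec in ("", "text"):
            lines = ["Query= %s" % self.query]
            lines.extend("%s  %g" % hit for hit in self.hits)
            return "\n".join(lines)
        if format_spec == "table":
            return "\n".join("%s\t%s\t%g" % (self.query, name, evalue)
                             for name, evalue in self.hits)
        raise ValueError("Unknown format %r" % format_spec)


record = BlastRecord("query1", [("hitA", 1e-30), ("hitB", 0.002)])
print(format(record, "text"))
print(format(record, "table"))
```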

For point 3), maybe my wording was confusing; actually what I had in mind is the case where a given Blast program can produce different output formats (xml, text, table, etc.). This was inspired by this bug report:
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
In my mind, the different output formats are just different intermediates, but in essence they are the same and should therefore be stored in the same class. So, if I run blastp, save the result as XML, and parse it, I'd expect the same class as when I run blastp and save and parse the output in table format. Just in the latter case, some information may be missing if it is not available in the output in table format. Does that sound acceptable?

--Michiel.

 
--- On Fri, 5/28/10, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> From: Michiel de Hoon <mjldehoon at yahoo.com>
> Subject: [Biopython-dev] Blast parsers and records
> To: biopython-dev at biopython.org
> Date: Friday, May 28, 2010, 11:23 PM
> Hi everybody,
> 
> With Biopython 1.54 out (thanks Peter!), and NCBI
> encouraging the use of its new Blast+ suite of Blast programs,
> maybe this is a good time to tackle some older bugs related
> to Blast output parsing in Biopython:
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176
> (inconsistencies in the output of different Blast parsers)
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=2929
> (inconsistencies between Psi-blast parsers)
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=2319
> (parsing Blast table output)
> 
> and more generally think about the design of the Blast
> record class and Blast parsing. In my opinion, these are the
> major issues:
> 
> 1) Blast parsers are located in several modules
> (Bio.Blast.NCBIXML, Bio.Blast.NCBIStandalone,
> Bio.Blast.ParseBlastTable). I think we should have one
> read() function and one parse() function under Bio.Blast,
> with arguments specifying which format the Blast output is
> in.
> 
> 2) Blast records produced by any of the parsers should be
> consistent with each other. As XML output by blast and
> psi-blast follow the same DTD, we should be able to
> represent both by a single Record class.
> 
> 3) Different parsers should store information in this
> Record class in the same way.
> 
> 4) The current Blast record stores its information in
> attributes. If you use Bio.Entrez to parse Blast XML output
> (Biopython 1.54 contains the necessary DTDs to do so), the
> information is stored in dictionaries. This has some
> advantages. For example, it allows you to use record.keys()
> to find out what the record contains. Ideally, I think that
> a Blast Record class should inherit from a dictionary.
> 
> 5) We should be able to print a Blast record object to
> generate output that is close to the plain-text output
> generated by blast. This would allow us to generate and
> store Blast output as XML, and to convert the output to
> plain-text to make it more human-readable.
> 
> 6) The current Blast record inherits from
> Bio.Blast.Record.Header, Bio.Blast.Record.DatabaseReport,
> and Bio.Blast.Record.Parameters. I don't see the rationale
> for this inheritance, and I think we should remove it.
> 
> Any comments or suggestions (in particular about my proposal
> to have a Blast Record class that inherits from a
> dictionary)? Btw, to avoid breaking scripts, I propose that
> any changes to the Blast record and parser are implemented
> separately from the existing parsers and record, which
> would be left untouched.
> 
> --Michiel.
> 
> 


From biopython at maubp.freeserve.co.uk  Sat Jun  5 14:49:39 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 5 Jun 2010 15:49:39 +0100
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris
Message-ID: <AANLkTimJSOqEgUHokfs3P-6MNS5yKxl4_CNB5f8-X0AR@mail.gmail.com>

Hi all,

Are any Biopython folk planning to be at the EuroSciPy
conference in Paris this year (July 2010)? They are still
finalising the Scientific track, but the list of tutorials is
quite interesting already:

http://www.euroscipy.org/conference/euroscipy2010

Peter


From biopython at maubp.freeserve.co.uk  Mon Jun  7 09:35:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 10:35:15 +0100
Subject: [Biopython-dev] Working directly on the main git repository
Message-ID: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>

Hi all,

I thought I'd write down some notes about how I've been using git recently.
This may be of interest to any of the other core developers (those of us
with read-write access to the main repository), and I might get some good
tips from any discussion. The key point is that I have read+write access
to two repositories on github (the official repository AND my own fork),
so there are different advantages/disadvantages about which I choose
to work with directly as my main repository.

Our official repository has just a single stable master branch, and I
often need to work directly with this (e.g. committing small bug fixes
or adding more documentation). Therefore, if I set up a clone of the
master repository, I can work on the main branch very easily.

Now, when working on a branch for new features, I could just do this
locally and, when it is ready, merge it directly into the master.
However, this means others cannot look at my work (and I find it a
problem when working on multiple machines).

Alternatively, I could push the branches to the public "master" repository.
This would be the simplest option BUT the high visibility gives any such
experimental branch disproportionate status. I think this would be
a good idea for important (multi-person) efforts, like Python 3 work.

Instead, I have a github repository of my own (what github calls a
fork), and I push branches there.

http://github.com/biopython/biopython - the official branch(es)
http://github.com/peterjc/biopython - my branches

How does this work in practice? Like this - I clone the master
and add a reference to my repository (and I do the same when I
want to grab a branch from another developer):

git clone git at github.com:biopython/biopython.git
cd biopython
git remote add peterjc git at github.com:peterjc/biopython.git
git fetch peterjc

Then make a new local branch as usual, and when ready to share
it publicly, I push it to *my* repository on github:

git branch new-work
git checkout new-work
git commit ...
git push peterjc new-work

This would then appear as a new-work branch on my github page.
Then if I (or someone else) want to access these branches later
(e.g. from another machine), just check out the remote branch as a
tracking branch. For example,

git clone git at github.com:biopython/biopython.git
cd biopython
git remote add peterjc git at github.com:peterjc/biopython.git
git fetch peterjc
git checkout -t peterjc/seqio-imgt

This then looks like a normal branch (called just "seqio-imgt" in
this example), but git knows it is linked to the remote branch on
the "peterjc" repository (not the origin which is the "official"
repository).

I'd have to check, but I guess that if the original git clone is done
with git://github.com/biopython/biopython.git instead (read only
access) the same procedure could be used by non core devs.
However, I'm not sure this is clearer for them. I think the current
procedure (on our wiki) where you add a remote reference to
the "upstream" official repository works better in this case.

Comments?

Peter

Useful links from Google searches:
http://www.gitready.com/intermediate/2009/01/09/checkout-remote-tracked-branch.html
http://www.gitready.com/beginner/2009/03/09/remote-tracking-branches.html


From biopython at maubp.freeserve.co.uk  Mon Jun  7 13:40:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 14:40:54 +0100
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris
In-Reply-To: <AANLkTimJSOqEgUHokfs3P-6MNS5yKxl4_CNB5f8-X0AR@mail.gmail.com>
References: <AANLkTimJSOqEgUHokfs3P-6MNS5yKxl4_CNB5f8-X0AR@mail.gmail.com>
Message-ID: <AANLkTilToD0eVKZ8ZryPhJhyqVi-leNZkD8mnKcivDL2@mail.gmail.com>

On Sat, Jun 5, 2010 at 3:49 PM, Peter wrote:
> Hi all,
>
> Are any Biopython folk planning to be at the EuroSciPy
> conference in Paris this year (July 2010)? They are still
> finalising the Scientific track, but the list of tutorials is
> quite interesting already:
>
> http://www.euroscipy.org/conference/euroscipy2010
>
> Peter

Hi all,

The track list for the EuroSciPy 2010 Scientific track has
now been announced, and I'm delighted that I will be able
to present a talk on Biopython (likely 4pm Saturday 10 July).
While I hope there will be some other Biopython users there,
this is a nice opportunity to meet the broader scientific python
community. There are still places at the moment if you want
to attend:

http://www.euroscipy.org/conference/euroscipy2010

Unfortunately I will not be attending BOSC or ISMB this
year. However Brad Chapman will be there to present the
annual "Biopython Project Update" talk (as well as helping
to organise this year's BOSC and the associated CodeFest
event preceding it). I'd love to have been there too, but I'm
sure everyone attending will have a great time. Again,
registration is still open:

http://www.open-bio.org/wiki/BOSC_2010
http://www.open-bio.org/wiki/Codefest_2010

Regards,

Peter

P.S. Those of you in North America might also be
interested in the main SciPy conference in Austin, Texas
(28 June to 3 July 2010):

http://conference.scipy.org/scipy2010/


From biopython at maubp.freeserve.co.uk  Mon Jun  7 13:50:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 14:50:06 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <933074.46322.qm@web62405.mail.re1.yahoo.com>
References: <901919.44402.qm@web62402.mail.re1.yahoo.com>
	<933074.46322.qm@web62405.mail.re1.yahoo.com>
Message-ID: <AANLkTikenpBKpMWK7pPSk6RYhCW5mvb5LXjxpFilopvD@mail.gmail.com>

On Fri, Jun 4, 2010 at 4:55 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Michael, Peter, Sebastian, Laurent, Jose, and others,
>
> Thanks for your comments. It looks like there are lots of things to discuss,
> so let's start with the easiest ones.
>
> About converting a record to a string (point 5): I agree that using __str__ is
> probably not the best choice, so let's use __format__ instead, or add a "write"
> method. The added advantage of these is that we can print out a record in
> different formats (xml, text, table) by specifying the requested format as an argument.

The __format__ or format method sounds like a great idea (following other
bits of Biopython).

> For point 3), maybe my wording was confusing; actually what I had in mind
> is the case where a given Blast program can produce different output formats
> (xml, text, table, etc.). This was inspired by this bug report:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2176
> In my mind, the different output formats are just different intermediates, but
> in essence they are the same and should therefore be stored in the same
> class. So, if I run blastp, save the result as XML, and parse it, I'd expect the
> same class as when I run blastp and save and parse the output in table format.
> Just in the latter case, some information may be missing if it is not available in
> the output in table format. Does that sound acceptable?

I agree that records from all the different BLAST output formats should be
represented by a common base class - but not necessarily the same class.
For example, the default plain text and XML formats include the pairwise
alignments, but the tabular output does not. To me having a sub-class which
stores the pairwise alignments seems natural here.
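
Something along these lines, as a sketch only. Both class names and their
attributes are hypothetical; they just illustrate the base class plus
subclass arrangement described above.

```python
class BlastRecord(object):
    """Hypothetical base class: fields every output format can supply."""

    def __init__(self, query, hit_ids):
        self.query = query
        self.hit_ids = list(hit_ids)


class AlignedBlastRecord(BlastRecord):
    """Hypothetical subclass for formats that include pairwise alignments."""

    def __init__(self, query, hit_ids, alignments):
        BlastRecord.__init__(self, query, hit_ids)
        # One (query_string, subject_string) pair per hit, in the same order.
        self.alignments = list(alignments)


from_table = BlastRecord("query1", ["hitA"])
from_xml = AlignedBlastRecord("query1", ["hitA"], [("ACG-T", "ACGAT")])

# Code that only needs the common fields works with either kind of record.
for record in (from_table, from_xml):
    print("%s %s %s" % (record.query, record.hit_ids, hasattr(record, "alignments")))
```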

Peter


From biopython at maubp.freeserve.co.uk  Mon Jun  7 17:45:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 18:45:57 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
Message-ID: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>

Hi all,

Thanks for the lively discussion on the main list,

http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html
...
http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html

I've spent the afternoon updating my old branch which uses SQLite
to store the record identifier to file offset mapping. Using the code
on this branch, Bio.SeqIO.index() supports a new optional argument
currently called "db" (other names I like include "cache"; suggestions
welcome):

http://github.com/peterjc/biopython/tree/index-sqlite

The default (False) is not to use SQLite, but continue with an in
memory Python dictionary. As long as you have enough RAM
and don't plan to use the index at a later date, this will be fastest.

If set to True or a filename, then an SQLite index is used to hold
the offsets. This means very low RAM requirements, but is a lot
slower because the offsets are written to disk and the SQLite
index is updated as we go. I expect this part can be optimised
(e.g. try to build the index at the end, try committing in batches).
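
For anyone who wants to experiment with the tuning ideas without checking out
the branch, here is a stand-alone sketch using only the standard library. It
is not the code on the branch: the table layout, the function names and the
batch size are assumptions, and it only understands FASTA headers. It inserts
and commits in batches and builds the SQLite index once at the end.

```python
import io
import sqlite3


def build_offset_index(handle, db=":memory:", batch_size=10000):
    """Store FASTA record id -> file offset in SQLite; return the connection.

    handle must be opened in binary mode so tell() gives true byte offsets.
    """
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE offsets (key TEXT, offset INTEGER)")
    batch = []
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            key = line[1:].split(None, 1)[0].decode("ascii")
            batch.append((key, offset))
            if len(batch) >= batch_size:
                # Insert and commit in batches rather than one row at a time.
                con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
                con.commit()
                batch = []
    con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
    # Build the lookup index once, after all the rows are in place.
    con.execute("CREATE UNIQUE INDEX key_index ON offsets (key)")
    con.commit()
    return con


def get_offset(con, key):
    """Look up the file offset recorded for one record identifier."""
    row = con.execute("SELECT offset FROM offsets WHERE key = ?", (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]


# Tiny in-memory stand-in for a FASTA file opened in binary mode.
handle = io.BytesIO(b">alpha desc\nACGT\n>beta\nGG\nTT\n")
con = build_offset_index(handle)
print(get_offset(con, "beta"))
```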

I'm still testing this, but the core of the work is done I think.
Once we're happy with the public API, we can concentrate
on things like the SQLite schema, and optimising the code.

Peter

P.S. I know it will need a little work to fail gracefully on Python 2.4
when SQLite isn't installed.


From biopython at maubp.freeserve.co.uk  Mon Jun  7 18:23:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 19:23:05 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
Message-ID: <AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>

Peter wrote:
>...
>
> http://github.com/peterjc/biopython/tree/index-sqlite
>
> ... an SQLite index is used to hold
> the offsets. This means very low RAM requirements, but is a lot
> slower because the offsets are written to disk and the SQLite
> index is updated as we go. I expect this part can be optimised
> (e.g. try to build the index at the end, try committing in batches).

Having now tried using this on some files with tens of millions of
records, tuning how we use SQLite is going to be important.

Peter


From bioinformed at gmail.com  Mon Jun  7 21:10:42 2010
From: bioinformed at gmail.com (Kevin Jacobs <jacobs@bioinformed.com>)
Date: Mon, 7 Jun 2010 17:10:42 -0400
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
Message-ID: <AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>

On Mon, Jun 7, 2010 at 2:23 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Peter wrote:
> >...
> >
> > http://github.com/peterjc/biopython/tree/index-sqlite
> >
> > ... an SQLite index is used to hold
> > the offsets. This means very low RAM requirements, but is a lot
> > slower because the offsets are written to disk and the SQLite
> > index is updated as we go. I expect this part can be optimised
> > (e.g. try to build the index at the end, try committing in batches).
>
> Having now tried using this on some files with tens of millions of
> records, tuning how we use SQLite is going to be important.
>
>
Wouldn't a Berkeley database be much much faster for constructing simple key
to offset mappings?

-Kevin


From anaryin at gmail.com  Tue Jun  8 00:45:05 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 7 Jun 2010 19:45:05 -0500
Subject: [Biopython-dev] [GSOC] Report - Week 1
Message-ID: <AANLkTinnFWWM2ojjfBtBu2Qt4911cUZTJyI0J1hVYoji@mail.gmail.com>

Dear all,

Eric suggested that I write a weekly email wrapping up my progress, any
problems I encountered, new ideas, etc. So, here's week 1 :)

*Proposed Tasks:*
Wiki<http://www.biopython.org/wiki/GSOC2010_Joao#Week_1_.5B31st_May_-_6th_June.5D>
*Project's Github account:*
Link<http://github.com/JoaoRodrigues/biopython/tree/GSOC2010>

*Progress:*

*1. Renumbering Residues*

I wrote a small function in Structure.py
(link<http://github.com/JoaoRodrigues/biopython/blob/GSOC2010/Bio/PDB/Structure.py#L57>)
that iterates over the residues in a chain and subtracts the original first
residue number. This keeps gaps intact. It worked on my machine for a set of
75 proteins I was working on. It also lets people change the starting
residue number for whatever reason, the default being 1.

I had originally thought of having a SEQRES parsing function and using this
as a base for the new renumbering. However, most structures that lack
residues (gaps) still count them in the numbering. Since there is no parser
for SEQRES, I thought this to be the best option.

*Example*
...
s = p.get_structure('a', '2KSX.pdb')
s.renumber_residues()
s.renumber_residues(start=0)
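
The renumbering rule can be sketched on plain integers, assuming the input is
simply the list of residue numbers of one chain in file order. The function
name renumber is hypothetical and this is not the code from the GSOC2010
branch, which works on Bio.PDB residue objects.

```python
def renumber(resseqs, start=1):
    """Shift residue numbers so the first one becomes `start`.

    Every number is shifted by the same amount, so gaps are preserved.
    """
    if not resseqs:
        return []
    shift = resseqs[0] - start
    return [number - shift for number in resseqs]
```

For example, a chain numbered 5, 6, 7, 10, 11 becomes 1, 2, 3, 6, 7 with the
default start, so the missing residues still show up as a gap.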


*2. Disulphide bond search*

I originally proposed to use the NeighborSearch method but I didn't know
that subtracting two atom objects gave me their distance. I used this
instead.

I defined a threshold of 3A for a S-S since the average is 2.05A. I tried to
get some paper/doc from other software where such a limit would be already
defined but I didn't find any.. thus, I assigned 3 because its results
agreed with the SSBOND records. The user can provide a threshold integer or
float as an argument to make the search stricter or broader.

The function first generates an iterator over all possible pairs of
cysteines in the protein. It then checks and yields those pairs whose SG
atoms are closer than the threshold. The result is also an iterator, with
tuples containing pairs of residue objects.

*Example*

...
s = p.get_structure('a', '2KSX.pdb')
[i for i in s.search_ss_bonds()]
[(<Residue CYS het=  resseq=1 icode= >, <Residue CYS het=  resseq=98 icode=
>), (<Residue CYS het=  resseq=29 icode= >, <Residue CYS het=  resseq=138
icode= >), (<Residue CYS het=  resseq=12 icode= >, <Residue CYS het=
resseq=95 icode= >), (<Residue CYS het=  resseq=58 icode= >, <Residue CYS
het=  resseq=66 icode= >), (<Residue CYS het=  resseq=180 icode= >, <Residue
CYS het=  resseq=200 icode= >)]
len([i for i in s.search_ss_bonds(threshold=100)])
45
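
The search described above can be sketched without Bio.PDB, assuming the SG
coordinates have already been collected into a dictionary keyed by a residue
label. The function name search_ss_pairs and the input layout are
hypothetical; math.dist needs Python 3.8 or later.

```python
from itertools import combinations
from math import dist


def search_ss_pairs(sg_coords, threshold=3.0):
    """Yield pairs of residue labels whose SG atoms are closer than threshold.

    sg_coords maps a residue label to the (x, y, z) position of its SG atom,
    in angstroms.
    """
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sg_coords.items(), 2):
        if dist(xyz_a, xyz_b) < threshold:
            yield res_a, res_b
```

This checks every pair, so the cost grows with the square of the number of
cysteines, which is usually small for a single protein.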


*Problems:*

*3. Biological Unit*

I added code to parse_pdb_header to extract the REMARK 350 section. It
contains something like this (1IHM.pdb<http://www.pdb.org/pdb/files/1IHM.pdb>):

REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000
REMARK 350   BIOMT1   2  0.500000 -0.809017 -0.309017        0.00000
REMARK 350   BIOMT2   2  0.809017  0.309017  0.500000        0.00000
REMARK 350   BIOMT3   2 -0.309017 -0.500000  0.809017        0.00000
REMARK 350   BIOMT1   3 -0.309017 -0.500000 -0.809017        0.00000

I parse out the 4th column to identify each transformation. I store a 3x3
rotation matrix and the translation vector separately. It is then easy to
apply them to each atom record via the transform function.
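
A minimal sketch of that parsing step, assuming each BIOMT row sits on a
single line, the fields are separated by whitespace, and the three rows of a
transformation appear in order. The names parse_biomt and apply_transform are
hypothetical and this is not the code added to parse_pdb_header. The sketch
left-multiplies the coordinate by the matrix, as the BIOMT rows are laid out;
an API that right-multiplies would need the transposed matrix.

```python
def parse_biomt(lines):
    """Group REMARK 350 BIOMT rows into {serial: (rotation, translation)}.

    rotation is a list of three rows, translation a list of three floats.
    """
    transforms = {}
    for line in lines:
        fields = line.split()
        # Expected layout: REMARK 350 BIOMTn serial r1 r2 r3 t
        if len(fields) != 8 or fields[:2] != ["REMARK", "350"]:
            continue
        if not fields[2].startswith("BIOMT"):
            continue
        serial = int(fields[3])  # the 4th column identifies the transformation
        rotation, translation = transforms.setdefault(serial, ([], []))
        rotation.append([float(value) for value in fields[4:7]])
        translation.append(float(fields[7]))
    return transforms


def apply_transform(coord, rotation, translation):
    """Return rotation * coord + translation for one (x, y, z) point."""
    return tuple(
        sum(r * c for r, c in zip(row, coord)) + t
        for row, t in zip(rotation, translation)
    )
```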

Now, the problem lies in what the output should be. We broke it down to two
main options:

a. Create a new structure object for each rotated/translated object, thus
making the final output a list of structures. This takes quite a while
actually. I tried this with a deepcopy method to copy each structure and it
took over 30 seconds on my machine for that PDB file above.

b. Add the new rotated objects as new chains in the original structure. This
is actually a good solution because it allows people to use other methods
(the SS search comes to mind) on quaternary structures. It also allows the
user to write a file with all the structures in their place using PDBIO
quite seamlessly. However, it might be complicated to deal with an excess of
chains, or if not all chains are supposed to be rotated (dunno if the case
actually exists).

My personal belief is that B is the way to go. Although it adulterates the
original structure with alien chains, it allows much greater flexibility. I
haven't tested it though.

----

Comments? :)

João [...] Rodrigues


From anaryin at gmail.com  Tue Jun  8 03:42:27 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 7 Jun 2010 22:42:27 -0500
Subject: [Biopython-dev] [GSOC] Report - Week 1
In-Reply-To: <AANLkTinnFWWM2ojjfBtBu2Qt4911cUZTJyI0J1hVYoji@mail.gmail.com>
References: <AANLkTinnFWWM2ojjfBtBu2Qt4911cUZTJyI0J1hVYoji@mail.gmail.com>
Message-ID: <AANLkTinyVRT9SfeFJDl3sfI7TTmtqjdRalFh3nbE5O3h@mail.gmail.com>

Just my own heads up and comment.

I thought of using MODEL records to hold the rotated structures. Citing the
PDB format guidelines:

> This record is used only when more than one model appears in an entry.
> *Generally, it is employed mainly for NMR structures.* The chemical
> connectivity should be the same for each model. ATOM, HETATM, ANISOU, and
> TER records for each model structure and are interspersed as needed between
> MODEL and ENDMDL records.

Since REMARK 350 seems to be an X-ray-exclusive feature and, conversely, MODEL
an NMR one, I believe this could also be a possible solution. I'm adding the
code I wrote to Git. There is a huge speed problem with that deepcopy
method. If someone has a faster/better alternative, it would be great, as
this takes around 2 seconds per matrix.

Best!

João [...] Rodrigues
@ http://www.biopython.org/wiki/User:Joaor




From thomas.hamelryck at gmail.com  Tue Jun  8 06:39:53 2010
From: thomas.hamelryck at gmail.com (Thomas Hamelryck)
Date: Tue, 8 Jun 2010 08:39:53 +0200
Subject: [Biopython-dev] [GSOC] Report - Week 1
In-Reply-To: <AANLkTinnFWWM2ojjfBtBu2Qt4911cUZTJyI0J1hVYoji@mail.gmail.com>
References: <AANLkTinnFWWM2ojjfBtBu2Qt4911cUZTJyI0J1hVYoji@mail.gmail.com>
Message-ID: <AANLkTikObrjMkhEUNgEeDozAK3i06aRjfi85D_1LA9kC@mail.gmail.com>

Hi all,

I think it's great that Bio.PDB is being updated.

Here are some remarks:

I haven't seen much discussion about the one key feature of Bio.PDB
that definitely needs to be improved: its speed. With the enormous
increase of the number of structures, extracting data using Bio.PDB is
too slow. Would be good to move some parts to C.

A second issue is nicely illustrated by the following code snippet:

> s = p.get_structure('a', '2KSX.pdb')
> [i for i in s.search_ss_bonds()]

I think this is NOT the way to do it. PDB files can contain anything:
RNA, DNA, sugars, small molecules... It is thus not a good idea to
directly associate protein-specific methods to the structure class; it
will lead to a bloated Structure class and a lot of irrelevant methods
(ie. search_ss_bonds is meaningless for a PDB file that contains RNA).

Currently, one creates Polypeptide objects from a Structure object
using a factory design pattern (via PPBuilder); the Polypeptide class
implements some protein specific methods. I believe that is a much
cleaner way to do it (though we need a Protein class that represents
collections of connected polypeptides). One can also make sure that
all such derived objects (Protein, NA, DNA,...) adhere to the same
interface by providing a suitable base class with shared functionality
- in that way, the whole thing is also extendible.

Something like:

s = p.get_structure('a', '2KSX.pdb')
pb = ProteinBuilder()
proteins = pb.build(s)
ssbridges = proteins.get_ss_bonds()

Here, "proteins" would represent a collection of polypeptide chains.
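
To illustrate the builder idea above without depending on Bio.PDB, here is a
toy sketch in which a structure is just a list of chains and a chain is a
list of residue names. Molecule, Protein and ProteinBuilder are all
hypothetical classes; none of them exist in Biopython, and the amino acid set
is deliberately incomplete.

```python
class Molecule:
    """Hypothetical shared base class for Protein, DNA, RNA, ..."""

    def __init__(self, chains):
        self.chains = list(chains)

    def __len__(self):
        return len(self.chains)


class Protein(Molecule):
    """Hypothetical collection of polypeptide chains."""

    def cysteines(self):
        """Return (chain_index, position) for every CYS residue."""
        return [
            (i, j)
            for i, chain in enumerate(self.chains)
            for j, name in enumerate(chain)
            if name == "CYS"
        ]


class ProteinBuilder:
    """Hypothetical factory that keeps only the protein chains."""

    amino_acids = {"ALA", "CYS", "GLY", "SER"}  # incomplete on purpose

    def build(self, structure):
        chains = [
            chain
            for chain in structure
            if all(name in self.amino_acids for name in chain)
        ]
        return Protein(chains)
```

The point of the design is that protein-specific methods such as cysteines()
live on Protein, while the structure itself stays free of them.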

Cheers,

-Thomas

-- 
Thomas Hamelryck, Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://www.binf.ku.dk/research/structural_bioinformatics/


From lgautier at gmail.com  Tue Jun  8 07:00:10 2010
From: lgautier at gmail.com (Laurent)
Date: Tue, 08 Jun 2010 09:00:10 +0200
Subject: [Biopython-dev] Biopython-dev Digest, Vol 89, Issue 8
In-Reply-To: <mailman.2432.1275979197.3120.biopython-dev@lists.open-bio.org>
References: <mailman.2432.1275979197.3120.biopython-dev@lists.open-bio.org>
Message-ID: <4C0DEA7A.1020606@gmail.com>

On 08/06/10 08:39, biopython-dev-request at lists.open-bio.org wrote:
> On Mon, Jun 7, 2010 at 2:23 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
>> Peter wrote:
>>> ...
>>>
>>> http://github.com/peterjc/biopython/tree/index-sqlite
>>>
>>> ... an SQLite index is used to hold
>>> the offsets. This means very low RAM requirements, but is a lot
>>> slower because the offsets are written to disk and the SQLite
>>> index is updated as we go. I expect this part can be optimised
>>> (e.g. try to build the index at the end, try committing in batches).
>>
>> Having now tried using this on some files with tens of millions of
>> records, tuning how we use SQLite is going to be important.
>>
> Wouldn't a Berkeley database be much much faster for constructing simple key
> to offset mappings?
>
> -Kevin
>

Yes. If one is only looking for a key/value associative structure, the
NoSQL solutions will be faster (tokyocabinet seems to be one of the
fastest, up to 100x when compared to BerkeleyDB:
http://www.ioremap.net/node/235).

L.


From biopython at maubp.freeserve.co.uk  Tue Jun  8 09:35:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 10:35:15 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
Message-ID: <AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>

On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote:
> On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
>>
>> Having now tried using this on some files with tens of millions of
>> records, tuning how we use SQLite is going to be important.
>>
>>
> Wouldn't a Berkeley database be much much faster for constructing
> simple key to offset mappings?
>

Maybe - now that I've done the refactoring on Bio.SeqIO.index() to
allow two back ends (python dict or SQLite) trying a third (BDB) is
much easier. Did you know BDB was used in the old OBDA index
files? However, Python 2.6 deprecated bsddb (the Python Interface
to Berkeley DB library) and Python is pushing people to SQLite3
instead.

Peter


From krother at rubor.de  Tue Jun  8 09:59:43 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 8 Jun 2010 11:59:43 +0200
Subject: [Biopython-dev] Tested  Fixup branch for Bio.PDB
Message-ID: <df95eaa0e6f3c40d451630cb54332b3c-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWF9QSVlUUAw=-webmailer2@server04.webmailer.hosteurope.de>


Hi Eric,

I've checked out your pdbfixes branch and ran our 431 unit tests of
ModeRNA with it. The results did not change compared to the master Bio.PDB
branch, so for us everything is OK.

Details:
ModeRNA (http://www.genesilico.pl/moderna) engineers RNA 3D structures and
uses Bio.PDB for most of its operations: reading files,
adding/copying/manipulating residues/atoms, superimposing structures,
searching neighbors by KDTree, writing files.

Right, the tests most probably did not depend directly on the code you
changed, but as I understand it you wanted to make sure the branch didn't
break anything by accident.

Best Regards,
    Kristian


From bioinformed at gmail.com  Tue Jun  8 11:00:44 2010
From: bioinformed at gmail.com (Kevin Jacobs <jacobs@bioinformed.com>)
Date: Tue, 8 Jun 2010 07:00:44 -0400
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
Message-ID: <AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>

On Tue, Jun 8, 2010 at 5:35 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote:
> > On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
> >>
> >> Having now tried using this on some files with tens of millions of
> >> records, tuning how we use SQLite is going to be important.
> >>
> > Wouldn't a Berkeley database be much much faster for constructing
> > simple key to offset mappings?
>
> Maybe - now that I've done the refactoring on Bio.SeqIO.index() to
> allow two back ends (python dict or SQLite) trying a third (BDB) is
> much easier. Did you know BDB was used in the old OBDA index
> files? However, Python 2.6 deprecated bsddb (the Python Interface
> to Berkeley DB library) and Python is pushing people to SQLite3
> instead.
>
>
Hi Peter,

I am aware that SQLite is taking over the job of serving as the default
embedded database for Python and am in vigorous agreement with that trend.
 I use SQLite for a wide range of tasks and am extremely happy with it for
most applications.  Unfortunately, for pure key-value mapping tasks, I've
found  SQLite to be 4-10x slower than a well-tuned BDB tree, even with
batched updates and using the most aggressive SQLite performance pragmas. My
results may not be typical, but I thought I'd raise the issue given the
magnitude of the performance difference.
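
For readers unfamiliar with the term, "performance pragmas" are settings of
the kind sketched below. Which pragmas and values were used in the tests
described above is not stated, so this particular selection is an assumption;
the function name open_fast_sqlite is hypothetical. These settings trade
crash safety for write speed, which may be acceptable for an index that can
be rebuilt from the sequence file.

```python
import sqlite3


def open_fast_sqlite(path):
    """Open an SQLite database with durability traded for write speed."""
    con = sqlite3.connect(path)
    # Do not wait for the operating system to flush writes to disk.
    con.execute("PRAGMA synchronous = OFF")
    # Keep the rollback journal in memory rather than in a file.
    con.execute("PRAGMA journal_mode = MEMORY")
    # Ask for a larger page cache (a negative value is a size in KiB).
    con.execute("PRAGMA cache_size = -200000")
    return con
```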

Best regards,
-Kevin


From mjldehoon at yahoo.com  Tue Jun  8 12:19:28 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 8 Jun 2010 05:19:28 -0700 (PDT)
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <AANLkTikenpBKpMWK7pPSk6RYhCW5mvb5LXjxpFilopvD@mail.gmail.com>
Message-ID: <14055.47665.qm@web62401.mail.re1.yahoo.com>

--- On Mon, 6/7/10, Peter <biopython at maubp.freeserve.co.uk> wrote:
> I agree that records from all the different BLAST output
> formats should be represented by a common base class -
> but not necessarily the same class.
> For example, the default plain text and XML formats include
> the pairwise alignments, but the tabular output does not. To
> me having a sub-class which stores the pairwise alignments seems
> natural here.

Why do we need a sub-class? We don't do this in Bio.SeqIO, where GenBank files contain much more information than Fasta files, but both are represented by a SeqRecord.

Best,
--Michiel.


From biopython at maubp.freeserve.co.uk  Tue Jun  8 12:32:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 13:32:05 +0100
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <14055.47665.qm@web62401.mail.re1.yahoo.com>
References: <AANLkTikenpBKpMWK7pPSk6RYhCW5mvb5LXjxpFilopvD@mail.gmail.com>
	<14055.47665.qm@web62401.mail.re1.yahoo.com>
Message-ID: <AANLkTimVwjENsgiOgMw167HZ5IVkUAXg7aDtNB3xQd0K@mail.gmail.com>

On Tue, Jun 8, 2010 at 1:19 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> --- On Mon, 6/7/10, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> I agree that records from all the different BLAST output
>> formats should be represented by a common base class -
>> but not necessarily the same class.
>> For example, the default plain text and XML formats include
>> the pairwise alignments, but the tabular output does not. To
>> me having a sub-class which stores the pairwise alignments seems
>> natural here.
>
> Why do we need a sub-class? We don't do this in Bio.SeqIO,
> where GenBank files contain much more information than Fasta
> files, but both are represented by a SeqRecord.

OK, I guess you could have some properties which are left empty
(like the annotations dictionary or features list in a SeqRecord
from a FASTA file).
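
As a rough illustration of the "one class, some properties left empty" idea,
here is a sketch using a dataclass. BlastHit and its fields are hypothetical
names for this example only; they are not an existing or proposed Biopython
class.

```python
from dataclasses import dataclass, field


@dataclass
class BlastHit:
    """Hypothetical single record type for every BLAST output format."""

    query_id: str
    subject_id: str
    evalue: float
    # Filled in by the plain text and XML parsers; left empty for tabular
    # output, which carries no pairwise alignments.
    alignments: list = field(default_factory=list)
```

A tabular parser would create BlastHit("query1", "subject1", 1e-5) and leave
alignments as an empty list, much like the features list of a SeqRecord read
from FASTA.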

Peter


From mjldehoon at yahoo.com  Tue Jun  8 13:44:01 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 8 Jun 2010 06:44:01 -0700 (PDT)
Subject: [Biopython-dev] Blast parsers and records
In-Reply-To: <AANLkTimVwjENsgiOgMw167HZ5IVkUAXg7aDtNB3xQd0K@mail.gmail.com>
Message-ID: <756890.46421.qm@web62404.mail.re1.yahoo.com>

--- On Tue, 6/8/10, Peter <biopython at maubp.freeserve.co.uk> wrote:
> > Why do we need a sub-class? We don't do this in
> > Bio.SeqIO, where GenBank files contain much more
> > information than Fasta files, but both are
> > represented by a SeqRecord.
> 
> OK, I guess you could have some properties which are left
> empty
> (like the annotations dictionary or features list in a
> SeqRecord from a FASTA file).

I would prefer that, as it keeps things simple and consistent with other parts of Biopython. But let's see how it goes. Over the weekend I'll set up a rudimentary Blast parser and record so we can see what it would look like in practice.

--Michiel


From bpederse at gmail.com  Tue Jun  8 15:47:18 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Tue, 8 Jun 2010 08:47:18 -0700
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
Message-ID: <AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>

On Tue, Jun 8, 2010 at 4:00 AM, Kevin Jacobs <jacobs at bioinformed.com>
<bioinformed at gmail.com> wrote:
> On Tue, Jun 8, 2010 at 5:35 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:
>
>> On Mon, Jun 7, 2010 at 10:10 PM, Kevin Jacobs wrote:
>> > On Mon, Jun 7, 2010 at 2:23 PM, Peter wrote:
>> >>
>> >> Having now tried using this on some files with tens of millions of
>> >> records, tuning how we use SQLite is going to be important.
>> >>
>> > Wouldn't a Berkeley database be much much faster for constructing
>> > simple key to offset mappings?
>>
>> Maybe - now that I've done the refactoring on Bio.SeqIO.index() to
>> allow two back ends (python dict or SQLite) trying a third (BDB) is
>> much easier. Did you know BDB was used in the old OBDA index
>> files? However, Python 2.6 deprecated bsddb (the Python Interface
>> to Berkeley DB library) and Python is pushing people to SQLite3
>> instead.
>>
>>
> Hi Peter,
>
> I am aware that SQLite is taking over the job of serving as the default
> embedded database for Python and am in vigorous agreement with that trend.
> I use SQLite for a wide range of tasks and am extremely happy with it for
> most applications. Unfortunately, for pure key-value mapping tasks, I've
> found SQLite to be 4-10x slower than a well-tuned BDB tree, even with
> batched updates and using the most aggressive SQLite performance pragmas. My
> results may not be typical, but I thought I'd raise the issue given the
> magnitude of the performance difference.
>
> Best regards,
> -Kevin
>

my results may not be typical either, but using an earlier version of
peter's sqlite biopython branch and comparing to screed
(http://github.com/acr/screed), and my file-index
(http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i
found that biopython's implementation is at most, a bit more than 2x
slower. and it does the fastq parsing much more rigorously.

also, i didn't see much difference between berkeleydb and
tokyocabinet--though the ctypes-based TC wrapper i was using has since
been streamlined.
here's what i saw for 15+ million records with this script:
http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py

/opt/src/methylcode/data/s_1_sequence.txt
benchmarking fastq file with 15646356 records (62585424 lines)
performing 500000 random queries

screed
------
create: 704.764
search: 51.717

biopython-sqlite
----------------
create: 727.868
search: 92.947

fileindex
---------
create: 294.356
search: 53.701


From biopython at maubp.freeserve.co.uk  Tue Jun  8 16:35:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 17:35:07 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
Message-ID: <AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>

On Tue, Jun 8, 2010 at 4:47 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>
> my results may not be typical either, but using an earlier version of
> peter's sqlite biopython branch and comparing to screed
> (http://github.com/acr/screed), and my file-index
> (http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i
> found that biopython's implementation is at most, a bit more than 2x
> slower. and it does the fastq parsing much more rigorously.
>
> also, i didn't see much difference between berkeleydb and
> tokyocabinet--though the ctypes-based TC wrapper i was using has since
> been streamlined.
> here's what i saw for 15+ million records with this script:
> http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py
>
> /opt/src/methylcode/data/s_1_sequence.txt
> benchmarking fastq file with 15646356 records (62585424 lines)
> performing 500000 random queries
>
> screed
> ------
> create: 704.764
> search: 51.717
>
> biopython-sqlite
> ----------------
> create: 727.868
> search: 92.947
>
> fileindex
> ---------
> create: 294.356
> search: 53.701

Are you using a recent version of screed (with SQLite internally)?

Which back end are your "fileindex" numbers for? BDB?

I'd say that the slow "search" from (the old branch of) Biopython is
down to our FASTQ parsing time, which includes lots of object
creation. The get_raw method can be useful here depending on
what you want to achieve:
http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/

The version you tried didn't do anything clever with the SQLite
indexes, batched inserts etc. I'm hoping the current code will be
faster (although there is likely a penalty from having two switchable
back ends). Brent, could you re-run this benchmark with this code:
http://github.com/peterjc/biopython/tree/index-sqlite-batched

You'll need to change the Biopython call in your test script from
this (it was renamed before landing on the trunk):

fi = SeqIO.indexed_dict(f, idx, "fastq")

to this:

fi = SeqIO.index(f, idx, "fastq", db=True)

or give an explicit filename:

fi = SeqIO.index(f, idx, "fastq", db="/tmp/filename.idx")

where db is the new parameter for controlling where and if
the lookup table is stored on disk.

Peter


From anaryin at gmail.com  Tue Jun  8 17:10:48 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 8 Jun 2010 12:10:48 -0500
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com> 
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com> 
	<AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com>
Message-ID: <AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com>

Hello all,

I'm replying here to what Thomas wrote on the GSOC Report thread because it
seems a better place.

> PDB files can contain anything RNA, DNA, sugars, small molecules... It is
> thus not a good idea to directly associate protein-specific methods to the
> structure class; it will lead to a bloated Structure class and a lot of
> irrelevant methods (ie. search_ss_bonds is meaningless for a PDB file that
> contains RNA).


Agree.

> Currently, one creates Polypeptide objects from a Structure object using a
> factory design pattern (via PPBuilder); the Polypeptide class implements
> some protein specific methods. I believe that is a much cleaner way to do it
> (though we need a Protein class that represents collections of connected
> polypeptides). One can also make sure that all such derived objects
> (Protein, NA, DNA,...) adhere to the same interface by providing a suitable
> base class with shared functionality - in that way, the whole thing is also
> extendible.
>

I think there has been already some discussion about this. My personal
opinion/suggestion is having a structure like:

Bio.PDB/
_______/Protein.py
_______/DNA.py
_______/RNA.py

that would translate to a usage of something like:

from Bio.PDB import Protein
structure = Protein('1ABC.pdb')
structure.search_ss_bonds()

but not

structure.calc_melting_temperature() (just an example)

Protein() would call PDBParser(). It could also include, to a certain
extent, an Alphabet-like feature to ensure residue names are OK (this ties
in with this proposal<http://www.biopython.org/wiki/GSOC2010_Joao#Residue_name_normalisation>).
I believe this is in line with what you said: having a class that basically
abstracts what we do now (Bio.PDB.PDBParser) and allows for
molecule-specific methods. However, it also leads to some problems:
Protein/DNA complexes come to mind.

How does this sound? I think it goes with what Eric said in the first post
of this thread and what Thomas replied in the GSOC thread. We should also
change the PDB name to Struct to better reflect the purpose of the module.
All of the other additions like Bio.Struct.WWW would still apply. And I
don't see a major problem in breaking the existing code by adding this.

João


From tiagoantao at gmail.com  Tue Jun  8 19:12:00 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 8 Jun 2010 20:12:00 +0100
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
References: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
Message-ID: <AANLkTimsuefKtzbYOWGHpCfs0xfqMWfHdPVRn6UfHl8L@mail.gmail.com>

On Mon, Jun 7, 2010 at 10:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Comments?

Maybe put this on the wiki as doc for good practice?


From biopython at maubp.freeserve.co.uk  Tue Jun  8 19:41:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 20:41:03 +0100
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To: <AANLkTimsuefKtzbYOWGHpCfs0xfqMWfHdPVRn6UfHl8L@mail.gmail.com>
References: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
	<AANLkTimsuefKtzbYOWGHpCfs0xfqMWfHdPVRn6UfHl8L@mail.gmail.com>
Message-ID: <AANLkTin452_8Ka5Y_dLlggF6B3SnvxE6jWru5Hhnr-sQ@mail.gmail.com>

2010/6/8 Tiago Antão <tiagoantao at gmail.com>:
> On Mon, Jun 7, 2010 at 10:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Comments?
>
> Maybe put this on the wiki as doc for good practice?

So this does seem like a sensible approach (for those
of us with commit access to the main repository)?

We can add it to the git usage page then...
http://www.biopython.org/wiki/GitUsage

Peter


From eric.talevich at gmail.com  Tue Jun  8 21:45:42 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 8 Jun 2010 17:45:42 -0400
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
References: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
Message-ID: <AANLkTiluRrKJ9AhHHIVwUSm0zXqpyQqn-TIVQUZHkBBF@mail.gmail.com>

On Mon, Jun 7, 2010 at 5:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi all,
>
> I thought I'd write down some notes about how I've been using git recently.
> This may be of interest to any of the other core developers (those of us
> with read-write access to the main repository), and I might get some good
> tips from any discussion. The key point is that I have read+write access
> to two repositories on github (the official repository AND my own fork),
> so there are different advantages/disadvantages about which I choose
> to work with directly as my main repository.
>
> [...]
>
> Instead, I have a github repository of my own (what github calls a
> fork), and I push branches there.
>
> http://github.com/biopython/biopython - the official branch(es)
> http://github.com/peterjc/biopython - my branches
>
> How does this work in practice? Like this - I clone the master
> and add a reference to my repository (and I do the same when I
> want to grab a branch from another developer):
>
> git clone git at github.com:biopython/biopython.git
> cd biopython
> git remote add peterjc git at github.com:peterjc/biopython.git
> git fetch peterjc
>
> Then make a new local branch as usual, and when ready to share
> it publicly, I push it to *my* repository on github:
>
> git branch new-work
> git checkout new-work
> git commit ...
> git push peterjc new-work
>
> This would then appear as a new-work branch on my github page.
> Then if I (or someone else) wants to access these branches later
> (e.g. from another machine) just use the checkout tracked remote
> branch. For example,
>
> git clone git at github.com:biopython/biopython.git
> cd biopython
> git remote add peterjc git at github.com:peterjc/biopython.git
> git fetch peterjc
> git checkout -t peterjc/seqio-imgt
>
> This then looks like a normal branch (called just "seqio-imgt" in
> this example), but git knows it is linked to the remote branch on
> the "peterjc" repository (not the origin which is the "official"
> repository).
>

This looks reasonable to me. I'd add that the procedure to delete a public
branch from your personal fork on GitHub is a little obscure:

git branch -a   # list local and remote branches
git branch -d new-work   # delete a local branch that's been merged already
git push peterjc :new-work  # delete the public branch from GitHub

This doesn't do what you'd expect:
git branch -d peterjc/new-work

That only removes your local reference to the public branch; the branch
is still visible on GitHub.

(It's kind of hard to find in the GitHub documentation.)


> I'd have to check, but I guess that if the original git clone is done
> with git://github.com/biopython/biopython.git instead (read only
> access) the same procedure could be used by non core devs.
> However, I'm not sure this is clearer for them. I think the current
> procedure (on our wiki) where you add a remote reference to
> the "upstream" official repository works better in this case.
>

I still have an "upstream" reference to the main repo. I wouldn't want to
accidentally push something foolish to the main repo with a stray "git
push"... better to have the safe thing happen by default.

If the initial clone was from biopython master, and you later create a
personal fork on GitHub, then it's not too hard to switch the references
around in your local repo to make the public fork your "origin".

-Eric


From bugzilla-daemon at portal.open-bio.org  Tue Jun  8 22:52:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Jun 2010 18:52:28 -0400
Subject: [Biopython-dev] [Bug 3096] New: PPBuilder build_peptides bugs
Message-ID: <bug-3096-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3096

           Summary: PPBuilder build_peptides bugs
           Product: Biopython
           Version: Not Applicable
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: skong at zymeworks.com


Given a chain of backbone connected residues 'IXRGXTGL' that contains two
non-standard amino acids 'X' in between, building peptide with only standard
amino acid builder should return two peptides 'RG' and 'TGL'. 'I' should not be
returned as a peptide since it is just one residue. Currently biopython would
return 'IXGXGL', with two bugs in between:

1. Skipping a standard amino acid (R and T) after each X, while keeping X (it
should skip X, not R or T). Related to
http://bugzilla.open-bio.org/show_bug.cgi?id=2910 and
http://lists.open-bio.org/pipermail/biopython/2009-September/005532.html
2. Return one peptide even though after filtering the two X residues which
connect 'I', 'RG', 'TGL' are no longer present and fragment 'IRGTGL' cannot be
considered as a valid peptide without the two Xs connecting them.

The above sequence 'IXRGXTGL' is taken from 1bfe and mutated. The 'mutation'
referred to here is simply renaming the residue to something that is not
standard and is therefore represented as 'X'.

Each solution proposed below is meant to fix the respective bug above:
1. Insert (not accept(prev) or not accept(next)) after the aa_only check at line
299 of Bio/PDB/Polypeptide.py
2. Insert pp=None when either of the residues compared is filtered, at line 300
of Bio/PDB/Polypeptide.py


Amino acid filtering bug in method build_peptides() of class _PPBuilder in
Bio/PDB/Polypeptide.py:

Original:
        for chain in chain_list:
            chain_it=iter(chain)
            prev=chain_it.next()
            pp=None
            for next in chain_it:
                if aa_only and not accept(prev):
                    prev=next
                    continue
                if is_connected(prev, next):
                    if pp is None:
                        pp=Polypeptide()
                        pp.append(prev)
                        pp_list.append(pp)
                    pp.append(next)
                else:
                    pp=None
                prev=next
        return pp_list


Fixed:

        for chain in chain_list:
            chain_it=iter(chain)
            prev=chain_it.next()
            pp=None
            for next in chain_it:
                if aa_only and (not accept(prev) or not accept(next)):
                    prev=next; pp=None
                    continue
                if is_connected(prev, next):
                    if pp is None:
                        pp=Polypeptide()
                        pp.append(prev)
                        pp_list.append(pp)
                    pp.append(next)
                else:
                    pp=None
                prev=next
        return pp_list
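
The difference between the two loops can be checked without any PDB file. What
follows is only a minimal sketch, not the Bio.PDB code: residues are plain
one-letter strings, build_peptides() here is a stand-alone function with a
hypothetical "fixed" switch, accept() rejects only 'X', and is_connected() is
assumed to be always true (every neighbouring pair is bonded, as in the chopped
1bfe fragment). It behaves as if aa_only were True.

```python
def build_peptides(chain, accept, is_connected, fixed=False):
    """Group residues into peptides, mimicking the aa_only=True loop.

    fixed=False follows the original loop, fixed=True the proposed one.
    """
    peptides = []
    chain_it = iter(chain)
    prev = next(chain_it)
    pp = None
    for nxt in chain_it:
        if fixed:
            rejected = not accept(prev) or not accept(nxt)
        else:
            rejected = not accept(prev)
        if rejected:
            prev = nxt
            if fixed:
                pp = None  # proposed fix: a rejected residue ends the peptide
            continue
        if is_connected(prev, nxt):
            if pp is None:
                pp = [prev]
                peptides.append(pp)
            pp.append(nxt)
        else:
            pp = None
        prev = nxt
    return ["".join(p) for p in peptides]


def accept(residue):
    # Assumption: 'X' stands for any non-standard residue.
    return residue != "X"


def is_connected(prev, nxt):
    # Assumption: all neighbouring residues are backbone-connected.
    return True


chain = "IXRGXTGL"
original = build_peptides(chain, accept, is_connected)
proposed = build_peptides(chain, accept, is_connected, fixed=True)
print(original)  # ['IXGXGL']
print(proposed)  # ['RG', 'TGL']
```

With these stand-ins the original loop reproduces the reported 'IXGXGL', and
the proposed loop gives 'RG' and 'TGL' with the lone 'I' dropped.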

Attached here is the code used to test the above case, with and without
mutations, and with and without standard amino acid filtering. The case without
mutation is just to show that the backbone atoms of the mutated version are
connected:

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder, is_aa 

class StandardAABuilder(PPBuilder): 
    """ Polypeptide builder which accepts only standard amino acids.""" 
    def _accept(self, residue): 
        return is_aa(residue, standard=True) 

def extract_peptides(model):
    """Extracts the peptides from a model.
    Returns a list of peptide sequence strings."""
    output = []
    for peptide in PPBuilder().build_peptides(model): 
        seq = str(peptide.get_sequence())
        output.append(seq)
    return output

def extract_peptides_saa(model):
    """Extracts the peptides from a model.
    Returns a list of peptide sequence strings."""
    output = []
    for peptide in StandardAABuilder().build_peptides(model): 
        seq = str(peptide.get_sequence())
        output.append(seq)
    return output

if __name__ == '__main__':

    oripdb = open('chopped_pdb1bfe.ent')
    sto = PDBParser().get_structure('', oripdb)
    seqao = extract_peptides(sto)
    seqbo = extract_peptides_saa(sto)
    print 'ori seq all '
    print seqao  
    print 'ori seq standard only'
    print seqbo

    pdb = open('chopped_mutated_pdb1bfe.ent')
    st = PDBParser().get_structure('', pdb)
    seqa = extract_peptides(st)
    seqb = extract_peptides_saa(st)
    print 'mut seq all'
    print seqa
    print 'mut seq standard only '
    print seqb


Attached below are the two fragments of PDB files, pre and post mutated.

chopped_pdb1bfe.ent
ATOM     85  N   ILE A 316      37.386  71.217  31.070  1.00 36.97           N  
ATOM     86  CA  ILE A 316      38.311  71.290  29.949  1.00 33.71           C  
ATOM     87  C   ILE A 316      37.634  72.103  28.862  1.00 33.93           C  
ATOM     88  O   ILE A 316      36.415  72.216  28.839  1.00 36.46           O  
ATOM     89  CB  ILE A 316      38.651  69.876  29.404  1.00 35.79           C  
ATOM     90  CG1 ILE A 316      39.331  69.049  30.501  1.00 36.78           C  
ATOM     91  CG2 ILE A 316      39.572  69.979  28.187  1.00 37.71           C  
ATOM     92  CD1 ILE A 316      39.881  67.724  30.023  1.00 39.20           C  
ATOM     93  N   HIS A 317      38.425  72.679  27.969  1.00 35.61           N  
ATOM     94  CA  HIS A 317      37.880  73.473  26.881  1.00 37.92           C  
ATOM     95  C   HIS A 317      38.360  72.928  25.540  1.00 37.79           C  
ATOM     96  O   HIS A 317      39.463  73.240  25.094  1.00 37.44           O  
ATOM     97  CB  HIS A 317      38.303  74.930  27.052  1.00 35.19           C  
ATOM     98  CG  HIS A 317      37.888  75.519  28.363  1.00 35.76           C  
ATOM     99  ND1 HIS A 317      36.611  75.981  28.602  1.00 37.74           N  
ATOM    100  CD2 HIS A 317      38.575  75.701  29.516  1.00 37.59           C  
ATOM    101  CE1 HIS A 317      36.529  76.420  29.844  1.00 38.74           C  
ATOM    102  NE2 HIS A 317      37.706  76.262  30.421  1.00 36.76           N  
ATOM    103  N   ARG A 318      37.527  72.109  24.905  1.00 38.78           N  
ATOM    104  CA  ARG A 318      37.884  71.512  23.627  1.00 42.04           C  
ATOM    105  C   ARG A 318      38.469  72.559  22.699  1.00 45.14           C  
ATOM    106  O   ARG A 318      39.592  72.425  22.205  1.00 42.05           O  
ATOM    107  CB  ARG A 318      36.657  70.880  22.967  1.00 42.93           C  
ATOM    108  CG  ARG A 318      36.934  70.321  21.576  1.00 38.60           C  
ATOM    109  CD  ARG A 318      35.654  70.038  20.821  1.00 35.39           C  
ATOM    110  NE  ARG A 318      34.624  69.538  21.724  1.00 34.96           N  
ATOM    111  CZ  ARG A 318      34.539  68.278  22.141  1.00 31.51           C  
ATOM    112  NH1 ARG A 318      35.419  67.373  21.736  1.00 25.19           N  
ATOM    113  NH2 ARG A 318      33.579  67.929  22.983  1.00 29.10           N  
ATOM    114  N   GLY A 319      37.690  73.604  22.461  1.00 49.96           N  
ATOM    115  CA  GLY A 319      38.138  74.668  21.592  1.00 55.53           C  
ATOM    116  C   GLY A 319      38.459  74.219  20.180  1.00 58.85           C  
ATOM    117  O   GLY A 319      37.583  73.766  19.440  1.00 58.98           O  
ATOM    118  N   SER A 320      39.734  74.334  19.823  1.00 61.64           N  
ATOM    119  CA  SER A 320      40.219  73.992  18.493  1.00 63.16           C  
ATOM    120  C   SER A 320      40.212  72.517  18.110  1.00 65.27           C  
ATOM    121  O   SER A 320      39.558  72.127  17.145  1.00 65.12           O  
ATOM    122  CB  SER A 320      41.634  74.542  18.316  1.00 65.36           C  
ATOM    123  OG  SER A 320      42.124  74.255  17.019  1.00 72.05           O  
ATOM    124  N   THR A 321      40.955  71.702  18.853  1.00 67.43           N  
ATOM    125  CA  THR A 321      41.049  70.274  18.562  1.00 67.73           C  
ATOM    126  C   THR A 321      40.220  69.430  19.529  1.00 66.41           C  
ATOM    127  O   THR A 321      39.244  69.917  20.095  1.00 70.21           O  
ATOM    128  CB  THR A 321      42.517  69.810  18.620  1.00 70.22           C  
ATOM    129  OG1 THR A 321      42.613  68.453  18.169  1.00 77.03           O  
ATOM    130  CG2 THR A 321      43.049  69.915  20.045  1.00 72.07           C  
ATOM    131  N   GLY A 322      40.608  68.168  19.707  1.00 61.22           N  
ATOM    132  CA  GLY A 322      39.892  67.286  20.614  1.00 53.23           C  
ATOM    133  C   GLY A 322      40.037  67.705  22.065  1.00 48.00           C  
ATOM    134  O   GLY A 322      40.138  68.892  22.372  1.00 50.41           O  
ATOM    135  N   LEU A 323      40.044  66.734  22.968  1.00 41.92           N  
ATOM    136  CA  LEU A 323      40.190  67.033  24.385  1.00 35.58           C  
ATOM    137  C   LEU A 323      41.613  66.738  24.874  1.00 31.41           C  
ATOM    138  O   LEU A 323      41.932  66.921  26.046  1.00 30.47           O  
ATOM    139  CB  LEU A 323      39.160  66.240  25.191  1.00 35.76           C  
ATOM    140  CG  LEU A 323      37.716  66.576  24.802  1.00 39.50           C  
ATOM    141  CD1 LEU A 323      36.733  65.796  25.670  1.00 38.15           C  
ATOM    142  CD2 LEU A 323      37.493  68.074  24.955  1.00 38.58           C

chopped_mutated_pdb1bfe.ent
ATOM     85  N   ILE A 316      37.386  71.217  31.070  1.00 36.97           N  
ATOM     86  CA  ILE A 316      38.311  71.290  29.949  1.00 33.71           C  
ATOM     87  C   ILE A 316      37.634  72.103  28.862  1.00 33.93           C  
ATOM     88  O   ILE A 316      36.415  72.216  28.839  1.00 36.46           O  
ATOM     89  CB  ILE A 316      38.651  69.876  29.404  1.00 35.79           C  
ATOM     90  CG1 ILE A 316      39.331  69.049  30.501  1.00 36.78           C  
ATOM     91  CG2 ILE A 316      39.572  69.979  28.187  1.00 37.71           C  
ATOM     92  CD1 ILE A 316      39.881  67.724  30.023  1.00 39.20           C  
ATOM     93  N   HIE A 317      38.425  72.679  27.969  1.00 35.61           N  
ATOM     94  CA  HIE A 317      37.880  73.473  26.881  1.00 37.92           C  
ATOM     95  C   HIE A 317      38.360  72.928  25.540  1.00 37.79           C  
ATOM     96  O   HIE A 317      39.463  73.240  25.094  1.00 37.44           O  
ATOM     97  CB  HIE A 317      38.303  74.930  27.052  1.00 35.19           C  
ATOM     98  CG  HIE A 317      37.888  75.519  28.363  1.00 35.76           C  
ATOM     99  ND1 HIE A 317      36.611  75.981  28.602  1.00 37.74           N  
ATOM    100  CD2 HIE A 317      38.575  75.701  29.516  1.00 37.59           C  
ATOM    101  CE1 HIE A 317      36.529  76.420  29.844  1.00 38.74           C  
ATOM    102  NE2 HIE A 317      37.706  76.262  30.421  1.00 36.76           N
ATOM    103  N   ARG A 318      37.527  72.109  24.905  1.00 38.78           N  
ATOM    104  CA  ARG A 318      37.884  71.512  23.627  1.00 42.04           C  
ATOM    105  C   ARG A 318      38.469  72.559  22.699  1.00 45.14           C  
ATOM    106  O   ARG A 318      39.592  72.425  22.205  1.00 42.05           O  
ATOM    107  CB  ARG A 318      36.657  70.880  22.967  1.00 42.93           C  
ATOM    108  CG  ARG A 318      36.934  70.321  21.576  1.00 38.60           C  
ATOM    109  CD  ARG A 318      35.654  70.038  20.821  1.00 35.39           C  
ATOM    110  NE  ARG A 318      34.624  69.538  21.724  1.00 34.96           N  
ATOM    111  CZ  ARG A 318      34.539  68.278  22.141  1.00 31.51           C  
ATOM    112  NH1 ARG A 318      35.419  67.373  21.736  1.00 25.19           N  
ATOM    113  NH2 ARG A 318      33.579  67.929  22.983  1.00 29.10           N  
ATOM    114  N   GLY A 319      37.690  73.604  22.461  1.00 49.96           N  
ATOM    115  CA  GLY A 319      38.138  74.668  21.592  1.00 55.53           C  
ATOM    116  C   GLY A 319      38.459  74.219  20.180  1.00 58.85           C  
ATOM    117  O   GLY A 319      37.583  73.766  19.440  1.00 58.98           O  
ATOM    118  N   XQQ A 320      39.734  74.334  19.823  1.00 61.64           N  
ATOM    119  CA  XQQ A 320      40.219  73.992  18.493  1.00 63.16           C  
ATOM    120  C   XQQ A 320      40.212  72.517  18.110  1.00 65.27           C  
ATOM    121  O   XQQ A 320      39.558  72.127  17.145  1.00 65.12           O  
ATOM    122  CB  XQQ A 320      41.634  74.542  18.316  1.00 65.36           C  
ATOM    123  OG  XQQ A 320      42.124  74.255  17.019  1.00 72.05           O
ATOM    124  N   THR A 321      40.955  71.702  18.853  1.00 67.43           N  
ATOM    125  CA  THR A 321      41.049  70.274  18.562  1.00 67.73           C  
ATOM    126  C   THR A 321      40.220  69.430  19.529  1.00 66.41           C  
ATOM    127  O   THR A 321      39.244  69.917  20.095  1.00 70.21           O  
ATOM    128  CB  THR A 321      42.517  69.810  18.620  1.00 70.22           C  
ATOM    129  OG1 THR A 321      42.613  68.453  18.169  1.00 77.03           O  
ATOM    130  CG2 THR A 321      43.049  69.915  20.045  1.00 72.07           C  
ATOM    131  N   GLY A 322      40.608  68.168  19.707  1.00 61.22           N  
ATOM    132  CA  GLY A 322      39.892  67.286  20.614  1.00 53.23           C  
ATOM    133  C   GLY A 322      40.037  67.705  22.065  1.00 48.00           C  
ATOM    134  O   GLY A 322      40.138  68.892  22.372  1.00 50.41           O  
ATOM    135  N   LEU A 323      40.044  66.734  22.968  1.00 41.92           N  
ATOM    136  CA  LEU A 323      40.190  67.033  24.385  1.00 35.58           C  
ATOM    137  C   LEU A 323      41.613  66.738  24.874  1.00 31.41           C  
ATOM    138  O   LEU A 323      41.932  66.921  26.046  1.00 30.47           O  
ATOM    139  CB  LEU A 323      39.160  66.240  25.191  1.00 35.76           C  
ATOM    140  CG  LEU A 323      37.716  66.576  24.802  1.00 39.50           C  
ATOM    141  CD1 LEU A 323      36.733  65.796  25.670  1.00 38.15           C  
ATOM    142  CD2 LEU A 323      37.493  68.074  24.955  1.00 38.58           C


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bpederse at gmail.com  Wed Jun  9 04:33:12 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Tue, 8 Jun 2010 21:33:12 -0700
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
Message-ID: <AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>

On Tue, Jun 8, 2010 at 9:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Jun 8, 2010 at 4:47 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>>
>> my results may not be typical either, but using an earlier version of
>> peter's sqlite biopython branch and comparing to screed
>> (http://github.com/acr/screed), and my file-index
>> (http://github.com/brentp/bio-playground/tree/master/fileindex/ ) i
>> found that biopython's implementation is at most, a bit more than 2x
>> slower. and it does the fastq parsing much more rigorously.
>>
>> also, i didn't see much difference between berkeleydb and
>> tokyocabinet--though the ctypes-based TC wrapper i was using has since
>> been streamlined.
>> here's what i saw for 15+ million records with this script:
>> http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py
>>
>> /opt/src/methylcode/data/s_1_sequence.txt
>> benchmarking fastq file with 15646356 records (62585424 lines)
>> performing 500000 random queries
>>
>> screed
>> ------
>> create: 704.764
>> search: 51.717
>>
>> biopython-sqlite
>> ----------------
>> create: 727.868
>> search: 92.947
>>
>> fileindex
>> ---------
>> create: 294.356
>> search: 53.701
>
> Are you using a recent version of screed (with SQLite internally)?
>
> Which back end are your "fileindex" numbers for? BDB?
>
> I'd say that the slow "search" from (the old branch of) Biopython is
> down to our FASTQ parsing time, which includes lots of object
> creation. The get_raw method can be useful here depending on
> what you want to achieve:
> http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/
>
> The version you tried didn't do anything clever with the SQLite
> indexes, batched inserts etc. I'm hoping the current code will be
> faster (although there is likely a penalty from having two switchable
> back ends). Brent, could you re-run this benchmark with this code:
> http://github.com/peterjc/biopython/tree/index-sqlite-batched
>
> You'll need to change the Biopython call in your test script from
> this (it was renamed before landing on the trunk):
>
> fi = SeqIO.indexed_dict(f, idx, "fastq")
>
> to this:
>
> fi = SeqIO.index(f, idx, "fastq", db=True)
>
> or give an explicit filename:
>
> fi = SeqIO.index(f, idx, "fastq", db="/tmp/filename.idx")
>
> where db is the new parameter for controlling where and if
> the lookup table is stored on disk.
>
> Peter
>

done. the previous times and the current were using py-tcdb not bsddb.
the author of tcdb made some improvements so it's faster this time,
and your SeqIO implementation is almost 2x as fast to load as the
previous one. that's a nice implementation. i didn't try get_raw.

these timings are with your latest version, and the version of
screed pulled from http://github.com/acr/screed master today.

/opt/src/methylcode/data/s_1_sequence.txt
benchmarking fastq file with 15646356 records (62585424 lines)
performing 500000 random queries

screed
------
create: 699.210
search: 51.043

biopython-sqlite
----------------
create: 386.647
search: 93.391

fileindex
---------
create: 184.088
search: 48.887


From bugzilla-daemon at portal.open-bio.org  Wed Jun  9 08:43:02 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Jun 2010 04:43:02 -0400
Subject: [Biopython-dev] [Bug 3096] PPBuilder build_peptides bugs
In-Reply-To: <bug-3096-42@http.bugzilla.open-bio.org/>
Message-ID: <201006090843.o598h2tx024780@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3096


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-09 04:43 EST -------
(In reply to comment #0)
> Given a chain of backbone connected residues 'IXRGXTGL' that contains two
> non-standard amino acids 'X' in between, building peptide with only standard
> amino acid builder should return two peptides 'RG' and 'TGL'. 'I' should not
> be returned as a peptide since it is just one residue. Currently biopython
> would return 'IXGXGL', with two bugs in between:

What is wrong with returning 'IXGXGL'? The PDB contains a peptide of six
linked residues doesn't it? It looks like Bio.PDB is doing something sensible.

P.S. You didn't fill in which version of Biopython you are using.




From biopython at maubp.freeserve.co.uk  Wed Jun  9 08:55:37 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Jun 2010 09:55:37 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
	<AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
Message-ID: <AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>

On Wed, Jun 9, 2010 at 5:33 AM, Brent Pedersen <bpederse at gmail.com> wrote:
>>
>> The version you tried didn't do anything clever with the SQLite
>> indexes, batched inserts etc. I'm hoping the current code will be
>> faster (although there is likely a penalty from having two switchable
>> back ends). Brent, could you re-run this benchmark with this code:
>> http://github.com/peterjc/biopython/tree/index-sqlite-batched
>> ...
>
> done.

Thank you Brent :)

> the previous times and the current were using py-tcdb not bsddb.
> the author of tcdb made some improvements so it's faster this time,

OK, so you are using Tokyo Cabinet to store the lookup table here
rather than BDB. Link, http://code.google.com/p/py-tcdb/

> and your SeqIO implementation is almost 2x as fast to load as the
> previous one. that's a nice implementation. i didn't try get_raw.

I've got some more re-factoring in mind which should help a little
more (but mainly to make the structure clearer).

> these timings are with your latest version, and the version of
> screed pulled from http://github.com/acr/screed master today.

Having had a quick look, they are using SQLite3 in much the
same way as I was initially. They create the index before loading
(rather than after loading) and they use a single insert per
offset (rather than using a batch in a transaction or the
executemany method). I'm pretty sure from my experiments
those changes would speed up screed's loading time a lot
(probably in line with the speed up I achieved).
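
To make the two changes concrete, here is a rough sketch using only the
standard sqlite3 module. It is not the code from the branch or from screed:
the table name "offset_data", the index name "key_index" and the helper
functions are made up for illustration. The inserts go through executemany
inside one transaction, and the index is only created once loading is done.

```python
import sqlite3


def build_offset_db(offsets, filename=":memory:"):
    """Store (key, offset) pairs; hypothetical schema for illustration."""
    con = sqlite3.connect(filename)
    con.execute("CREATE TABLE offset_data (key TEXT, offset INTEGER)")
    # One transaction for the whole batch instead of one insert per offset.
    with con:
        con.executemany(
            "INSERT INTO offset_data (key, offset) VALUES (?, ?)", offsets
        )
    # Build the index after loading, so it is not updated on every insert.
    con.execute("CREATE UNIQUE INDEX key_index ON offset_data (key)")
    con.commit()
    return con


def lookup(con, key):
    """Return the stored offset for key, or raise KeyError."""
    row = con.execute(
        "SELECT offset FROM offset_data WHERE key=?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]


con = build_offset_db([("read1", 0), ("read2", 120), ("read3", 245)])
print(lookup(con, "read2"))  # 120
```

How much this helps will depend on the data and the SQLite build, so it would
need timing on a real file.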

> /opt/src/methylcode/data/s_1_sequence.txt
> benchmarking fastq file with 15646356 records (62585424 lines)
> performing 500000 random queries
>
> screed
> ------
> create: 699.210
> search: 51.043
>
> biopython-sqlite
> ----------------
> create: 386.647
> search: 93.391
>
> fileindex
> ---------
> create: 184.088
> search: 48.887

That's got us looking more competitive. As noted above, I think
screed's loading time could be much reduced by tweaking how
they use SQLite3. I wonder what the breakdown for fileindex is
between calling Tokyo Cabinet and the fileindex code itself?
I guess we should try TC as the back end in Bio.SeqIO.index()
for comparison.

Peter

P.S. Could you measure the database file sizes on disk?


From thomas.hamelryck at gmail.com  Wed Jun  9 12:18:41 2010
From: thomas.hamelryck at gmail.com (Thomas Hamelryck)
Date: Wed, 9 Jun 2010 14:18:41 +0200
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com>
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com>
	<AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com>
	<AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com>
Message-ID: <AANLkTilC9Jqf4Kl0QOIYkGWl309F-kDpnbWHASyRE1T5@mail.gmail.com>

Hi,

On Tue, Jun 8, 2010 at 7:10 PM, João Rodrigues <anaryin at gmail.com> wrote:

>
> from Bio.PDB import Protein
> structure = Protein('1ABC.pdb')
> structure.search_ss_bonds()
>

Indeed, that would run into problems for complexes where proteins, RNA, DNA,
etc. occur in the same file. It makes much more sense to have a Structure
centred approach:

proteins=Protein(structure)
chains=proteins.get_chains()
chain_a=chains["A"]
polypeptides=chain_a.get_peptides()

rnas=RNA(structure)

etc.
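
As a toy illustration of the idea (not a proposal for the actual classes), the
sketch below assumes a "structure" is just a dict mapping chain IDs to lists of
residue names, which is a stand-in for a real Bio.PDB Structure. MoleculeView,
its residue_names attribute and the name sets are hypothetical; each view
simply keeps the chains that contain residues of its own molecule type, so a
protein/RNA complex can be wrapped by both views at once.

```python
AMINO_ACIDS = frozenset([
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
])
RNA_BASES = frozenset(["A", "C", "G", "U"])


class MoleculeView(object):
    """View of the chains in a structure that match one molecule type."""

    residue_names = frozenset()

    def __init__(self, structure):
        # structure: dict of chain id -> list of residue names (assumption).
        self._chains = {}
        for chain_id, residues in structure.items():
            selected = [r for r in residues if r in self.residue_names]
            if selected:
                self._chains[chain_id] = selected

    def get_chains(self):
        return dict(self._chains)


class Protein(MoleculeView):
    residue_names = AMINO_ACIDS


class RNA(MoleculeView):
    residue_names = RNA_BASES


structure = {"A": ["ILE", "HIS", "ARG", "GLY"], "B": ["G", "C", "U", "A"]}
proteins = Protein(structure)
rnas = RNA(structure)
print(sorted(proteins.get_chains()))  # ['A']
print(sorted(rnas.get_chains()))      # ['B']
```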

-Thomas

-- 
Thomas Hamelryck, Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://wiki.binf.ku.dk/User:Thomas_Hamelryck
http://www.binf.ku.dk/research/structural_bioinformatics/


From lgautier at gmail.com  Wed Jun  9 12:28:20 2010
From: lgautier at gmail.com (Laurent)
Date: Wed, 09 Jun 2010 14:28:20 +0200
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <mailman.2432.1275979197.3120.biopython-dev@lists.open-bio.org>
References: <mailman.2432.1275979197.3120.biopython-dev@lists.open-bio.org>
Message-ID: <4C0F88E4.7070607@gmail.com>

What about having a class instance instead ? This would let one change 
the index storage system very easily.

For example, to use a dictionary:

Bio.SeqIO.index(keyval_map = dict() )

A minimal requirement for the instance 'keyval_map' passed would be to 
implement the methods __getitem__(self, key) and __setitem__(self, key, 
value), allowing the "duck typing" approach commonly found in Python.

An SQLite-based index would be a matter of having a class such as:

class KeyValSQLite(object):
   def __init__(self, filename):
       # create the database into file "filename"
       pass

   def __getitem__(self, key):
       """ return the value """
       # select whatever in something where key='<key>'...
       pass

   def __setitem__(self, key, value):
       # update...
       pass


Then this would be a call like:

Bio.SeqIO.index(keyval_map = KeyValSQLite("myindex.db"))


Now that you have the idea, getting a custom index based on BDB or 
anything should be a breeze...
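
For what it's worth, a filled-in version of the skeleton above might look like
the following. This is only a sketch under assumptions: the table layout is
invented, index_fasta_text() is a hypothetical stand-in for the indexing code
(Bio.SeqIO.index() has no keyval_map argument), the offsets are character
offsets into a string rather than byte offsets into a file, and every header
line is assumed to carry a non-empty identifier.

```python
import sqlite3


class KeyValSQLite(object):
    """Minimal key -> integer store backed by SQLite."""

    def __init__(self, filename):
        self._con = sqlite3.connect(filename)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS offsets "
            "(key TEXT PRIMARY KEY, value INTEGER)"
        )

    def __getitem__(self, key):
        row = self._con.execute(
            "SELECT value FROM offsets WHERE key=?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        with self._con:  # commit each update
            self._con.execute(
                "INSERT OR REPLACE INTO offsets (key, value) VALUES (?, ?)",
                (key, value),
            )


def index_fasta_text(text, keyval_map):
    """Record the offset of each '>' header line in any mapping-like object."""
    offset = 0
    for line in text.splitlines(True):
        if line.startswith(">"):
            keyval_map[line[1:].split()[0]] = offset
        offset += len(line)
    return keyval_map


text = ">a\nACGT\n>b desc\nGG\n"
in_memory = index_fasta_text(text, dict())
on_disk = index_fasta_text(text, KeyValSQLite(":memory:"))
print(in_memory["b"], on_disk["b"])  # 8 8
```

The indexing function only relies on __setitem__, so a dict and the SQLite
class are interchangeable here.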


L.

On 08/06/10 08:39, biopython-dev-request at lists.open-bio.org wrote:
> Hi all,
>
> Thanks for the lively discussion on the main list,
>
> http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html
> ...
> http://lists.open-bio.org/pipermail/biopython/2010-June/006580.html
>
> I've spent the afternoon updating my old branch which uses SQLite
> to store the record identifier to file offset mapping. Using the code
> on this branch, Bio.SeqIO.index() supports a new optional argument
> currently called "db" (other names I like including "cache", suggestions
> welcome):
>
> http://github.com/peterjc/biopython/tree/index-sqlite
>
> The default (False) is not to use SQLite, but continue with an in
> memory Python dictionary. As long as you have enough RAM
> and don't plan to use the index at a later date, this will be fastest.
>
> If set to True or a filename, then an SQLite index is used to hold
> the offsets. This means very low RAM requirements, but is a lot
> slower because the offsets are written to disk and the SQLite
> index is updated as we go. I expect this part can be optimised
> (e.g. try to build the index at the end, try committing in batches).
>
> I'm still testing this, but the core of the work is done I think.
> Once we're happy with the public API, we can concentrate
> on things like the SQLite schema, and optimising the code.
>
> Peter
>
> P.S. I know it will need a little work to fail gracefully on Python 2.4
> when SQLite isn't installed.
>


From biopython at maubp.freeserve.co.uk  Wed Jun  9 12:53:39 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Jun 2010 13:53:39 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <4C0F88E4.7070607@gmail.com>
References: <mailman.2432.1275979197.3120.biopython-dev@lists.open-bio.org>
	<4C0F88E4.7070607@gmail.com>
Message-ID: <AANLkTilZPP2928RbnzTl7c-CcyWlJ8QBThG1XOuiJ8ZX@mail.gmail.com>

On Wed, Jun 9, 2010 at 1:28 PM, Laurent <lgautier at gmail.com> wrote:
> What about having a class instance instead ? This would let one change the
> index storage system very easily.

That is essentially what the recent code on my branch is doing, but
the back end isn't being exposed to the public API (yet).

> Then this would be a call like:
>
> Bio.SeqIO.index(keyval_map = KeyValSQLite("myindex.db"))
>
>
> Now that you have the idea, getting a custom index based on BDB or
> anything should be a breeze...

Indeed. Most DB-like back ends should offer a bulk loader we can
exploit via the dict's update method.
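
As a rough sketch of that idea, a back end could expose a dict-style update()
method that hands a whole batch to the database in one go. The class name,
table layout and use of sqlite3 here are assumptions for illustration only,
not the code on the branch.

```python
import sqlite3


class BulkOffsetStore(object):
    """Hypothetical back end exposing a dict-style update() bulk loader."""

    def __init__(self, filename=":memory:"):
        self._con = sqlite3.connect(filename)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS offsets "
            "(key TEXT PRIMARY KEY, value INTEGER)"
        )

    def update(self, pairs):
        # Accept a dict or an iterable of (key, value) pairs, like dict.update.
        if hasattr(pairs, "items"):
            pairs = pairs.items()
        # The "with" block wraps the whole batch in a single transaction.
        with self._con:
            self._con.executemany(
                "INSERT OR REPLACE INTO offsets (key, value) VALUES (?, ?)",
                pairs,
            )

    def __getitem__(self, key):
        row = self._con.execute(
            "SELECT value FROM offsets WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __len__(self):
        return self._con.execute("SELECT COUNT(*) FROM offsets").fetchone()[0]


store = BulkOffsetStore()
store.update({"read_1": 0, "read_2": 245})
store.update(("read_%i" % i, i * 100) for i in range(3, 1003))
```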

Peter


From eric.talevich at gmail.com  Wed Jun  9 13:31:18 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 9 Jun 2010 09:31:18 -0400
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com> 
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com> 
	<AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com> 
	<AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com>
Message-ID: <AANLkTikldxobdl9u2B2NDlYe2qpe74H1exiZxmGMcEnY@mail.gmail.com>

On Tue, Jun 8, 2010 at 1:10 PM, João Rodrigues <anaryin at gmail.com> wrote:

> Hello all,
>
> I'm replying here to what Thomas wrote on the GSOC Report thread because it
> seems a better place.
>
>> PDB files can contain anything: RNA, DNA, sugars, small molecules... It is
>> thus not a good idea to
>> directly associate protein-specific methods to the structure class; it
>> will lead to a bloated Structure class and a lot of irrelevant methods (ie.
>> search_ss_bonds is meaningless for a PDB file that contains RNA).
>
>
> Agree.
>
>> Currently, one creates Polypeptide objects from a Structure object using a
>> factory design pattern (via PPBuilder); the Polypeptide class implements
>> some protein specific methods. I believe that is a much cleaner way to do it
>> (though we need a Protein class that represents collections of connected
>> polypeptides). One can also make sure that all such derived objects
>> (Protein, NA, DNA,...) adhere to the same interface by providing a suitable
>> base class with shared functionality - in that way, the whole thing is also
>> extendible.
>>
>
> I think there has been already some discussion about this. My personal
> opinion/suggestion is having a structure like:
>
> Bio.PDB/
> _______/Protein.py
> _______/DNA.py
> _______/RNA.py
>
> that would translate to usage like:
>
> from Bio.PDB import Protein
> structure = Protein('1ABC.pdb')
> structure.search_ss_bonds()
>
> but not
>
> structure.calc_melting_temperature() (just an example)
>

How about:

from Bio import Struct

# extract the protein from a bound TF structure
complex = Struct.read("3IKT.pdb")
prot = complex.as_protein()

# which is a wrapper for:
from Bio.Struct.Protein import Protein
# if Protein contains a Structure instance:
prot = Protein(complex)
# or, if Protein inherits from Structure:
prot = Protein.from_structure(complex)


The Bio.Struct.Protein module would mostly wrap Bio.PDB's protein-specific
functionality, and contain a class called Protein which you construct using
a Bio.PDB.Structure.Structure instance, in some way.

I think the convenience methods as_protein, as_dna and as_rna are acceptable
additions to the Structure class if that saves us from (a) polluting
Structure with protein- and RNA-specific methods, or (b) requiring a slew of
imports to reach any new functionality. You can add as_protein yourself and
leave the other methods for other brave souls to implement. (Bio.Struct.RNA
deserves its own directory, and I don't know of anyone working on a
structural DNA branch.)


> Protein() would call PDBParser(). It could also include, to a certain
> extent, an Alphabet-like feature to assure residue names are OK (this goes a
> bit with this proposal<http://www.biopython.org/wiki/GSOC2010_Joao#Residue_name_normalisation>).
> I believe this goes a bit into what you said. Having a class that basically
> abstracts what we do now (Bio.PDB.PDBParser) and allows for
> molecule-specific methods. However, it also leads to some problems:
> Protein/DNA complexes come to mind.
>
> How does this sound? I think it goes with what Eric said in the first post
> of this thread and what Thomas replied in the GSOC thread. We should also
> change the PDB name to Struct to better reflect the purpose of the module.
> All of the other additions like Bio.Struct.WWW would still apply. And I
> don't see a major problem in breaking the existing code by adding this.
>

To be clear, we don't need to rename anything -- Bio.Struct and Bio.PDB can
live in harmony for the foreseeable future.

Best,
Eric


From bpederse at gmail.com  Wed Jun  9 14:42:29 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Wed, 9 Jun 2010 07:42:29 -0700
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
	<AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
	<AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
Message-ID: <AANLkTimmo4rIJYBZ5zCV-fXpLkrqsUNs-VBvOOsLpk-a@mail.gmail.com>

On Wed, Jun 9, 2010 at 1:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 9, 2010 at 5:33 AM, Brent Pedersen <bpederse at gmail.com> wrote:
>>>
>>> The version you tried didn't do anything clever with the SQLite
>>> indexes, batched inserts etc. I'm hoping the current code will be
>>> faster (although there is likely a penalty from having two switchable
>>> back ends). Brent, could you re-run this benchmark with this code:
>>> http://github.com/peterjc/biopython/tree/index-sqlite-batched
>>> ...
>>
>> done.
>
> Thank you Brent :)
>
>> the previous times and the current were using py-tcdb not bsddb.
>> the author of tcdb made some improvements so it's faster this time,
>
> OK, so you are using Tokyo Cabinet to store the lookup table here
> rather than BDB. Link, http://code.google.com/p/py-tcdb/
>
>> and your SeqIO implementation is almost 2x as fast to load as the
>> previous one. that's a nice implementation. i didn't try get_raw.
>
> I've got some more re-factoring in mind which should help a little
> more (but mainly to make the structure clearer).
>
>> these timings are with your latest version, and the version of
>> screed pulled from http://github.com/acr/screed master today.
>
> Having had a quick look, they are using SQLite3 in much the
> same way as I was initially. They create the index before loading
> (rather than after loading) and they use a single insert per
> offset (rather than using a batch in a transaction or the
> executemany method). I'm pretty sure from my experiments
> those changes would speed up screed's loading time a lot
> (probably in line with the speed up I achieved).
>
>> /opt/src/methylcode/data/s_1_sequence.txt
>> benchmarking fastq file with 15646356 records (62585424 lines)
>> performing 500000 random queries
>>
>> screed
>> ------
>> create: 699.210
>> search: 51.043
>>
>> biopython-sqlite
>> ----------------
>> create: 386.647
>> search: 93.391
>>
>> fileindex
>> ---------
>> create: 184.088
>> search: 48.887
>
> That's got us looking more competitive. As noted above, I think
> screed's loading time could be much reduced by tweaking how
> they use SQLite3. I wonder what the breakdown for fileindex is
> between calling Tokyo Cabinet and the fileindex code itself?
> I guess we should try TK as the back end in Bio.SeqIO.index()
> for comparison.
>
> Peter
>
> P.S. Could you measure the database file sizes on disk?
>

for raw reads, screed, fileindex(tcdb), biopython respectively:
-rw-r--r-T 1 brentp users  3.3G 2009-11-17 13:32
/opt/src/methylcode/data/s_1_sequence.txt
-rw-r--r-- 1 brentp brentp 3.8G 2010-06-08 16:09
/opt/src/methylcode/data/s_1_sequence.txt_screed
-rw-r--r-- 1 brentp brentp 1.2G 2010-06-08 16:21
/opt/src/methylcode/data/s_1_sequence.txt.fidx
-rw-r--r-- 1 brentp brentp 1.5G 2010-06-08 21:15
/opt/src/methylcode/data/s_1_sequence.txt.bidx

that's not using any compression for the fileindex.
i think the overhead of the fileindex code + tcdb code is pretty low
now. i think there'd only be improvement
using a cython or c version of a TC wrapper--and even then, not much.
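
For a rough feel of what those sizes mean per record, the arithmetic can be
done as below. The record count comes from the benchmark output quoted above;
reading "G" as 2**30 bytes is an assumption about how the listing was
produced, and since the listing rounds to one decimal place the results are
only approximate.

```python
# Figures copied from the benchmark and file listing above.
N_RECORDS = 15646356
GIB = 2 ** 30  # assumption: "G" in the listing means 2**30 bytes

index_sizes_gib = {"screed": 3.8, "fileindex (tcdb)": 1.2, "biopython": 1.5}

# Approximate on-disk bytes used per FASTQ record by each index file.
bytes_per_record = {
    name: size * GIB / N_RECORDS for name, size in index_sizes_gib.items()
}

for name, value in sorted(bytes_per_record.items()):
    print("%-18s ~%.0f bytes per record" % (name, value))
```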

-brentp


From biopython at maubp.freeserve.co.uk  Wed Jun  9 14:55:23 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Jun 2010 15:55:23 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
	<AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
	<AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
Message-ID: <AANLkTinD2RBMlJF3AF_eOAahPN7Bn3hWyg--GhWCnk8y@mail.gmail.com>

On Wed, Jun 9, 2010 at 9:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Having had a quick look, they are using SQLite3 in much the
> same way as I was initially. They create the index before loading
> (rather than after loading) and they use a single insert per
> offset (rather than using a batch in a transaction or the
> executemany method). I'm pretty sure from my experiments
> those changes would speed up screed's loading time a lot
> (probably in line with the speed up I achieved).
>

Do you fancy trying this version of screed? It seems much
faster on medium sized FASTQ files:-

http://github.com/peterjc/screed/tree/sqlite-tweaks

I'm still running a few tests myself, but will pass this on to
the screed team unless I find some regressions.
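
For anyone curious, the kind of change described in the quoted paragraph
(insert in batches inside a transaction, and create the index only after
loading) looks roughly like the sketch below. The table layout, batch size and
function name are assumptions for illustration; this is not the schema used by
screed or by the Biopython branch.

```python
import itertools
import sqlite3


def load_offsets(con, pairs, batch_size=10000):
    """Bulk-load (key, offset) pairs, building the index only at the end."""
    # No PRIMARY KEY or index yet, so inserts do not have to maintain one.
    con.execute("CREATE TABLE offsets (key TEXT, value INTEGER)")
    pairs = iter(pairs)
    while True:
        batch = list(itertools.islice(pairs, batch_size))
        if not batch:
            break
        with con:  # one transaction (and one commit) per batch
            con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
    # Creating the index once, after loading, avoids updating it per insert.
    with con:
        con.execute("CREATE UNIQUE INDEX key_index ON offsets (key)")


con = sqlite3.connect(":memory:")
records = (("read_%i" % i, i * 100) for i in range(25000))
load_offsets(con, records, batch_size=10000)

count = con.execute("SELECT COUNT(*) FROM offsets").fetchone()[0]
offset = con.execute(
    "SELECT value FROM offsets WHERE key = ?", ("read_42",)
).fetchone()[0]
```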

Peter


From bpederse at gmail.com  Wed Jun  9 15:56:27 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Wed, 9 Jun 2010 08:56:27 -0700
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTinD2RBMlJF3AF_eOAahPN7Bn3hWyg--GhWCnk8y@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
	<AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
	<AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
	<AANLkTinD2RBMlJF3AF_eOAahPN7Bn3hWyg--GhWCnk8y@mail.gmail.com>
Message-ID: <AANLkTilokHoei_oEHVsFlq_2o3X1ZZdeH75G6oPo5XzU@mail.gmail.com>

On Wed, Jun 9, 2010 at 7:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 9, 2010 at 9:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>
>> Having had a quick look, they are using SQLite3 in much the
>> same way as I was initially. They create the index before loading
>> (rather than after loading) and they use a single insert per
>> offset (rather than using a batch in a transaction or the
>> executemany method). I'm pretty sure from my experiments
>> those changes would speed up screed's loading time a lot
>> (probably in line with the speed up I achieved).
>>
>
> Do you fancy trying this version of screed? It seems much
> faster on medium sized FASTQ files:-
>
> http://github.com/peterjc/screed/tree/sqlite-tweaks
>
> I'm still running a few tests myself, but will pass this on to
> the screed team unless I find some regressions.
>
> Peter
>

not too much difference.

screed
------
create: 666.381
search: 51.839


From biopython at maubp.freeserve.co.uk  Wed Jun  9 16:19:24 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Jun 2010 17:19:24 +0100
Subject: [Biopython-dev] Storing Bio.SeqIO.index() offsets in SQLite
In-Reply-To: <AANLkTilokHoei_oEHVsFlq_2o3X1ZZdeH75G6oPo5XzU@mail.gmail.com>
References: <AANLkTikk33JEaSNBevWfNZpEyOFW4PiTvBIOtw7i_kub@mail.gmail.com>
	<AANLkTilU9Bavj3RRlpd8pGXwukTk-vxomDBfIQLQFtOc@mail.gmail.com>
	<AANLkTimeDRYn2cYV4MMp2W4dDS_cXpRQefUYi4QH0O6Q@mail.gmail.com>
	<AANLkTikXN0U7ZjhlKZMjRHB_4OPUQXz-WtC_qDS0_H1N@mail.gmail.com>
	<AANLkTincBvgf1pBsEq3a60ftvGPYgFUP9hC800BWC1lV@mail.gmail.com>
	<AANLkTimaz_cdp8Ca2VEzJhXgOfZv9ftw3wgU8itF3nKt@mail.gmail.com>
	<AANLkTikkuO37pyLQFeHG4Doq8Ebo463HUd-h6uxcfSGe@mail.gmail.com>
	<AANLkTil-azYjpsrP7W2PG6H6dLJhv2WV-Ag0tY1kRH_C@mail.gmail.com>
	<AANLkTilMHNtJx5QivAWSjFiXnW1BFWi86H-ow8kBq6pi@mail.gmail.com>
	<AANLkTinD2RBMlJF3AF_eOAahPN7Bn3hWyg--GhWCnk8y@mail.gmail.com>
	<AANLkTilokHoei_oEHVsFlq_2o3X1ZZdeH75G6oPo5XzU@mail.gmail.com>
Message-ID: <AANLkTikYjpuhFAN5fjHXVtnvKO0AH-bYNVNv5GOx-W81@mail.gmail.com>

On Wed, Jun 9, 2010 at 4:56 PM, Brent Pedersen <bpederse at gmail.com> wrote:
> On Wed, Jun 9, 2010 at 7:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Wed, Jun 9, 2010 at 9:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>
>> Do you fancy trying this version of screed? It seems much
>> faster on medium sized FASTQ files:-
>>
>> http://github.com/peterjc/screed/tree/sqlite-tweaks
>>
>> I'm still running a few tests myself, but will pass this on to
>> the screed team unless I find some regressions.
>>
>> Peter
>>
>
> not too much difference.
>
> screed
> ------
> create: 666.381
> search: 51.839

Still noticeable, but not quite as much of a speed up as I was
seeing (but different example, different OS, etc). Anyway, I've
sent them a "pull request" and they can merge it if they like.

Peter


From rodrigo_faccioli at uol.com.br  Wed Jun  9 17:35:24 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Wed, 9 Jun 2010 14:35:24 -0300
Subject: [Biopython-dev] Working directly on the main git repository
In-Reply-To: <AANLkTiluRrKJ9AhHHIVwUSm0zXqpyQqn-TIVQUZHkBBF@mail.gmail.com>
References: <AANLkTikhnHfMIQfWU3ORrlfG1z2Zxsdx-sEycFSB3oa-@mail.gmail.com>
	<AANLkTiluRrKJ9AhHHIVwUSm0zXqpyQqn-TIVQUZHkBBF@mail.gmail.com>
Message-ID: <AANLkTimbkbFYM-1M9-EMO4xyRNDbAbdxEnt68NGUlmv_@mail.gmail.com>

Regarding your GitHub problem, you could try running the command below after
removing your local branch.

git push git at github.com:<my_account>/<my_repository>.git :heads/<mybranch>

I found the command above in [1].

[1]
http://originblog.wordpress.com/2008/04/28/github-tips-removing-a-remote-branch/

Thanks in advance,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5


On Tue, Jun 8, 2010 at 6:45 PM, Eric Talevich <eric.talevich at gmail.com> wrote:

> On Mon, Jun 7, 2010 at 5:35 AM, Peter <biopython at maubp.freeserve.co.uk
> >wrote:
>
> > Hi all,
> >
> > I thought I'd write down some notes about how I've been using git
> recently.
> > This may be of interest to any of the other core developers (those of us
> > with read-write access to the main repository), and I might get some good
> > tips from any discussion. The key point is that I have read+write access
> > to two repositories on github (the official repository AND my own fork),
> > so there are different advantages/disadvantages about which I choose
> > to work with directly as my main repository.
> >
> > [...]
> >
> > Instead, I have a github repository of my own (what github calls a
> > fork), and I push branches there.
> >
> > http://github.com/biopython/biopython - the official branch(es)
> > http://github.com/peterjc/biopython - my branches
> >
> > How does this work in practice? Like this - I clone the master
> > and add a reference to my repository (and I do the same when I
> > want to grab a branch from another developer):
> >
> > git clone git at github.com:biopython/biopython.git
> > cd biopython
> > git remote add peterjc git at github.com:peterjc/biopython.git
> > git fetch peterjc
> >
> > Then make a new local branch as usual, and when ready to share
> > it publicly, I push it to *my* repository on github:
> >
> > git branch new-work
> > git checkout new-work
> > git commit ...
> > git push peterjc new-work
> >
> > This would then appear as a new-work branch on my github page.
> > Then if I (or someone else) wants to access these branches later
> > (e.g. from another machine) just use the checkout tracked remote
> > branch. For example,
> >
> > git clone git at github.com:biopython/biopython.git
> > cd biopython
> > git remote add peterjc git at github.com:peterjc/biopython.git
> > git fetch peterjc
> > git checkout -t peterjc/seqio-imgt
> >
> > This then looks like a normal branch (called just "seqio-imgt" in
> > this example), but git knows it is linked to the remote branch on
> > the "peterjc" repository (not the origin which is the "official"
> > repository).
> >
>
> This looks reasonable to me. I'd add that the procedure to delete a public
> branch from your personal fork on GitHub is a little obscure:
>
> git branch -a   # list local and remote branches
> git branch -d new-work   # delete a local branch that's been merged already
> git push peterjc :new-work  # delete the public branch from GitHub
>
> This doesn't do what you'd expect:
> git branch -d peterjc/new-work
>
> That only removes your local reference to the public branch; the branch
> is still visible on GitHub.
>
> (It's kind of hard to find in the GitHub documentation.)
>
>
> > I'd have to check, but I guess that if the original git clone is done
> > with git://github.com/biopython/biopython.git instead (read only
> > access) the same procedure could be used by non core devs.
> > However, I'm not sure this is clearer for them. I think the current
> > procedure (on our wiki) where you add a remote reference to
> > the "upstream" official repository works better in this case.
> >
>
> I still have an "upstream" reference to the main repo. I wouldn't want to
> accidentally push something foolish to the main repo with a stray "git
> push"... better to have the safe thing happen by default.
>
> If the initial clone was from biopython master, and you later create a
> personal fork on GitHub, then it's not too hard to switch the references
> around in your local repo to make the public fork your "origin".
>
> -Eric
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>


From eric.talevich at gmail.com  Wed Jun  9 23:56:35 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 9 Jun 2010 19:56:35 -0400
Subject: [Biopython-dev] Tested Fixup branch for Bio.PDB
In-Reply-To: <df95eaa0e6f3c40d451630cb54332b3c-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWF9QSVlUUAw=-webmailer2@server04.webmailer.hosteurope.de>
References: <df95eaa0e6f3c40d451630cb54332b3c-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWF9QSVlUUAw=-webmailer2@server04.webmailer.hosteurope.de>
Message-ID: <AANLkTinESGFm0Z2m7VAWYgtWXh9wXVYRViDKtPx4rKN6@mail.gmail.com>

On Tue, Jun 8, 2010 at 5:59 AM, Kristian Rother <krother at rubor.de> wrote:

>
> Hi Eric,
>
> I've checked out your pdbfixes branch and ran our 431 Unit Tests of
> ModeRNA with it. There were no changes to the master Bio.PDB branch -->
> for us everything OK.
>
> Details:
> ModeRNA (http://www.genesilico.pl/moderna) engineers RNA 3D structures and
> uses Bio.PDB for most of its operations: reading files,
> adding/copying/manipulating residues/atoms, superimposing structures,
> searching neighbors by KDTree, writing files.
>
> Right, the tests most probably did not depend directly on the code you
> changed, but as I understand it you wanted to make sure the branch didn't break
> anything by accident.
>

Thanks, Kristian! I didn't expect the patches to break anything, but it's
hard to be sure until someone else has tried it.

I've pushed the pdbfixes branch to Biopython's master branch on GitHub.

Cheers,
Eric


From biopython at maubp.freeserve.co.uk  Thu Jun 10 16:24:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 10 Jun 2010 17:24:20 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTimZShDqgJ__BO8sqvkJl7DBsLXS2iz-0ATW0saa@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com>
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com>
	<AANLkTilkILMvG5huqqZ2-rKMSQNqOAQGiirVO3rNBCwt@mail.gmail.com>
	<AANLkTimZShDqgJ__BO8sqvkJl7DBsLXS2iz-0ATW0saa@mail.gmail.com>
Message-ID: <AANLkTinpmWEFkXwHDQ9CeGft4UrzhUMvsJ-QDTRotwLk@mail.gmail.com>

On Wed, Jun 2, 2010 at 12:59 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> With that in mind, as I mentioned yesterday maybe we should just
> update the documentation to suggest using os.system() when you
> just need the return code and there is no stdin to worry about:
>

I've added a basic example to the tutorial now, but the potential
trouble is any output from the called tool will spew out at the
python prompt (if working at the terminal). This may or may not
be an issue. ClustalW for example is rather verbose.
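
If the noise is a problem, one option is to capture the tool's output with
the subprocess module rather than letting it reach the terminal. A minimal
sketch follows; the child command is a stand-in Python one-liner (an
assumption made so the example runs anywhere), not a real ClustalW call.

```python
import subprocess
import sys

# Stand-in for a chatty command line tool such as ClustalW.
cmd = [sys.executable, "-c", "print('lots of progress messages')"]

# Capture stdout and stderr instead of letting them reach the terminal.
child = subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
)
stdout, stderr = child.communicate()
return_code = child.returncode
```

The captured text is still available in stdout and stderr if the return code
indicates that something went wrong.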

Peter


From bugzilla-daemon at portal.open-bio.org  Thu Jun 10 18:18:41 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Jun 2010 14:18:41 -0400
Subject: [Biopython-dev] [Bug 3098] New: GenBank/EMBL parser breaks for
	between features at origin
Message-ID: <bug-3098-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3098

           Summary: GenBank/EMBL parser breaks for between features at
                    origin
           Product: Biopython
           Version: 1.54
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


I was testing Bio.SeqIO with a GenBank file gbpln1.seq which includes:

LOCUS       AB042240              134545 bp    DNA     circular PLN 02-MAY-2006
...
     misc_feature    134545^1
                     /standard_name="JLA"
                     /note="Junction IRA-LSC"
ORIGIN 
...

This is a "between" feature of length zero at the origin of this circular
genome. Normally between positions "start^end" have end=start+1 (using
one-based counting); the wrap-around case at the origin, where start is the
sequence length and end is 1, is a special case which the parser does not
allow for.

The same applies to EMBL files as well, e.g.
http://www.ebi.ac.uk/cgi-bin/expasyfetch?AB042240
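
The rule described above can be sketched as a small stand-alone function.
This is only an illustration of the check, not the parser code in Biopython;
the function name, its arguments, and the choice to report the origin case as
the sequence length (rather than zero) are all assumptions.

```python
def between_position(location, seq_length, circular=False):
    """Return the 0-based boundary index for a 1-based 'start^end' location."""
    start_text, end_text = location.split("^")
    start, end = int(start_text), int(end_text)
    if end == start + 1:
        # Usual case: the site lies between two adjacent bases.
        return start
    if circular and start == seq_length and end == 1:
        # Special case: the site straddles the origin of a circular sequence.
        return seq_length
    raise ValueError("Invalid between location %r" % location)


origin_site = between_position("134545^1", 134545, circular=True)
ordinary_site = between_position("10^11", 134545)
```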


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu Jun 10 18:35:48 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Jun 2010 14:35:48 -0400
Subject: [Biopython-dev] [Bug 3098] GenBank/EMBL parser breaks for between
	features at origin
In-Reply-To: <bug-3098-42@http.bugzilla.open-bio.org/>
Message-ID: <201006101835.o5AIZm0b025094@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3098


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-10 14:35 EST -------
Fixed,
http://github.com/biopython/biopython/commit/80aa43e5434316d151bca5916442a3429b8724e2




From eric.talevich at gmail.com  Thu Jun 10 19:18:38 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 10 Jun 2010 15:18:38 -0400
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To: <AANLkTimZShDqgJ__BO8sqvkJl7DBsLXS2iz-0ATW0saa@mail.gmail.com>
References: <AANLkTikSYJeTbiY_fnqNH33lnGLKMzO7k53amf8Pxb5z@mail.gmail.com> 
	<20100601132355.GU1054@sobchak.mgh.harvard.edu>
	<AANLkTinJplczWd6_m4sCmY8Gkh-AsjptvfAwa2TDa5M7@mail.gmail.com> 
	<AANLkTilkILMvG5huqqZ2-rKMSQNqOAQGiirVO3rNBCwt@mail.gmail.com> 
	<AANLkTimZShDqgJ__BO8sqvkJl7DBsLXS2iz-0ATW0saa@mail.gmail.com>
Message-ID: <AANLkTimKXGT6Of22aw0b1_2EEVPdXSzwSLhMuM9d4le1@mail.gmail.com>

On Wed, Jun 2, 2010 at 7:59 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

>
> Even if the Python documentation seems to be discouraging it,
> using os.system() seems simple, robust, and cross platform. We
> could even update the tutorial now and post it online - it should
> make some people's lives a little easier.
>

The Python docs claim os.system(cmd) is equivalent to subprocess.call(cmd,
shell=True):
http://docs.python.org/library/subprocess.html#replacing-os-system

As I understood it, the reason for usually skipping the shell on Unix
systems was for additional security -- the called program sees the same
thing either way.

Should we use this as a "teachable moment" involving the subprocess module
in the tutorial?
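
As a possible starting point for such an example, the sketch below runs the
same child process both ways and compares what it receives. The child is a
Python one-liner and the file name is made up; the shell variant relies on
shlex.quote, so it assumes a POSIX shell.

```python
import shlex
import subprocess
import sys

# Child process that simply reports the arguments it received.
code = "import sys; print(sys.argv[1:])"
args = [sys.executable, "-c", code, "my alignment.fasta"]

# 1. No shell: the argument list is handed to the program directly.
direct = subprocess.check_output(args, universal_newlines=True)

# 2. Via the shell (as os.system does): needs one correctly quoted string.
command = " ".join(shlex.quote(arg) for arg in args)
via_shell = subprocess.check_output(
    command, shell=True, universal_newlines=True
)
```

Both calls deliver the same argument to the child, but only the second one
depends on getting the quoting right, which is where the security concern
comes from.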

-Eric


From anaryin at gmail.com  Thu Jun 10 23:45:02 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 10 Jun 2010 18:45:02 -0500
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTikldxobdl9u2B2NDlYe2qpe74H1exiZxmGMcEnY@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com> 
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com> 
	<AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com> 
	<AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com> 
	<AANLkTikldxobdl9u2B2NDlYe2qpe74H1exiZxmGMcEnY@mail.gmail.com>
Message-ID: <AANLkTinYFfirod6vG2PHEQ2asMz7IfVQFw9spuH_N4E9@mail.gmail.com>

Hello all,

I'm having some issues dealing with this :x

I created a module Bio.Struct that has the following contents:

__init__.py
Protein.py
WWW/

The __init__.py file has a read() method that calls PDBParser and returns a
Structure object. So far so good I think. Then I added a method to
Bio.PDB.Structure more or less like this:

    def as_protein(self):
        from Bio.Struct.Protein import Protein
        prot = Protein(self)
        return prot

so when you call it you get a new object. Protein is a class that inherits
from Structure and that has the search_ss_bonds function.

I can make the new object get all the methods from Structure AND from
Protein, but when I try to execute search_ss_bonds, it fails because
child_list, a Structure attribute, comes up empty.. In fact, the whole SMCRA
object comes empty..

How do I effectively do the inheritance on the Protein class?

from Bio.PDB.Structure import Structure

class Protein(Structure):

    def __init__(self, protein):

        self = protein

This is what I last tried and doesn't work.. I've tried Structure.__init__,
and several other things but to no avail. I'm sure this is simple OOP but I
really can't understand that well how to do it ...

Care to give a hand to a friend in need? :)

Thanks in advance! By the way, I assume that if I got no comments on
anything else on the GSOC thread that I'm doing a perfect job :P Thanks for
that too :D

Best!

João [...] Rodrigues


From eric.talevich at gmail.com  Fri Jun 11 01:49:39 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 10 Jun 2010 21:49:39 -0400
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
	enhancements
In-Reply-To: <AANLkTinYFfirod6vG2PHEQ2asMz7IfVQFw9spuH_N4E9@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com> 
	<AANLkTin1nilGCBk3zuU4ZMf7vl0XdSlX8Ujai6_g4B-z@mail.gmail.com> 
	<AANLkTikj60J4uhtzzgTYos67aeH3eJ3xJF8-QsKq7qWS@mail.gmail.com> 
	<AANLkTim3hndnmdtvlXcBqo723U48FOmpd5AiESwqxM8P@mail.gmail.com> 
	<AANLkTilUjxzLPqKvfqki6wpznWKUSlQGJjJ_6x-cqhmS@mail.gmail.com> 
	<AANLkTikldxobdl9u2B2NDlYe2qpe74H1exiZxmGMcEnY@mail.gmail.com> 
	<AANLkTinYFfirod6vG2PHEQ2asMz7IfVQFw9spuH_N4E9@mail.gmail.com>
Message-ID: <AANLkTilZh3xH7nsZuGf8HgV2uTzv8eM6Omm4YmSfcQ4Y@mail.gmail.com>

On Thu, Jun 10, 2010 at 7:45 PM, João Rodrigues <anaryin at gmail.com> wrote:

> Hello all,
>
> I'm having some issues dealing with this :x
>
> I created a module Bio.Struct that has the following contents:
>
> __init__.py
> Protein.py
> WWW/
>
> The __init__.py file has a read() method that calls PDBParser and returns a
> Structure object. So far so good I think. Then I added a method to
> Bio.PDB.Structure more or less like this:
>
>     def as_protein(self):
>
>         from Bio.Struct.Protein import Protein
>         prot = Protein(self)
>         return prot
>
> so when you call it you get a new object. Protein is a class that inherits
> from Structure and that has the search_ss_bonds function.
>
> I can make the new object get all the methods from Structure AND from
> Protein, but when I try to execute search_ss_bonds, it fails because
> child_list, a Structure attribute, comes up empty.. In fact, the whole SMCRA
> object comes empty..
>
> How do I effectively do the inheritance on the Protein class?
>
> from Bio.PDB.Structure import Structure
>
> class Protein(Structure):
>
>     def __init__(self, protein):
>
>         self = protein
>
> This is what I last tried and doesn't work.. I've tried Structure.__init__,
> and several other things but to no avail. I'm sure this is simple OOP but I
> really can't understand that well how to do it ...
>
> Care to give a hand to a friend in need? :)
>
> Thanks in advance! By the way, I assume that if I got no comments on
> anything else on the GSOC thread that I'm doing a perfect job :P Thanks for
> that too :D
>
> Best!
>
> João [...] Rodrigues
>

Hi João,

You have it mostly correct, but you need to call the parent class's
constructor, too.

Here's the constructor for Structure:

    def __init__(self, id):
        self.level="S"
        Entity.__init__(self, id)

And here it is for Entity:

    def __init__(self, id):
        self.id=id
        self.full_id=None
        self.parent=None
        self.child_list=[]
        self.child_dict={}
        # Dictionary that keeps addictional properties
        self.xtra={}

See the problem? Every subclass of Entity takes an "id" argument and sets
the other attributes separately.

In Bio.Phylo, I used another convention for converting an object of one type
to a sub-class of the original type, as you're doing here. Rather than
change the arguments to the constructor (which could have weird
side-effects), I added a class method in the target class:

@classmethod
def from_structure(cls, struct):
    # Instantiate a Protein with the structure's id
    # Assign the other attributes individually from struct


Then Structure.as_protein() becomes fairly simple. Alternatively, you could
skip implementing Protein.from_structure() and do the attribute reassignment
in as_protein(). Or, covering all the options, implement from_structure()
but not as_protein(), and let the user figure it out.
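
A minimal sketch of the from_structure() idea follows. The Entity and
Structure classes here are stand-ins that only mirror the two constructors
quoted above, so the example runs without Biopython; the attribute list
copied in from_structure() is an assumption based on those constructors,
not the final Bio.Struct code.

```python
# Stand-in classes so the sketch runs without Biopython installed; they
# mirror the Entity and Structure constructors quoted above.
class Entity(object):
    def __init__(self, id):
        self.id = id
        self.full_id = None
        self.parent = None
        self.child_list = []
        self.child_dict = {}
        self.xtra = {}


class Structure(Entity):
    def __init__(self, id):
        self.level = "S"
        Entity.__init__(self, id)

    def as_protein(self):
        # Thin convenience wrapper; the conversion logic lives in Protein.
        return Protein.from_structure(self)


class Protein(Structure):
    @classmethod
    def from_structure(cls, struct):
        # Instantiate a Protein with the structure's id ...
        prot = cls(struct.id)
        # ... then assign the other attributes individually from struct.
        # These are shared references, not copies.
        prot.full_id = struct.full_id
        prot.parent = struct.parent
        prot.child_list = struct.child_list
        prot.child_dict = struct.child_dict
        prot.xtra = struct.xtra
        return prot


s = Structure("1ABC")
s.child_list.append("model 0")  # placeholder for a real Model object
p = s.as_protein()
print(type(p).__name__, p.id, p.child_list)
```

The constructor signature stays the same as Structure's, so any code that
builds entities from an id keeps working on Protein as well.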

Do you think it would also be useful if as_protein() or from_structure()
dropped any non-protein molecules during the conversion, and raised an error
if nothing's left? Or would that cause more problems than it solves?

Best,
Eric


From biopython at maubp.freeserve.co.uk  Mon Jun 14 14:44:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 14 Jun 2010 15:44:50 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
Message-ID: <AANLkTik6TlYQER9MOEf56NZi0bNLANe_uJK3qjSKlQVG@mail.gmail.com>

Hi all,

You may recall late last year I posted about adding a reverse
complement method to the SeqRecord, and addition support:
http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html

SeqRecord addition was included in Biopython 1.53, but
not the reverse_complement() method - which is something
I wanted to use again today to reverse complement an
annotated GenBank file and have all the SeqFeature
locations flipped for me. I've rescued my old code and
its unit tests and created a new branch for it:
http://github.com/peterjc/biopython/commits/seqrecord-rc

As I said at the end of last year, I think the general idea of
a SeqRecord reverse_complement() method is nice but the
details of handling the annotation are tricky. When we
discussed slicing and addition, it was agreed that we
should be cautious to avoid blindly transferring annotation
inappropriately. The code on this branch allows the user to
choose for each annotation type if it should be dropped
(False), kept (True) or set to a supplied new value. The
docstring has examples of how this works (which double
as doctests).
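
For readers who have not looked at the branch, here is a rough,
self-contained sketch of the drop/keep/replace convention and of flipping a
feature location. It uses plain dicts and tuples rather than SeqRecord and
SeqFeature, and the function and argument names are illustrative
assumptions, not the code on the branch.

```python
def resolve(option, old_value, default):
    """Apply the drop/keep/replace convention to one kind of annotation."""
    if option is True:
        return old_value      # keep the existing annotation
    if option is False:
        return default        # drop it
    return option             # anything else is the supplied new value


def flip_location(start, end, strand, length):
    """Map a 0-based, half-open feature location onto the reverse strand."""
    return length - end, length - start, -strand


def reverse_complement_record(record, id=False, features=True):
    """Reverse complement a toy record held in a plain dict."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    seq = record["seq"]
    flipped = [flip_location(start, end, strand, len(seq))
               for (start, end, strand) in record["features"]]
    return {
        "seq": "".join(complement[base] for base in reversed(seq)),
        "id": resolve(id, record["id"], None),
        "features": resolve(features, flipped, []),
    }


record = {"seq": "AACGT", "id": "example", "features": [(0, 2, 1)]}
rc = reverse_complement_record(record, id="example_rc")
print(rc["seq"], rc["id"], rc["features"])
```

The feature that covered "AA" at positions 0 to 2 on the forward strand ends
up at positions 3 to 5 on the reverse strand, where the new sequence reads
"TT".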

Jose - I've CC'd you since I know you wrote your own
SeqRecord subclass with a complement() method (but not
a reverse_complement() method) for Franklin. I'm curious
about this choice.

Cedar - I've CC'd you since you asked about this kind of
thing last year:
http://lists.open-bio.org/pipermail/biopython/2009-June/005307.html

Regards,

Peter


From biopython at maubp.freeserve.co.uk  Mon Jun 14 14:50:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 14 Jun 2010 15:50:31 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTinJE4eBPO8ydgrunBzWZqeN5Nw4nyuo8GEaRTzr@mail.gmail.com>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
Message-ID: <AANLkTikC_fuVP4kDMckW0EqIq0oZWwcHYy-TpbnM2hCl@mail.gmail.com>

On Mon, Jun 14, 2010 at 3:43 PM, Kristian Rother <krother at genesilico.pl> wrote:
>
>
> Hi Peter,
>
> just digesting BioPy mails from last week.
>
>>> Where should the str subclass for secondary structures that the parsers
>>> create go? Could it be Bio.Struct.RNA?
>>
>> You don't think plain strings in the SeqRecord's letter_annotation
>> dict would be enough?
>
> Not really - base pairing makes most normal string functions useless.
>
>
>> Assuming you do need something then
>> perhaps under Bio.Seq or Bio.SeqUtils might be worth considering
>> as alternatives to Bio.Struct.RNA.
>
> OK, I'll try that.
>
> Thanks,
>   Kristian
>
>

Hi Kristian,

Could you explain a little more about why plain strings wouldn't be
suitable here? What kind of things do you want to do with them?

Peter


From krother at rubor.de  Mon Jun 14 14:55:21 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 16:55:21 +0200
Subject: [Biopython-dev] Module reorganization for upcoming Bio.PDB
 enhancements
Message-ID: <1cf21a9224e1cd3ad4c8e2853d99100b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXwtdXg==-webmailer2@server03.webmailer.hosteurope.de>


Hi guys,

I'm fine with your ideas regarding different wrappers for
Bio.PDB.Structure objects discussed last week, in particular:

- creating Bio.Struct.RNA or Bio.PDB.RNA with a Structure instance.
- having a structure.as_rna() helper method as suggested by Eric (but this
is not a must).

I'd like to take what Joao does for proteins and add some basic equivalent
for RNA structures shortly after.

Best Regards,
    Kristian


Quoting Thomas Hamelryck <thomas.hamelryck at gmail.com>:

> Hi,
>
> On Tue, Jun 8, 2010 at 7:10 PM, João Rodrigues <anaryin at gmail.com> wrote:
>
>>
>> from Bio.PDB import Protein
>> structure = Protein('1ABC.pdb')
>> structure.search_ss_bonds()
>>
>
> Indeed, that would run into problems for complexes where proteins, RNA,
> DNA, etc. occur in the same file. It makes much more sense to have a Structure
> centred approach:
>
> proteins=Protein(structure)
> chains=proteins.get_chains()
> chain_a=chains["A"]
> polypeptides=chain_a.get_peptides()
>
> rnas=RNA(structure)
>
> etc.
>
> -Thomas


From krother at rubor.de  Mon Jun 14 15:01:48 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 17:01:48 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTikC_fuVP4kDMckW0EqIq0oZWwcHYy-TpbnM2hCl@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTinJE4eBPO8ydgrunBzWZqeN5Nw4nyuo8GEaRTzr@mail.gmail.com>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
	<AANLkTikC_fuVP4kDMckW0EqIq0oZWwcHYy-TpbnM2hCl@mail.gmail.com>
Message-ID: <beb3f08b05db7a4112966711a97b98e0-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXw9fVw==-webmailer2@server03.webmailer.hosteurope.de>


Hi,

much of what I do with RNA secondary structures strongly depends on
iterating base pairs, e.g.:

>>> sec = Secstruc("(((...)).)")
>>> for bp in sec.basepairs():
...     print bp
(0, 9)
(1, 7)
(2, 6)

also:
>>> sec.get_helices()
>>> sec.get_bulges()
>>> sec.get_hairpins()
>>> sec.contains_pseudoknot()
.. and a couple of similar ones.


The reason why I'd prefer to have something more than a string as a sec
feature is that I wouldn't want to do all the time:

sec = Secstruc(my_seq['secondary_structure'])
sec.get_helices()

but

my_seq['secondary_structure'].get_helices()

instead.
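
To make the idea concrete, here is a small sketch of a str subclass with a
basepairs() method. It is an illustrative stand-in written for this thread,
not the actual Secstruc class, and it assumes plain dot-bracket notation
with a single bracket type (so no pseudoknots).

```python
class Secstruc(str):
    """Dot-bracket secondary structure string (illustrative stand-in)."""

    def basepairs(self):
        """Return (i, j) index pairs, sorted by opening position."""
        stack = []
        pairs = []
        for i, char in enumerate(self):
            if char == "(":
                stack.append(i)
            elif char == ")":
                if not stack:
                    raise ValueError("unbalanced ')' at position %d" % i)
                # The most recent unmatched '(' pairs with this ')'.
                pairs.append((stack.pop(), i))
        if stack:
            raise ValueError("unbalanced '(' at position %d" % stack[-1])
        return sorted(pairs)


sec = Secstruc("(((...)).)")
for bp in sec.basepairs():
    print(bp)
```

Because the class is still a str, it can be stored anywhere a plain string
is accepted while carrying the extra methods along with it.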

Best Regards,
   Kristian


From krother at rubor.de  Mon Jun 14 15:13:19 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 17:13:19 +0200
Subject: [Biopython-dev] creating Protein(structure) object
Message-ID: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>


Hi Joao,

what you are describing is the classical Decorator Pattern (see
http://en.wikipedia.org/wiki/Decorator_pattern). In the books, they say
that the Decorator (Protein) must implement all methods of the decorated
object (Structure).
Of course, for a class as big as Bio.PDB.Structure, this sucks a lot. I
see two alternatives:

(1) override Protein.__getattr__(self, attr) to return self.struc.attr if
it exists. I tried this recently and it worked fine until the decorated
class used Python properties, when it started getting ugly again.

(2) have Protein inherit from Structure, and grab all the children from
the structure class, e.g.:

class Protein(Structure):
    def __init__(self, struc):
        """
        The given Structure instance becomes a Protein.
        """
        Structure.__init__(self, struc.id)
        for child in struc.child_list:
            # eventually check if it's a protein chain.
            self.add(child)
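
For comparison, alternative (1) could look roughly like the sketch below.
The Structure class is a stand-in with just enough behaviour to delegate
to, and search_ss_bonds() has a placeholder body; both are assumptions made
for illustration only.

```python
class Structure(object):
    """Stand-in for Bio.PDB.Structure, just enough to delegate to."""

    def __init__(self, id):
        self.id = id
        self.child_list = []

    def get_list(self):
        return self.child_list


class Protein(object):
    """Wraps a structure instead of inheriting from it."""

    def __init__(self, struc):
        self.struc = struc

    def __getattr__(self, attr):
        # Only called when normal lookup on Protein fails, so methods
        # defined on Protein itself take precedence.
        if attr == "struc":
            # Guard against infinite recursion if struc is not set yet.
            raise AttributeError(attr)
        return getattr(self.struc, attr)

    def search_ss_bonds(self):
        # Placeholder body; the real search is outside this sketch.
        return []


s = Structure("1ABC")
p = Protein(s)
print(p.id, p.get_list() is s.child_list, isinstance(p, Structure))
```

One trade-off of the wrapper: isinstance(p, Structure) is False, so code
that type-checks for Structure will not accept the wrapped object.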


Any comments?
    Kristian


From biopython at maubp.freeserve.co.uk  Mon Jun 14 15:23:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 14 Jun 2010 16:23:25 +0100
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <beb3f08b05db7a4112966711a97b98e0-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXw9fVw==-webmailer2@server03.webmailer.hosteurope.de>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<ca4f3645f6dbd8a1806adc22556af35b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVtdXgtYXg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTinJE4eBPO8ydgrunBzWZqeN5Nw4nyuo8GEaRTzr@mail.gmail.com>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
	<AANLkTikC_fuVP4kDMckW0EqIq0oZWwcHYy-TpbnM2hCl@mail.gmail.com>
	<beb3f08b05db7a4112966711a97b98e0-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXw9fVw==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID: <AANLkTinkppj-wfBPlBtNwub8edWzks3TfmcwKzKDJml5@mail.gmail.com>

On Mon, Jun 14, 2010 at 4:01 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi,
>
> much of what I do with RNA secondary structures strongly depends on
> iterating base pairs, e.g..
>
>>>> sec = Secstruc("(((...)).)")
>>>> for bp in sec.basepairs():
>>>> ...     print bp
> (0, 9)
> (1, 7)
> (2, 6)
>
> also:
>>>> sec.get_helices()
>>>> sec.get_bulges()
>>>> sec.get_hairpins()
>>>> sec.contains_pseudoknot()
> .. and a couple of similar ones.
>
> The reason why I'd prefer to have something more than a string as a sec
> feature is that I wouldn't want to do all the time:
>
> sec = Secstruc(my_seq['secondary_structure'])
> sec.get_helices()
>
> but
>
> my_seq['secondary_structure'].get_helices()
>
> instead.
>
> Best Regards,
>   Kristian

That helped - thanks. Does your Secstruc object behave like a Python
sequence (string/list/tuple) in that it has a length and can be sliced (as
if acting on the string representation)? If so then it should be fine to
store in the SeqRecord's letter_annotation dictionary.
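
One detail worth checking for any str subclass: len() and slicing work out
of the box, but a slice comes back as a plain str unless __getitem__ wraps
it again. The sketch below shows one way to do that; the class is an
illustrative stand-in, not Kristian's implementation.

```python
class Secstruc(str):
    """Illustrative stand-in; not the class from Kristian's code."""

    def __getitem__(self, key):
        result = str.__getitem__(self, key)
        # Slices of a str subclass are plain str unless re-wrapped.
        if isinstance(key, slice):
            return type(self)(result)
        return result


sec = Secstruc("(((...)).)")
print(len(sec))
print(type(sec[2:7]).__name__, sec[2:7])
```

A slice can of course cut through a base pair and leave unbalanced
brackets, so methods on the sliced object need to cope with that.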

Peter


From krother at rubor.de  Mon Jun 14 15:41:05 2010
From: krother at rubor.de (Kristian Rother)
Date: Mon, 14 Jun 2010 17:41:05 +0200
Subject: [Biopython-dev] upcoming Bio.PDB enhancements - RNA
In-Reply-To: <AANLkTinkppj-wfBPlBtNwub8edWzks3TfmcwKzKDJml5@mail.gmail.com>
References: <AANLkTinx2yfz2gTbqQ-6VwyLRXWygiZbNzXJ53T0_fl0@mail.gmail.com>
	<AANLkTilGwG5N8PM61leFpB8dTc7sxVWpc9UdgAKigaOq@mail.gmail.com>
	<312aded59e9223ed4d60fcf17c93ee98-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxVWQ1eXg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTikhj-1wwXGIbwxiqTrz7aLohzhwHi_vgR4EEFob@mail.gmail.com>
	<AANLkTimtoy9zWv74_KwwRhU-UQny1u0uzHQsy894wXJ8@mail.gmail.com>
	<6d2b4f13da6deb23b8078ffd5fd3e7b1-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRVxdWABbWg==-webmailer2@server06.webmailer.hosteurope.de>
	<AANLkTinJE4eBPO8ydgrunBzWZqeN5Nw4nyuo8GEaRTzr@mail.gmail.com>
	<20100614164348.186267pfu17v2ntw@horde.genesilico.pl>
	<AANLkTikC_fuVP4kDMckW0EqIq0oZWwcHYy-TpbnM2hCl@mail.gmail.com>
	<beb3f08b05db7a4112966711a97b98e0-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XXw9fVw==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTinkppj-wfBPlBtNwub8edWzks3TfmcwKzKDJml5@mail.gmail.com>
Message-ID: <3e6714450418534d741476aa0b64b374-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1WWAhZWg==-webmailer2@server03.webmailer.hosteurope.de>


Hi Peter,

> That helped - thanks. Does your Secstruc object behave like a Python
> sequence (string/list/tuple) in that it has a length and can be sliced

Yes, it does.

> If so then it should be fine to
> store in the SeqRecord's letter_annotation dictionary.

Best,
  Kristian


From anaryin at gmail.com  Mon Jun 14 17:58:56 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 14 Jun 2010 12:58:56 -0500
Subject: [Biopython-dev] creating Protein(structure) object
In-Reply-To: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID: <AANLkTino0pTTYZZaZ9w35ig04AIa01YSRHcpunz91_DU@mail.gmail.com>

Hello Kristian,

The way I'm doing it as a workaround is:

class Protein(Structure):

    def __init__(self, protein):

        Structure.__init__(self, protein.id)

        self.full_id = protein.full_id
        self.child_list = protein.child_list
        self.child_dict = protein.child_dict
        self.parent = protein.parent
        self.xtra = protein.xtra

It works because every method I'm using deepcopies this anyway..

Adding the children one by one seems the correct way to go, but it won't
copy the headers... do we want that?

Thanks :)

J


From eric.talevich at gmail.com  Mon Jun 14 20:27:24 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 14 Jun 2010 16:27:24 -0400
Subject: [Biopython-dev] creating Protein(structure) object
In-Reply-To: <AANLkTino0pTTYZZaZ9w35ig04AIa01YSRHcpunz91_DU@mail.gmail.com>
References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTino0pTTYZZaZ9w35ig04AIa01YSRHcpunz91_DU@mail.gmail.com>
Message-ID: <AANLkTilwgdOfGyRWz57HA5aPUXzH_vfXh_xHATlejnUn@mail.gmail.com>

Hi guys,

Another convention with the Decorator pattern is to ensure that all of the
method arguments that existed in the original class are also present in the
decorated one. This includes the constructor. Decoration simply adds another
feature to whatever was already there.


João Rodrigues <anaryin at gmail.com> wrote:

> Hello Kristian,
>
> The way I'm doing it as a workaround is:
>
> class Protein(Structure):
>
>     def __init__(self, protein):
>
>         Structure.__init__(self, protein.id)
>
>         self.full_id = protein.full_id
>         self.child_list = protein.child_list
>         self.child_dict = protein.child_dict
>         self.parent = protein.parent
>         self.xtra = protein.xtra
>


The way the constructors of Structure and other Entity subclasses work is to
create a new object with the appropriate, empty attributes -- i.e. no
children. Other code then attaches children to that instance.

To decorate a Structure with Protein-specific functionality, I would
consider:

1. The Entity constructor takes an ID, and creates empty containers for
child Entities. (Models, in this case.) So Protein.__init__ needs to start
like:

class Protein(Structure):
    def __init__(self, id):  # take any keyword arguments?
        Structure.__init__(self, id)
        # handle any keyword arguments here

2. We need to be able to convert an existing Structure to a new Protein.
That's new functionality, so it needs either a keyword argument in __init__,
or a separate method or function. If we add a keyword argument to __init__,
then the implementation is basically two completely different operations
depending on if a Structure was passed or not. Plus, there's still that 'id'
argument to deal with.

3. Instantiating a Protein directly would mean importing the
Bio.Struct.Protein module manually, in addition to "from Bio import Struct".
More to the point, Bio.Struct.Protein consists of lower-level functionality
that a casual Struct user shouldn't have to dig into, as long as
Structure.as_protein() exists. So there's no value in making
Protein.__init__ "do what I mean" at the expense of clarity in the code.
Better to make the code very obvious and explicit here, and focus on API
prettiness from a different angle.

4. The next most convenient place for Structure-to-Protein conversion is on
the Structure class. This presents a nice API that will be sufficient for
most users:

from Bio import Struct
prot = Struct.read('1ABC.pdb').as_protein()

But, going back to OOP principles, the Structure class shouldn't need to
know anything about the Protein class's internals -- though it's free to
call any public method and make things nicer for the user. So, finally, we
need a class method* on Protein that Structure.as_protein() can call.

Hence, Protein.from_structure().

[*] A class method can be called without first instantiating the class.
Since we're trying to construct a new object here, we need to be able to
call this Protein method before the Protein object exists. No worries, just
use the @classmethod decorator.


> It works because every method I'm using deepcopies this anyway..
>

If someone modifies the original Structure object after you've created a
Protein this way -- e.g. renumbering residues, or with their own function --
it will also modify the Protein object, since lists and dicts are shared. Is
this what you want?

If you're concerned about memory usage, you can also look at implementing
__deepcopy__.
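
A quick demonstration of the sharing issue, using a stand-in Entity class
and strings in place of real chain objects (both are placeholders for
illustration):

```python
import copy


class Entity(object):
    """Stand-in with only the attributes needed for the demonstration."""

    def __init__(self, id):
        self.id = id
        self.child_list = []


original = Entity("1ABC")
original.child_list.append("chain A")

shared = Entity(original.id)
shared.child_list = original.child_list          # same list object

independent = Entity(original.id)
independent.child_list = copy.deepcopy(original.child_list)

# A later change to the original shows up in the shared copy only.
original.child_list.append("chain B")
print(shared.child_list)
print(independent.child_list)
```

The object that borrowed the list sees "chain B"; the one that received a
deep copy does not.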


> Adding the children one by one seems the correct way to go, but it won't
> copy the headers... do we want that?
>

Your code for copying the Structure's children looks right to me, except I
think it's best to be a little paranoid with Python lists and make deep copies
anyway. I suppose you could also copy any header info that's relevant to
proteins, using the same approach.

Best,
Eric


From anaryin at gmail.com  Tue Jun 15 03:06:03 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 14 Jun 2010 22:06:03 -0500
Subject: [Biopython-dev] creating Protein(structure) object
In-Reply-To: <AANLkTilwgdOfGyRWz57HA5aPUXzH_vfXh_xHATlejnUn@mail.gmail.com>
References: <851b929e1adfe21a7a4d7e938c0bed7b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl1XUAtWVg==-webmailer2@server03.webmailer.hosteurope.de>
	<AANLkTino0pTTYZZaZ9w35ig04AIa01YSRHcpunz91_DU@mail.gmail.com> 
	<AANLkTilwgdOfGyRWz57HA5aPUXzH_vfXh_xHATlejnUn@mail.gmail.com>
Message-ID: <AANLkTil0k2HJEfXVc0_xUv39qKbQxk9oPAeiegO6aVVO@mail.gmail.com>

Ok, thanks for the long explanation!

I'll merge what you and Kristian said and come up with a better interface.
As is, I call it like this:

# By the way, I added a trick to avoid the mandatory name of the structure
s = Struct.read("1abc.pdb")
p = s.as_protein()

Best

J


From jblanca at btc.upv.es  Tue Jun 15 05:55:45 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 15 Jun 2010 07:55:45 +0200
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <AANLkTik6TlYQER9MOEf56NZi0bNLANe_uJK3qjSKlQVG@mail.gmail.com>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
	<AANLkTik6TlYQER9MOEf56NZi0bNLANe_uJK3qjSKlQVG@mail.gmail.com>
Message-ID: <201006150755.45162.jblanca@btc.upv.es>

On Monday 14 June 2010 16:44:50 Peter wrote:
> Hi all,
>
> You may recall late last year I posted about adding a reverse
> complement method to the SeqRecord, and addition support:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html
> http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html
>
> SeqRecord addition was included in Biopython 1.53, but
> not the reverse_complement() method - which is something
> I wanted to use again today to reverse complement an
> annotated GenBank file and have all the SeqFeature
> locations flipped for me. I've rescued my old code and
> its unit tests and created a new branch for it:
> http://github.com/peterjc/biopython/commits/seqrecord-rc
>
> As I said at the end of last year, I think the general idea of
> a SeqRecord reverse_complement() method is nice but the
> details of handling the annotation are tricky. When we
> discussed slicing and addition, it was agreed that we
> should be cautious to avoid blindly transferring annotation
> inappropriately. The code on this branch allows the user to
> choose for each annotation type if it should be dropped
> (False), kept (True) or set to a supplied new value. The
> docstring has examples of how this works (which double
> as doctests).

Having a reverse_complement method would be useful for us. But it could be
quite tricky to reverse complement some features. For instance, we have SNP
features that include a reference nucleotide. We would have to complement that
nucleotide too.

Regards,


-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)


From biopython at maubp.freeserve.co.uk  Tue Jun 15 09:08:14 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Jun 2010 10:08:14 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <201006150755.45162.jblanca@btc.upv.es>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<320fb6e00910010204x5035a6edjde030e3072f5f91b@mail.gmail.com>
	<AANLkTik6TlYQER9MOEf56NZi0bNLANe_uJK3qjSKlQVG@mail.gmail.com>
	<201006150755.45162.jblanca@btc.upv.es>
Message-ID: <AANLkTikDBQHR2DJ2wKoBfMWsj8VifkgOeBkpdFfE5KVG@mail.gmail.com>

On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
>
> Having a reverse_complement method would be useful for us. But it could be
> quite tricky to reverse complement some features. For instance we have SNP
> features that include a reference nucleotide. We would have to complement that
> nucleotide too.
>

Could you give an example? I assume you are talking about the annotation
of the feature (i.e. the qualifiers dictionary of a SeqFeature object).

Peter


From jblanca at btc.upv.es  Tue Jun 15 09:23:27 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 15 Jun 2010 11:23:27 +0200
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <AANLkTikDBQHR2DJ2wKoBfMWsj8VifkgOeBkpdFfE5KVG@mail.gmail.com>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<201006150755.45162.jblanca@btc.upv.es>
	<AANLkTikDBQHR2DJ2wKoBfMWsj8VifkgOeBkpdFfE5KVG@mail.gmail.com>
Message-ID: <201006151123.27158.jblanca@btc.upv.es>

On Tuesday 15 June 2010 11:08:14 Peter wrote:
> On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> > Having a reverse_complement method would be useful for us. But it could
> > be quite tricky to reverse complement some features. For instance we have
> > SNP features that include a reference nucleotide. We would have to
> > complement that nucleotide too.
>
> Could you give an example? I assume you are talking about the annotation
> of the feature (i.e. the qualifiers dictionary of a SeqFeature object).

That is right; in some instances the qualifiers should be modified. For
instance, if we have an ORF with a qualifier 'forward':True, it should be
changed. I don't think this change can be done automatically.


-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)


From biopython at maubp.freeserve.co.uk  Tue Jun 15 09:42:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Jun 2010 10:42:47 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
In-Reply-To: <201006151123.27158.jblanca@btc.upv.es>
References: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>
	<201006150755.45162.jblanca@btc.upv.es>
	<AANLkTikDBQHR2DJ2wKoBfMWsj8VifkgOeBkpdFfE5KVG@mail.gmail.com>
	<201006151123.27158.jblanca@btc.upv.es>
Message-ID: <AANLkTili8DK7OJqeqtEgY6dqnGWUEYpOZhWYJC6dWPYi@mail.gmail.com>

On Tue, Jun 15, 2010 at 10:23 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> On Tuesday 15 June 2010 11:08:14 Peter wrote:
>> On Tue, Jun 15, 2010 at 6:55 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
>> > Having a reverse_complement method would be useful for us. But it could
>> > be quite tricky to reverse complement some features. For instance we have
>> > SNP features that include a reference nucleotide. We would had to
>> > complement that nucleotide too.
>>
>> Could you give an example? I assume you are talking about the annotation
>> of the feature (i.e. the qualifiers dictionary of a SeqFeature object).
>
> That is right; in some instances the qualifiers should be modified. For
> instance, if we have an ORF with a qualifier 'forward':True, it should be
> changed. I don't think this change can be done automatically.

Yes, that sort of thing would be very difficult to do automatically. We come
back to the question of what the default should be - blindly copy, or
just drop this information. I would say for most feature annotation (and
I am thinking about GenBank and EMBL style files here) there isn't
anything strand specific to worry about, so in general copying is fine.
Clearly this is not a safe assumption for SNP features.
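
Until there is a general answer, a strand-specific qualifier could be fixed
by hand after the reverse complement. The helper below is only a sketch of
that manual step: the qualifier name "reference" is made up for the
example, and the qualifiers are a plain dict of strings rather than a real
SeqFeature's qualifiers.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_qualifier(qualifiers, key):
    """Return a copy of qualifiers with one nucleotide entry reverse complemented."""
    fixed = dict(qualifiers)
    if key in fixed:
        # For a single nucleotide this is just the complement.
        fixed[key] = "".join(COMPLEMENT[base] for base in reversed(fixed[key]))
    return fixed


snp = {"reference": "A", "note": "example SNP"}
print(complement_qualifier(snp, "reference"))
```

The original dict is left untouched, so the forward-strand annotation stays
available to the caller.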

Peter


From krother at rubor.de  Tue Jun 15 14:06:52 2010
From: krother at rubor.de (Kristian Rother)
Date: Tue, 15 Jun 2010 16:06:52 +0200
Subject: [Biopython-dev] RNA Alphabet: request for comments
Message-ID: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de>


Hi,

I've committed a proof-of-concept implementation of how modified RNA bases
could be made compatible with Biopython Alphabets. Comments are very
welcome, especially because I had to change two lines in the Seq class to
make it work.

The code can be viewed on:
http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa
(on github: krother/biopython, branch rna_alphabet).

The two main classes are:
RNAAlphabetEntry(str) that contains different abbreviations for one base.
and
ModifiedRNAString(str) that behaves like a string except that it iterates
through RNAAlphabetEntry objects.

Thus, you can do:

>>> from Bio.Alphabet.ModifiedRNAAlphabet import modified_rna
>>> from Bio.Seq import Seq
>>> from Bio.RNA.ModifiedRNAString import ModifiedRNAString
>>>
>>> mod_seq = ModifiedRNAString('AA:"A')
>>> seq = Seq(mod_seq, modified_rna)
>>> for char in seq:
...     print char
adenosine
adenosine
2-O-methyladenosine
1-methyladenosine
adenosine

(see Unit test for details).
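
For readers without the branch checked out, the idea can be approximated in a
few lines of present-day Python 3. This is a rough sketch and not the code
from the commit: the class bodies, the long_name attribute and the NAMES table
are assumptions, with the three symbol-to-name pairs taken from the output
shown above.

```python
# Symbol-to-name pairs as they appear in the example output above.
NAMES = {
    "A": "adenosine",
    ":": "2-O-methyladenosine",
    '"': "1-methyladenosine",
}


class RNAAlphabetEntry(str):
    """A one-character string that also remembers its long name (sketch)."""

    def __new__(cls, symbol, long_name):
        obj = super().__new__(cls, symbol)
        obj.long_name = long_name
        return obj


class ModifiedRNAString(str):
    """Behaves like a string, but iterates through RNAAlphabetEntry objects."""

    def __iter__(self):
        for symbol in str.__iter__(self):
            yield RNAAlphabetEntry(symbol, NAMES[symbol])


for entry in ModifiedRNAString('AA:"A'):
    print(entry.long_name)
```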

Best Regards,
    Kristian


From biopython at maubp.freeserve.co.uk  Tue Jun 15 14:46:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Jun 2010 15:46:10 +0100
Subject: [Biopython-dev] RNA Alphabet: request for comments
In-Reply-To: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de>
References: <485134f2f1ebae4701d6fbcdfcdee3ee-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5UWABeXQ==-webmailer2@server02.webmailer.hosteurope.de>
Message-ID: <AANLkTimG8g7OvFhTmyUL-rkQD5QAqp3XAK7QMRLL0Qbb@mail.gmail.com>

On Tue, Jun 15, 2010 at 3:06 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi,
>
> I've committed a proof-of-concept implementation of how modified RNA bases
> could be made compatible with Biopython Alphabets. Comments are very
> welcome, especially because I had to change two lines in the Seq class to
> make it work.
>
> The code can be viewed on:
> http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa
> (on github: krother/biopython, branch rna_alphabet).
>
> The two main classes are:
> RNAAlphabetEntry(str) that contains different abbreviations for one base.
> and
> ModifiedRNAString(str) that behaves like a string except that it iterates
> through RNAAlphabetEntry objects.
>

Why not create a Seq subclass instead of your class ModifiedRNAString(str)?
This would then implement suitable (reverse) complement etc.

I would also have __iter__ and __getitem__ for a single letter return an
instance of RNAAlphabetEntry (which would act like a single character string).
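
A possible shape for that indexing behaviour is sketched below in Python 3. It
deliberately does not subclass Biopython's Seq so that it runs on its own; the
class names Entry and ModifiedRNASeq, the long_name attribute and the NAMES
table are assumptions for illustration only.

```python
class Entry(str):
    """Single-character string carrying a long name (hypothetical)."""

    def __new__(cls, symbol, long_name="unknown"):
        obj = super().__new__(cls, symbol)
        obj.long_name = long_name
        return obj


class ModifiedRNASeq:
    """Stand-in for a Seq subclass; only the indexing behaviour is shown."""

    NAMES = {"A": "adenosine", ":": "2-O-methyladenosine"}

    def __init__(self, data):
        self._data = str(data)

    def __len__(self):
        return len(self._data)

    def __str__(self):
        return self._data

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slices stay sequences, as with an ordinary Seq.
            return ModifiedRNASeq(self._data[index])
        symbol = self._data[index]
        # A single position returns an entry that still acts like a string.
        return Entry(symbol, self.NAMES.get(symbol, "unknown"))

    def __iter__(self):
        return (self[i] for i in range(len(self)))


seq = ModifiedRNASeq("AA:A")
print(seq[2], seq[2].long_name, str(seq[1:3]))
```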

Peter


From bugzilla-daemon at portal.open-bio.org  Tue Jun 15 16:23:00 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 15 Jun 2010 12:23:00 -0400
Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord?
In-Reply-To: <bug-3060-42@http.bugzilla.open-bio.org/>
Message-ID: <201006151623.o5FGN0K6028619@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3060


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-15 12:22 EST -------
Patch applied to this branch:
http://github.com/peterjc/biopython/tree/seqrecord-rc


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From krother at rubor.de  Wed Jun 16 08:32:29 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 16 Jun 2010 10:32:29 +0200
Subject: [Biopython-dev] RNA Alphabet: request for comments
Message-ID: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>


Hi Peter,

> Why not create a Seq subclass instead of your class ModifiedRNAString(str)?

This turned out to be a lot simpler. Worked right away. New commit at:

http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70

more comments welcome.

Next steps from my side would be:

1) add all modifications to the Alphabet.
2) add some RNA-specific methods.
3) add more tests.
4) sync with latest master branch.
5) request code merge.

Best regards,
     Kristian


Quoting Peter <biopython at maubp.freeserve.co.uk>:

> On Tue, Jun 15, 2010 at 3:06 PM, Kristian Rother <krother at rubor.de> wrote:
>>
>> Hi,
>>
>> I've committed a proof-of-concept implementation of how modified RNA bases
>> could be made compatible with Biopython Alphabets. Comments are very
>> welcome, especially because I had to change two lines in the Seq class to
>> make it work.
>>
>> The code can be viewed on:
>> http://github.com/krother/biopython/commit/d9f942936d6165703512099a6a2d84452fea27aa
>> (on github: krother/biopython, branch rna_alphabet).
>>
>> The two main classes are:
>> RNAAlphabetEntry(str) that contains different abbreviations for one base.
>> and
>> ModifiedRNAString(str) that behaves like a string except that it iterates
>> through RNAAlphabetEntry objects.
>>
>
> Why not create a Seq subclass instead of your class ModifiedRNAString(str)?
> This would then implement suitable (reverse) complement etc.
>
> I would also have __iter__ and __getitem__ for a single letter return an
> instance of RNAAlphabetEntry (which would act like a single character string).
>
> Peter
>
>
>
>


From biopython at maubp.freeserve.co.uk  Wed Jun 16 08:51:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Jun 2010 09:51:03 +0100
Subject: [Biopython-dev] RNA Alphabet: request for comments
In-Reply-To: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>
References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>
Message-ID: <AANLkTimATWDdddH5wmvD5i2BPRPvaJsb0qmqLVEDzfFe@mail.gmail.com>

On Wed, Jun 16, 2010 at 9:32 AM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
>> Why not create a Seq subclass instead of your class ModifiedRNAString(str)?
>
> This turned out to be a lot simpler. Worked right away. New commit at:
>
> http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70
>
> more comments welcome.

Why do you need the  _set_sequence method? Why not just put that
small piece of code inside the __init__ method?

> Next steps from my side would be:
>
> 1) add all modifications to the Alphabet.
> 2) add some RNA-specific methods.
> 3) add more tests.
> 4) sync with latest master branch.
> 5) request code merge.
>
> Best regards,
>     Kristian

If this works out we should look at doing a Protein 3-letter code version
for use with PDB sequences (I'm thinking about the modified amino acids).

Peter


From krother at rubor.de  Wed Jun 16 09:03:37 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 16 Jun 2010 11:03:37 +0200
Subject: [Biopython-dev] RNA Alphabet: request for comments
In-Reply-To: <AANLkTimATWDdddH5wmvD5i2BPRPvaJsb0qmqLVEDzfFe@mail.gmail.com>
References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>
	<AANLkTimATWDdddH5wmvD5i2BPRPvaJsb0qmqLVEDzfFe@mail.gmail.com>
Message-ID: <ba1c601c7f33e7f6ae3f22729e528388-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SUQheVw==-webmailer2@server02.webmailer.hosteurope.de>


Hi Peter,

> Why do you need the  _set_sequence method? Why not just put that
> small piece of code inside the __init__ method?

In _set_sequence there'll be a small parser taking care of modifications
where the one-letter abbreviations do not suffice. E.g. a sequence could
be

"CCC022UCCC"

(22U is a 5-hydroxyuridine).

--> being parsed into a list of RNAAlphabetEntries
['C','C','C','22U','C','C','C']

So the code will grow a little, but the basic idea stays the same.
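
A parser along those lines might look like the sketch below. The rule it
applies is an assumption inferred only from the single example above: a "0"
announces that the next three characters form one multi-character code. The
real nomenclature may differ, and the function name and parameters are made up
for this illustration.

```python
def parse_modified_rna(text, marker="0", width=3):
    """Split text into one-letter symbols and multi-character codes.

    ASSUMPTION: `marker` introduces a code that is exactly `width`
    characters long. This is guessed from the "CCC022UCCC" example.
    """
    entries = []
    i = 0
    while i < len(text):
        if text[i] == marker:
            code = text[i + 1:i + 1 + width]
            if len(code) < width:
                raise ValueError("truncated code at position %d" % i)
            entries.append(code)
            i += 1 + width
        else:
            entries.append(text[i])
            i += 1
    return entries


print(parse_modified_rna("CCC022UCCC"))
```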

If someone wants a one-letter representation, it could be "CCCxCCC", but
this is degenerate because 'x' is used for several modifications.

Best Regards,
   Kristian


>>> Why not create a Seq subclass instead of your class
>>> ModifiedRNAString(str)?
>>
>> This turned out to be a lot simpler. Worked right away. New commit at:
>>
>> http://github.com/krother/biopython/commit/b0a6071f2b08a4f9bfee33a8d675c0e21b60ba70
>>
>> more comments welcome.
>
> Why do you need the  _set_sequence method? Why not just put that
> small piece of code inside the __init__ method?
>
>> Next steps from my side would be:
>>
>> 1) add all modifications to the Alphabet.
>> 2) add some RNA-specific methods.
>> 3) add more tests.
>> 4) sync with latest master branch.
>> 5) request code merge.
>>
>> Best regards,
>>     Kristian
>
> If this works out we should look at doing a Protein 3-letter code version
> for use with PDB sequences (I'm thinking about the modified amino acids).
>
> Peter
>
>


From biopython at maubp.freeserve.co.uk  Wed Jun 16 09:41:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Jun 2010 10:41:35 +0100
Subject: [Biopython-dev] RNA Alphabet: request for comments
In-Reply-To: <ba1c601c7f33e7f6ae3f22729e528388-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SUQheVw==-webmailer2@server02.webmailer.hosteurope.de>
References: <771c2d4a49dc24d2b4fe1438a13b361d-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SXwlbVg==-webmailer2@server02.webmailer.hosteurope.de>
	<AANLkTimATWDdddH5wmvD5i2BPRPvaJsb0qmqLVEDzfFe@mail.gmail.com>
	<ba1c601c7f33e7f6ae3f22729e528388-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlheRl5SUQheVw==-webmailer2@server02.webmailer.hosteurope.de>
Message-ID: <AANLkTimp5cvKxczZYPBM3n47CBTtltrNLDOUxgCzYfoq@mail.gmail.com>

On Wed, Jun 16, 2010 at 10:03 AM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
>> Why do you need the _set_sequence method? Why not just put that
>> small piece of code inside the __init__ method?
>
> In _set_sequence there'll be a small parser taking care of modifications
> where the one-letter abbreviations do not suffice. E.g. a sequence could
> be
>
> "CCC022UCCC"
>
> (22U is a 5-hydroxyuridine).
>
> --> being parsed into a list of RNAAlphabetEntries
> ['C','C','C','22U','C','C','C']
>
> So the code will grow a little, but the basic idea stays the same.
>
> If someone wants a one-letter representation, it could be "CCCxCCC", but
> this is degenerate because 'x' is used for several modifications.
>
> Best Regards,
>   Kristian

Thinking ahead, we are planning to make the Seq objects use string
comparison instead of object identity. When that happens, I would
suggest in your subclass you implement the equality method so
that if you are comparing against another instance of the modified RNA
Seq compare at the more detailed "22U" level, and if not then for
compatibility compare at the single letter level ("x" even though degenerate).
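
As a rough illustration of that two-level comparison, here is a standalone
Python 3 sketch. It is not based on the real Seq class; the class name, the
entries attribute and the one_letter() helper are hypothetical, and it assumes
every multi-character code collapses to "x" in the one-letter form.

```python
class ModifiedRNASeq:
    """Holds a list of entries such as ['C', '22U', 'C'] (sketch only)."""

    def __init__(self, entries):
        self.entries = list(entries)

    def one_letter(self):
        # ASSUMPTION: any multi-character code is shown as the degenerate 'x'.
        return "".join(e if len(e) == 1 else "x" for e in self.entries)

    def __eq__(self, other):
        if isinstance(other, ModifiedRNASeq):
            # Same kind of object: compare at the detailed "22U" level.
            return self.entries == other.entries
        # Anything else: fall back to the single-letter representation.
        return self.one_letter() == str(other)


detailed = ModifiedRNASeq(["C", "C", "C", "22U", "C", "C", "C"])
generic = ModifiedRNASeq(["C", "C", "C", "x", "C", "C", "C"])
print(detailed == "CCCxCCC", detailed == generic)
```

Note that defining __eq__ without __hash__ makes instances unhashable in
Python 3, which would also need a decision in a real implementation.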

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Jun 16 12:43:07 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 16 Jun 2010 08:43:07 -0400
Subject: [Biopython-dev] [Bug 3100] New: Bio.PDB.ResidueDepth distance
	calculation error
Message-ID: <bug-3100-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3100

           Summary: Bio.PDB.ResidueDepth distance calculation error
           Product: Biopython
           Version: 1.54b
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: andres.colubri at gmail.com


ResidueDepth.py in Bio.PDB contains an error at line 100:

d2=sum(d*d, 1)

This uses the built-in sum() function, which adds the rows of d*d together
(starting from 1) instead of summing within each row. It should use numpy's
sum along axis 1 instead:

d2=numpy.sum(d*d, 1)
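
The difference is easy to reproduce on a toy array. The sketch below assumes
NumPy is installed; the coordinates are invented for the illustration and are
not taken from ResidueDepth.py.

```python
import numpy

# Three made-up surface points; the query point coincides with the first one.
surface = numpy.array([[0.0, 0.0, 0.0],
                       [3.0, 4.0, 0.0],
                       [1.0, 2.0, 2.0]])
coord = numpy.array([0.0, 0.0, 0.0])
d = surface - coord

# Built-in sum: start value 1, then each ROW of d*d is added in turn.
builtin_d2 = sum(d * d, 1)
# numpy.sum with axis=1: squared distance from coord to each surface point.
numpy_d2 = numpy.sum(d * d, 1)

print(numpy.sqrt(min(builtin_d2)), numpy.sqrt(min(numpy_d2)))
```

Only the numpy.sum version gives a minimum distance of zero for a point that
lies on the surface.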

To check the error, try the following code:

from Bio.PDB import *
from Bio.PDB.ResidueDepth import *
parser = PDBParser()
str = parser.get_structure('test', '3M38.pdb')
surf = get_surface('3M38.pdb', PDB_TO_XYZR='./pdb_to_xyzr', MSMS='./msms')
print min_dist(surf[10], surf)

3M38.pdb could be replaced by any other pdb file. The result of this
calculation printed to the console should be zero, since we are calculating the
minimum distance to the surface of a point belonging to the surface. But this
gives a value greater than zero.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From lueck at ipk-gatersleben.de  Wed Jun 16 13:18:00 2010
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Wed, 16 Jun 2010 15:18:00 +0200
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris
In-Reply-To: <AANLkTimJSOqEgUHokfs3P-6MNS5yKxl4_CNB5f8-X0AR@mail.gmail.com>
References: <AANLkTimJSOqEgUHokfs3P-6MNS5yKxl4_CNB5f8-X0AR@mail.gmail.com>
Message-ID: <001a01cb0d56$581dd610$1022a8c0@ipkgatersleben.de>

Hello!

Sorry for the late reply, but I have just come back from my holidays.
I went to EuroSciPy 2009 and it was really great (I also gave a talk
in which Biopython was mentioned several times ;-). Since it was difficult
to attend last time, I decided to skip it this year (basically I would have
to come privately). Unfortunately I have only now heard that the Biopython
people will be there, and I would be very interested to meet you, since I use
Biopython a lot.

I will have to see what I can still do.
It would be great to meet!

Stefanie 

-----Original Message-----
From: biopython-dev-bounces at lists.open-bio.org
[mailto:biopython-dev-bounces at lists.open-bio.org] On Behalf Of Peter
Sent: Saturday, 5 June 2010 16:50
To: Biopython-Dev Mailing List
Subject: [Biopython-dev] EuroSciPy 2010 conference in Paris

Hi all,

Are any Biopython folk planning to be at the EuroSciPy
conference in Paris this year (July 2010)? They are still
finalising the Scientific track, but the list of tutorials is
quite interesting already:

http://www.euroscipy.org/conference/euroscipy2010

Peter
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev


From bugzilla-daemon at portal.open-bio.org  Fri Jun 18 13:19:02 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 09:19:02 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastaq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006181319.o5IDJ2Oj022977@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


cjfields at bioperl.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|bioperl-guts-l at bioperl.org  |biopython-dev at biopython.org


------- Comment #3 from cjfields at bioperl.org  2010-06-18 09:18 EST -------
(In reply to comment #2)
> (In reply to comment #1)
> > I'm making a wild guess that this is Biopython and not BioPerl.  
> 
> Yes, it's Biopython. Can you help me, please? Or can you give me a link where
> I can find the answer to my problem? Thank you very much.

Reassigning to the Biopython devs.  This should go to their list now, hopefully
you'll get a response.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Fri Jun 18 13:45:37 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 09:45:37 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006181345.o5IDjbNB023730@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Error converting sff into   |Error converting sff into
                   |fastaq                      |fastq


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-18 09:45 EST -------
Thanks Chris.

Giorgio - Could you confirm which version of Biopython you are using?

To me the error message suggests the SFF file is corrupted (damaged). Is it
very large? Could you attach it to this bug (or email it to me personally) to
check?

Have you been able to process the SFF file with any other tools (e.g.
sff_extract which should work on Windows/Linux/Mac, or the Roche tools which
are Linux only)?

If you copied the SFF file over your network, or over the internet from your
sequencing center, perhaps there was an error there. Could you try
re-downloading the SFF file?

Regards,

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Fri Jun 18 15:03:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 11:03:45 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006181503.o5IF3j23025689@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


------- Comment #5 from gcasaburi at tiscali.it  2010-06-18 11:03 EST -------
(In reply to comment #4)
> Thanks Chris.
> Giorgio - Could you confirm which version of Biopython are you using?
> To me the error message suggests the SFF file is corrupted (damaged). Is it
> very large? Could you attach it to this bug (or email it to me personally) to
> check?
> Have you been able to process the SFF file with any other tools (e.g.
> sff_extract which should work on Windows/Linux/Mac, or the Roche tools which
> are Linux only)?
> If you copied the SFF file over your network, or over the internet from your
> sequencing center, perhaps there was an error there. Could you try
> re-downloading the SFF file?
> Regards,
> Peter
Thank you for the answer. I have the latest version of Biopython. The file is
1.12 GB, so I think it is difficult to attach it. The file was copied
directly from the USB port of the 454 with a pen drive and is now on a normal
PC. With Biopython I have been able to open and read this SFF file, but at the
end of the reading a message appears (Value error:...). When I try to convert
the file to FASTA the same message appears, blocking any work. So why can the
file be opened and read, with all its information (flows, lengths), but not
edited or converted? Thank you, I hope you can help us.
Greetings from ITALY


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Fri Jun 18 15:28:01 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 11:28:01 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006181528.o5IFS1iY026418@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-18 11:28 EST -------
(In reply to comment #5)
> Thank you for the answer. I have the latest version of Biopython,

Good.

> The file is 1.12 GB, so I think it is difficult to attach the file.

Yes, too big to attach or email :(

> The file has been taken directly from the usb port of the 454 with a
> pendrive and now is in a normal PC.

I would try copying it again using a different USB memory stick / pen drive.

> With Biopython I have been able to open and read this SFF file, but at the
> end of the reading a message appears (Value error:...). When I try to convert
> the file to FASTA the same message appears, blocking any work. So why can the
> file be opened and read, with all its information (flows, lengths), but not
> edited or converted? Thank you, I hope you can help us.
> Greetings from ITALY

It sounds like there is an error near the end of the file. You can open the
file and read lots of reads up until the error. If you use Bio.SeqIO.parse()
or Bio.SeqIO.convert() these will fail once you get to the bad read. Perhaps
the file is truncated (only partly copied)?

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Fri Jun 18 17:35:00 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 18 Jun 2010 13:35:00 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006181735.o5IHZ0SW030183@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


------- Comment #7 from gcasaburi at tiscali.it  2010-06-18 13:35 EST -------
(In reply to comment #6)
> (In reply to comment #5)
> > Thank you for the answer. I have the latest version of Biopython,
> 
> Good.
> 
> > The file is 1.12 GB, so I think it is difficult to attach the file.
> 
> Yes, too big to attach or email :(
> 
> > The file has been taken directly from the usb port of the 454 with a
> > pendrive and now is in a normal PC.
> 
> I would try copying it again using a different USB memory stick / pen drive.
> 
> > With Biopython I have been able to open and read this SFF file, but at the
> > end of the reading a message appears (Value error:...). When I try to convert
> > the file to FASTA the same message appears, blocking any work. So why can the
> > file be opened and read, with all its information (flows, lengths), but not
> > edited or converted? Thank you, I hope you can help us.
> > Greetings from ITALY
> 
> It sounds like there is an error near the end of the file. You can open the
> file and read lots of reads up until the error. If you use Bio.SeqIO.parse()
> or Bio.SeqIO.convert() these will fail once you get to the bad read. Perhaps
> the file is truncated (only partly copied)?
> 
> Peter
> 
I will try to copy the file again onto another pen drive. Like you, I thought
that the file may be corrupted at the end. I don't think it is truncated: it is
an .sff file that represents one region of the "ptp", and the same error appears
with another file, .sff2, that represents the second region of the "ptp" (the
same run was divided into two regions, one per sample, so two samples in
total). So I don't know whether there is some command syntax to correct the
value that causes the error.
Thank you
Giorgio


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Tue Jun 22 13:11:15 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 22 Jun 2010 09:11:15 -0400
Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord?
In-Reply-To: <bug-3060-42@http.bugzilla.open-bio.org/>
Message-ID: <201006221311.o5MDBF8o003119@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3060


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-22 09:11 EST -------
(In reply to comment #0)
> My motivating example is to take an ACE file loaded with SeqIO, remove the
> gaps, and output the contigs as FASTQ or QUAL files. This requires the
> per-letter-annotation to be sliced to match the ungapped sequence.
> 
> Likewise any features fully contained within ungapped regions should be
> retained and their co-ordinates shifted. I'm not sure if we should do anything
> about features spanning a gap - the simple option which I have implemented is
> they are lost. This is done via the existing SeqRecord slicing and addition
> code.

I've been trying to build SeqFeature objects for the reads in an ACE file,
http://github.com/peterjc/biopython/tree/ace-reads

In this case when I call the SeqRecord ungap method, many of my read features
are lost with the current implementation (because they included gaps). This
also showed the ungap code to be quite slow for features. I'm going to have
another look at this.

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Tue Jun 22 14:58:39 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 22 Jun 2010 10:58:39 -0400
Subject: [Biopython-dev] [Bug 3060] Add ungap method to the SeqRecord?
In-Reply-To: <bug-3060-42@http.bugzilla.open-bio.org/>
Message-ID: <201006221458.o5MEwd0I005797@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3060


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1482 is|0                           |1
           obsolete|                            |


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-22 10:58 EST -------
(From update of attachment 1482)
(In reply to comment #3)
> 
> I've been trying to build SeqFeature objects for the reads in an ACE file,
> http://github.com/peterjc/biopython/tree/ace-reads
> 
> In this case when I call the SeqRecord ungap method, many of my read features
> are lost with the current implementation (because they included gaps). This
> also showed the ungap code to be quite slow for features. I'm going to have
> another look at this.

My new code handles SeqFeature ungapping so as to preserve all the features by
adjusting their end points. This is also much faster:

http://github.com/peterjc/biopython/tree/ungap2
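
The end-point adjustment can be illustrated without Biopython. The sketch
below is not the code on the branch: the function name, the plain
(start, end) tuples standing in for SeqFeature locations and the list of
per-letter qualities are assumptions for the example.

```python
def ungap_with_features(seq, qualities, features, gap="-"):
    """Remove gaps, slicing per-letter values and shifting feature ends.

    features is a list of (start, end) pairs, zero-based and half-open.
    """
    # new_index[i] = number of non-gap letters strictly before position i.
    new_index = []
    kept = 0
    for letter in seq:
        new_index.append(kept)
        if letter != gap:
            kept += 1
    new_index.append(kept)  # lets end == len(seq) be looked up as well

    ungapped = seq.replace(gap, "")
    new_qualities = [q for letter, q in zip(seq, qualities) if letter != gap]
    # Every feature is kept; a feature spanning a gap simply gets shorter.
    new_features = [(new_index[s], new_index[e]) for s, e in features]
    return ungapped, new_qualities, new_features


result = ungap_with_features("AC--GT-A",
                             [10, 20, 0, 0, 30, 40, 0, 50],
                             [(0, 2), (1, 6), (4, 8)])
print(result)
```

Building the position table once keeps the cost linear in the sequence length
plus the number of features, instead of slicing the record once per gap.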


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From anaryin at gmail.com  Tue Jun 22 19:25:17 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 22 Jun 2010 14:25:17 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
Message-ID: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>

Hello all,

I've been using some non-standard pdb files outputted by some programs and
they miss the chemical element column in each ATOM line. I was looking at
the PDBParser code and element is dealt with like this:

        if element is None:
            import warnings
            from PDBExceptions import PDBConstructionWarning
            warnings.warn("Atom object (name=%s) without element" % name,
                          PDBConstructionWarning)
            element = "?"
            print name, "--> ?"
        elif len(element)>2 or element != element.upper() \
                or element != element.strip():
            raise ValueError(element)
        self.element=element


In my case, the element field is not None but just an empty string (''),
which slips past both of these tests and is then passed on. This would be no problem at
all, but I've added a "mass" attribute to the Atom object defined like this:

        self.mass = IUPACData.atom_weights[element]

I've added the ? to the atom_weights list as I thought it would deal with
the empty element cases.

I'd suggest adding to the first if statement a test to check if the element
string is empty and if so, treat it as None.

        if element is None or element == '':
            import warnings
            from PDBExceptions import PDBConstructionWarning
            warnings.warn("Atom object (name=%s) without element" % name,
                          PDBConstructionWarning)
            element = "?"
            print name, "--> ?"
        elif len(element)>2 or element != element.upper() \
                or element != element.strip():
            raise ValueError(element)
        self.element=element


What do you think?

Best!

João [...] Rodrigues
@ http://doeidoei.wordpress.org


From biopython at maubp.freeserve.co.uk  Wed Jun 23 09:11:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Jun 2010 10:11:06 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
Message-ID: <AANLkTinz_AEn08DU-61V1i5xGA6N5sxYDf-JUCJXmrNH@mail.gmail.com>

On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Hello all,
>
> I've been using some non-standard pdb files outputted by some programs and
> they miss the chemical element column in each ATOM line. I was looking at
> the PDBParser code and element is dealt with like this:
>
>         if element is None:
>             import warnings
>             from PDBExceptions import PDBConstructionWarning
>             warnings.warn("Atom object (name=%s) without element" % name,
>                           PDBConstructionWarning)
>             element = "?"
>             print name, "--> ?"
>         elif len(element)>2 or element != element.upper() \
>                 or element != element.strip():
>             raise ValueError(element)
>         self.element=element
>
>
> In my case, the element line is not "None" but just an empty string - ' ' -
> which fails these tests and is then passed on.

That makes sense, since element=line[76:78].strip() will give an empty
string. A change as you suggest makes sense, but I think just using
"if element:" would be nicer.

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 23 10:28:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Jun 2010 11:28:22 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
Message-ID: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>

Hi all,

From some unit test output posted by Manabu Ishii via Twitter I
think the test suite is having problems checking for external tools
on non-English operating systems (e.g. Debian in Japanese):
http://d.hatena.ne.jp/manabou/20100619
http://twitter.com/manabou

I've tried to update a few to do a better job (test_Muscle_tool.py,
test_Clustalw_tool.py and test_Emboss.py), but what I really need
is someone to run the test suite on a non English system - ideally
without all these command line tools installed. The tests should
notice when the tool is missing, and be skipped without errors.
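
One way to make such a check independent of the system language is to look
for the executable directly rather than inspecting any message printed by the
shell. The sketch below uses only the Python 3 standard library and is not how
the Biopython test suite does it; require_tool and the example test class are
hypothetical names.

```python
import shutil
import unittest


def require_tool(executable):
    """Return the full path to executable, or skip the calling test."""
    path = shutil.which(executable)
    if path is None:
        # No shell output is parsed, so the locale cannot interfere.
        raise unittest.SkipTest("%s not found on the PATH" % executable)
    return path


class ExampleToolTest(unittest.TestCase):
    def test_tool_is_available(self):
        # Skipped cleanly, without an error, when MUSCLE is not installed.
        tool = require_tool("muscle")
        self.assertTrue(tool)
```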

Could anyone with a non-English OS try running the latest code
from git (or even the latest release) to see if you get similar
problems?

Thanks,

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Jun 23 13:21:25 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Jun 2010 09:21:25 -0400
Subject: [Biopython-dev] [Bug 3102] Error converting sff into fastq
In-Reply-To: <bug-3102-42@http.bugzilla.open-bio.org/>
Message-ID: <201006231321.o5NDLPm0017094@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3102


------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-23 09:21 EST -------
Hi Giorgio,

Did copying the file again help?

In addition to trying to read the SFF files with other tools (like sff_extract
or the Roche sffinfo) as suggested, I have some additional things you could
try.

Firstly try this private function to see how many reads there should be:

filename = r"C:\Users\Giorgio Casaburi\Desktop\sff\GIK1EHM01.sff"
from Bio import SeqIO
print SeqIO.SffIO._sff_file_header(open(filename, "rb"))[3]

Then compare this to the number of reads you could extract up until the error.

Secondly, see if the index can be loaded or not:

filename = r"C:\Users\Giorgio Casaburi\Desktop\sff\GIK1EHM01.sff"
from Bio import SeqIO
d = SeqIO.index(filename, "sff")
print len(d)

If it is just one or two bad reads, this may allow you to jump to specific
records (and so avoid getting stuck on the bad ones).

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From anaryin at gmail.com  Wed Jun 23 16:52:47 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Wed, 23 Jun 2010 11:52:47 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTinz_AEn08DU-61V1i5xGA6N5sxYDf-JUCJXmrNH@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com> 
	<AANLkTinz_AEn08DU-61V1i5xGA6N5sxYDf-JUCJXmrNH@mail.gmail.com>
Message-ID: <AANLkTin8KM8BYJc9sr-KX1o1l7zJ5A901Tv0QvEvd0nt@mail.gmail.com>

Ok, I've changed it in my local branch to "if not element:" since that covers
both None and empty strings.
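
A minimal sketch of why that single test is enough (the helper name
is_missing is made up for illustration, it is not part of Bio.PDB):

```python
def is_missing(element):
    """Return True when no usable element symbol was parsed."""
    # "not element" is True for None and for the empty string alike.
    return not element


# A blank element field strips down to the empty string, not to None.
blank_field = "  ".strip()

print(is_missing(None))         # True
print(is_missing(blank_field))  # True
print(is_missing("C"))          # False
```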

Best,

João [...] Rodrigues
@ http://doeidoei.wordpress.org


On Wed, Jun 23, 2010 at 4:11 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues <anaryin at gmail.com> wrote:
> > Hello all,
> >
> > I've been using some non-standard pdb files outputted by some programs
> and
> > they miss the chemical element column in each ATOM line. I was looking at
> > the PDBParser code and element is dealt with like this:
> >
> >        if element is None:
> >            import warnings
> >            from PDBExceptions import PDBConstructionWarning
> >            warnings.warn("Atom object (name=%s) without element" % name,
> >                          PDBConstructionWarning)
> >            element = "?"
> >            print name, "--> ?"
> >        elif len(element)>2 or element != element.upper() or element !=
> > element.strip():
> >            raise ValueError(element)
> >        self.element=element
> >
> >
> > In my case, the element line is not "None" but just an empty string - ' '
> -
> > which fails these tests and is then passed on.
>
> That makes sense, since element=line[76:78].strip() will give an empty
> string. A change as you suggest makes sense, but I think just using
> "if element:" would be nicer.
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Thu Jun 24 08:26:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Jun 2010 09:26:50 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTin8KM8BYJc9sr-KX1o1l7zJ5A901Tv0QvEvd0nt@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
	<AANLkTinz_AEn08DU-61V1i5xGA6N5sxYDf-JUCJXmrNH@mail.gmail.com>
	<AANLkTin8KM8BYJc9sr-KX1o1l7zJ5A901Tv0QvEvd0nt@mail.gmail.com>
Message-ID: <AANLkTilOh1wWfJZexI47ohbrnharVbXFvMIUyB4L9YAW@mail.gmail.com>

On Wed, Jun 23, 2010 at 5:52 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Ok, I've changed it in my local branch to if not element since that covers
> both None and empty strings.
>
> Best,
>
> João [...] Rodrigues
> @ http://doeidoei.wordpress.org

If you've made that little change as a single commit, then I can use
git cherry-pick to apply it to the master branch. But first you need to
push this work to github.com

Peter


From biopython at maubp.freeserve.co.uk  Thu Jun 24 08:32:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Jun 2010 09:32:46 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
Message-ID: <AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com>

On Tue, Jun 22, 2010 at 8:25 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Hello all,
>
> I've been using some non-standard pdb files outputted by some programs and
> they miss the chemical element column in each ATOM line. ... This would be no
> problem at all, but I've added a "mass" attribute to the Atom object defined like this:
>
>        self.mass = IUPACData.atom_weights[element]
>
> I've added the ? to the atom_weights list as I thought it would deal with
> the empty element cases.

I wonder if using None or NAN would be better than zero here? Or just an
exception. This is difficult for me to say without a better idea of what you
will be using the atomic weights for.

On a separate point, if you have an old fashioned PDB file without the element
column, you can probably work out the element anyway. For example CA in
a normal amino acid residue means the alpha carbon, so the element is
carbon (although in a HETATM there is a possibility it is calcium, I think).
So I think it would be possible to infer the element in many cases (but not
all). However, this is going to be a reasonable amount of work to write and
test. How common is this kind of PDB file for the work you are doing - do
many modelling packages omit the element?

Have you contacted the program authors to request they include the
element column in future?

Peter


From anaryin at gmail.com  Thu Jun 24 16:36:36 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 24 Jun 2010 11:36:36 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com> 
	<AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com>
Message-ID: <AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com>

>
> I wonder if using None or NAN would be better than zero here? Or just an
> exception. This is difficult for me to say without a better idea of what
> you
> will be using the atomic weights for.
>

Right now I'm just using them for the center of mass calculation.


>
> On a separate point, if you have an old fashioned PDB file without the
> element
> column, you can probably work out the element anyway. For example CA in
> a normal amino acids residue means the alpha carbon, so the element is
> carbon (although in a HETATM there is a possibility it is Calcium I think).
> So I think it would be possible to infer the element in many cases (but not
> all). However, this is going to be a reasonable amount of work to write and
> test.


For non-HETATMs it's possible from the first letter of the atom name (or it
is H if the first letter is a digit). For HETATMs, names match elements
IIRC.

Do you think it's worth the try? It shouldn't be hard to write and the cases
where it would fail would be sporadic.


> How common are this kind of PDB file for the work you are doing - do
> many modelling packages omit the element?


> Have you contacted the program authors to request they include the
> element column in future?
>

Well... several packages do this, especially web servers. Contacting their
authors wouldn't bring that many favourable answers IMO.


I've committed it here:
http://github.com/JoaoRodrigues/biopython/commit/29f48e8f97870530520884fa6b8c9b70d87ba8bc

I commented out the self.mass part since we're still working on it.

Best,

J


From biopython at maubp.freeserve.co.uk  Thu Jun 24 16:54:41 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Jun 2010 17:54:41 +0100
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com>
	<AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com>
	<AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com>
Message-ID: <AANLkTikhslYyt6mn8Gtd_WrWmNsEpYkuCSD8cuEtXXv9@mail.gmail.com>

On Thu, Jun 24, 2010 at 5:36 PM, João Rodrigues <anaryin at gmail.com> wrote:
>>
>> I wonder if using None or NAN would be better than zero here? Or just an
>> exception. This is difficult for me to say without a better idea of what
>> you will be using the atomic weights for.
>>
>
> Right now I'm just using them for the center of mass calculation.
>

Well if you don't know an atom's mass, you can't calculate the real
center of mass. Maybe this should throw an exception?

>> On a separate point, if you have an old fashioned PDB file without the
>> element column, you can probably work out the element anyway. ...
>
> From non HETATMs its possible from the first letter of the atom name (or it
> is H if the first letter is a digit). For HETATMs, names match elements
> IIRC.
>
> Do you think it's worth the try? It shouldn't be hard to write and the cases
> where it would fail would be sporadic.

Eric - what do you think?

>> How common are this kind of PDB file for the work you are doing - do
>> many modelling packages omit the element?
>
>
>> Have you contacted the program authors to request they include the
>> element column in future?
>>
>
> Well... several packages make this, specially webservers.. Contacting them
> authors wouldn't bring those many favourable answers IMO.

I'd ask politely anyway ;)

> I've commited it here:
> http://github.com/JoaoRodrigues/biopython/commit/29f48e8f97870530520884fa6b8c9b70d87ba8bc
>
> I commented out the self.mass part since we're still working on it.

I've cherry-picked that for the trunk - could you test the master branch
please (just to make sure this worked as you expected)?

Thanks,

Peter


From eric.talevich at gmail.com  Thu Jun 24 18:05:11 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 24 Jun 2010 14:05:11 -0400
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTikhslYyt6mn8Gtd_WrWmNsEpYkuCSD8cuEtXXv9@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com> 
	<AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com> 
	<AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com> 
	<AANLkTikhslYyt6mn8Gtd_WrWmNsEpYkuCSD8cuEtXXv9@mail.gmail.com>
Message-ID: <AANLkTimvyJSycfL2707i-DyizcE5xcJ950duGV8MqiSt@mail.gmail.com>

On Thu, Jun 24, 2010 at 12:54 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Thu, Jun 24, 2010 at 5:36 PM, João Rodrigues <anaryin at gmail.com> wrote:
> >>
> >> I wonder if using None or NAN would be better than zero here? Or just an
> >> exception. This is difficult for me to say without a better idea of what
> >> you will be using the atomic weights for.
> >>
> >
> > Right now I'm just using them for the center of mass calculation.
> >
>
> Well if you don't know an atom's mass, you can't calculate the real
> center of mass. Maybe this should throw an exception?
>

And the center of mass calculation was for coarse-graining structures,
right? What would be most useful there?

(a) Give unknown atoms a weight of 0.0, so CoM essentially disregards them
(b) Give unknown atoms a weight of None, and have CoM check for this and
disregard those atoms (similar effect) -- preferably issuing a warning
(c) Like (b), but CoM raises an exception
(d) Give CoM a keyword argument for how to treat this (e.g.
strict=True/False), so coarse-graining can be permissive but direct use of
CoM can raise an exception if desired. (However, if warnings are used then
the warnings module already lets you convert specific warnings into
exceptions.)
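
To make options (b) to (d) concrete, here is a rough sketch. The function
center_of_mass, the MissingMassWarning class and the (coord, mass) input
layout are all invented for illustration; none of this is existing Bio.PDB
code.

```python
import warnings


class MissingMassWarning(UserWarning):
    """Hypothetical warning class, used only in this sketch."""


def center_of_mass(atoms, strict=False):
    """atoms: iterable of ((x, y, z), mass) pairs, where mass may be None."""
    total = 0.0
    weighted = [0.0, 0.0, 0.0]
    for coord, mass in atoms:
        if mass is None:
            if strict:
                # Option (c): refuse to return a misleading answer.
                raise ValueError("atom at %r has no known mass" % (coord,))
            # Option (b): leave the atom out, but say so.
            warnings.warn("skipping atom at %r: unknown mass" % (coord,),
                          MissingMassWarning)
            continue
        total += mass
        for i in range(3):
            weighted[i] += mass * coord[i]
    if total == 0.0:
        raise ValueError("no atoms with a known mass")
    return tuple(w / total for w in weighted)


atoms = [((0.0, 0.0, 0.0), 12.0), ((2.0, 0.0, 0.0), 12.0), ((9.0, 9.0, 9.0), None)]
print(center_of_mass(atoms))  # (1.0, 0.0, 0.0), plus a warning for the third atom

# The warnings module can turn that specific warning into an exception,
# which gives the strict behaviour without the keyword argument.
with warnings.catch_warnings():
    warnings.simplefilter("error", MissingMassWarning)
    try:
        center_of_mass(atoms)
    except MissingMassWarning as err:
        print("now an exception:", err)
```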


 >> On a separate point, if you have an old fashioned PDB file without the
> >> element column, you can probably work out the element anyway. ...
> >
> > From non HETATMs its possible from the first letter of the atom name (or
> it
> > is H if the first letter is a digit). For HETATMs, names match elements
> > IIRC.
> >
> > Do you think it's worth the try? It shouldn't be hard to write and the
> cases
> > where it would fail would be sporadic.
>
> Eric - what do you think?
>

Sounds useful to me. Where would it fail, and how should failures be
treated? Unrecognized atom names, and then issue a warning and leave the
element attribute blank? (See options above...)

Cheers,
Eric


From anaryin at gmail.com  Thu Jun 24 18:25:45 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 24 Jun 2010 13:25:45 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTimvyJSycfL2707i-DyizcE5xcJ950duGV8MqiSt@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com> 
	<AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com> 
	<AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com> 
	<AANLkTikhslYyt6mn8Gtd_WrWmNsEpYkuCSD8cuEtXXv9@mail.gmail.com> 
	<AANLkTimvyJSycfL2707i-DyizcE5xcJ950duGV8MqiSt@mail.gmail.com>
Message-ID: <AANLkTin26e0xe-fpkJ-uBUVRT3zuyh_vDYoXquK2VOL6@mail.gmail.com>

>
> And the center of mass calculation was for coarse-graining structures,
> right? What would be most useful there?
>

> (a) Give unknown atoms a weight of 0.0, so CoM essentially disregards them
>

CoM takes the number of atoms into account, so 0.0 will not work anyway.


>  (b) Give unknown atoms a weight of None, and have CoM check for this and
> disregard those atoms (similar effect) -- preferably issuing a warning
>

I'd prefer this: exclude those atoms from the calculation. But then this
might have an impact on the location of the center of mass.


> (c) Like (b), but CoM raises an exception
> (d) Give CoM a keyword argument for how to treat this (e.g.
> strict=True/False), so course-graining can be permissive but direct use of
> CoM can raise an exception if desired. (However, if warnings are used then
> the warnings module already lets you convert specific warnings into
> exceptions.)
>

My suggestion: CoM can be either geometric or gravitational (mass-weighted).
The first assumes equal mass for every atom, the second does not. If an
atom's mass is unknown, the CoM would default to geometric and issue a
warning. Having a flag in CoM can also be valuable, but I guess this would
be redundant with the warning/exception (permissive/strict) in the Atom class.
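
Roughly what I have in mind, as a sketch only (the function name "center"
and the list-of-pairs input are placeholders, not the actual code in my
branch):

```python
import warnings


def center(atoms):
    """atoms: sequence of ((x, y, z), mass) pairs, where mass may be None."""
    atoms = list(atoms)
    if not atoms:
        raise ValueError("no atoms given")
    if any(mass is None for _, mass in atoms):
        # At least one mass is unknown: treat every atom as equally heavy.
        warnings.warn("unknown mass found; returning the geometric center")
        weights = [1.0] * len(atoms)
    else:
        weights = [mass for _, mass in atoms]
    total = sum(weights)
    return tuple(
        sum(w * coord[i] for w, (coord, _) in zip(weights, atoms)) / total
        for i in range(3)
    )


print(center([((0.0, 0.0, 0.0), 1.0), ((4.0, 0.0, 0.0), 3.0)]))   # (3.0, 0.0, 0.0)
print(center([((0.0, 0.0, 0.0), None), ((4.0, 0.0, 0.0), 3.0)]))  # (2.0, 0.0, 0.0)
```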


>
>
>  >> On a separate point, if you have an old fashioned PDB file without the
>> >> element column, you can probably work out the element anyway. ...
>> >
>> > From non HETATMs its possible from the first letter of the atom name (or
>> it
>> > is H if the first letter is a digit). For HETATMs, names match elements
>> > IIRC.
>> >
>> > Do you think it's worth the try? It shouldn't be hard to write and the
>> cases
>> > where it would fail would be sporadic.
>>
>> Eric - what do you think?
>>
>
> Sounds useful to me. Where would it fail, and how should failures be
> treated? Unrecognized atom names, and then issue a warning and leave the
> element attribute blank? (See options above...)
>

I'd implement it in the Atom class. Instead of having this check (lines
75-76):

        elif len(element)>2 or element != element.upper() or element != element.strip():
            raise ValueError(element)

there would be a check against IUPACData.atom_weights.keys(). If the element
is not found, then it would try to check the atom name and issue a warning.
If this fails, an exception is thrown.

Sounds good?

Best!

J


From anaryin at gmail.com  Thu Jun 24 20:25:23 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 24 Jun 2010 15:25:23 -0500
Subject: [Biopython-dev] Parsing "element" out of PDB file
In-Reply-To: <AANLkTin26e0xe-fpkJ-uBUVRT3zuyh_vDYoXquK2VOL6@mail.gmail.com>
References: <AANLkTinM9pwht5_zta170QDipBen-XeAHoXBEvKBC_7e@mail.gmail.com> 
	<AANLkTin5kMjfMe-hpviCH4TIo4Cz1vfnwPihMRUGqLPa@mail.gmail.com> 
	<AANLkTim22bd6-wSUY_3bp9FQOG3k5uVCXBt-6N-lU85G@mail.gmail.com> 
	<AANLkTikhslYyt6mn8Gtd_WrWmNsEpYkuCSD8cuEtXXv9@mail.gmail.com> 
	<AANLkTimvyJSycfL2707i-DyizcE5xcJ950duGV8MqiSt@mail.gmail.com> 
	<AANLkTin26e0xe-fpkJ-uBUVRT3zuyh_vDYoXquK2VOL6@mail.gmail.com>
Message-ID: <AANLkTinQ9E9wekA7ZttIGavZDDIjACJhJ18QaqB2Ra83@mail.gmail.com>

Ok, I was looking at the element assignment and there's a slight problem. I
thought I could easily find out whether the atom comes from an ATOM or a
HETATM record, but since the "parenting" of the Atom is only done *after*
the Atom is created, there is no way (as is) of knowing where it comes from.
Therefore, I thought of the following workaround. *hetero_flag* is already
defined when the Atom is created. It could be passed to the Atom as another
of its arguments.

It would then be a conditional like this inside the Atom class:

if not element or element not in IUPACData.atom_weights:
    # No usable element given: try to recover it from the atom name
    if hetatm:
        # HETATM names are expected to match the element symbol
        guess = name
    elif name[0].isdigit():
        # e.g. "1HB": skip the leading digit
        guess = name[1]
    else:
        guess = name[0]
    if guess in IUPACData.atom_weights:
        element = guess
    else:
        element = "?"
# Otherwise the element was given and is in IUPACData, so keep it as is

The advantage is that if you don't give an element, or if it fails the
IUPACData check, it will try to recover it from the atom name.

It also makes it possible to throw an exception when the element is not
found. Or a warning, since for now only the CoM function uses it and it has
a failsafe against it (defaults to geometrical).
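
As a self-contained sketch of the idea: ATOM_WEIGHTS below is a small
stand-in table with upper-case keys (an assumption made so this runs
without Biopython), and guess_element is a made-up name.

```python
# Stand-in for a table of atomic weights keyed by element symbol. Only a
# few entries are listed, and the upper-case keys are an assumption.
ATOM_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
                "S": 32.06, "CA": 40.078, "FE": 55.845, "ZN": 65.38}


def guess_element(name, is_hetatm, element=None):
    """Return an element symbol, or None when nothing sensible can be inferred."""
    if element and element in ATOM_WEIGHTS:
        return element  # given and recognised: keep it
    name = name.strip().upper()
    if not name:
        return None
    if is_hetatm:
        candidate = name       # e.g. "ZN" or "CA" for an ion
    elif name[0].isdigit():
        candidate = name[1:2]  # e.g. "1HB" -> "H"
    else:
        candidate = name[0]    # e.g. "CA" -> "C" (alpha carbon)
    return candidate if candidate in ATOM_WEIGHTS else None


print(guess_element("CA", is_hetatm=False))   # C
print(guess_element("CA", is_hetatm=True))    # CA
print(guess_element("1HB", is_hetatm=False))  # H
print(guess_element("O1", is_hetatm=True))    # None
```

The last call shows one place where this fails: ligand atoms in HETATM
records often carry names such as O1 or C2 that are not bare element
symbols, so those would still end up with a warning or an exception.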

Opinions?

João


From bugzilla-daemon at portal.open-bio.org  Fri Jun 25 11:49:35 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Jun 2010 07:49:35 -0400
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing,
	in particular location parsing
In-Reply-To: <bug-2738-42@http.bugzilla.open-bio.org/>
Message-ID: <201006251149.o5PBnZpA007121@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1327 is|0                           |1
           obsolete|                            |


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Fri Jun 25 11:51:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Jun 2010 07:51:16 -0400
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing,
	in particular location parsing
In-Reply-To: <bug-2738-42@http.bugzilla.open-bio.org/>
Message-ID: <201006251151.o5PBpGE9007286@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1329 is|0                           |1
           obsolete|                            |


------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-25 07:51 EST -------
(From update of attachment 1329)
I've got a branch using regular expressions which seems to cover all the
location strings I've found in testing. It is at least twice the speed of the
old parser.

http://github.com/peterjc/biopython/tree/location-parsing2


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython at maubp.freeserve.co.uk  Fri Jun 25 15:21:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Jun 2010 16:21:46 +0100
Subject: [Biopython-dev] Re-written GenBank/EMBL feature location parsing
Message-ID: <AANLkTikgejWghQbe4LJnx82u7sCEi2A911O3BIg6JijW@mail.gmail.com>

Hi all,

I've been working on and off recently on rewriting the location
parsing for GenBank/EMBL features:
http://bugzilla.open-bio.org/show_bug.cgi?id=2738

I have a branch ready for public testing,
http://github.com/peterjc/biopython/commits/location-parsing2

The old code is still there (and indeed right now gets used as a fall
back with a warning if an unrecognised location is seen). I'd like to
label it (plus Bio.Parsers and Bio.Parsers.spark) as obsolete for the
next release, and then deprecate them in the subsequent release.

The old code takes each location string, parses it with SPARK and
generates a set of token objects for each element (see the code in
Bio.GenBank.LocationParser) and then turns that into SeqFeature
location and position objects. All this object creation is probably a
major reason why the old code is slow.

The new code takes each location string, and parses it with a mix
of regular expressions and simple Python code, and then builds
the SeqFeature location and position objects. On my tests this is
at least twice as fast, typically between three and four times faster.
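
To give a flavour of the approach, here is a much reduced sketch. It is
not the code from the branch: the patterns and the function name
parse_simple_location are made up for this email, and it only handles
plain "start..end" pairs, optionally wrapped in complement(...).

```python
import re

# Matches only the simplest forms: "123..456" and "<1..>200". Joins,
# single bases, between-base positions and so on are deliberately left out.
_SIMPLE = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")
_COMPLEMENT = re.compile(r"^complement\((.+)\)$")


def parse_simple_location(text):
    """Return (start, end, strand) in Python slice coordinates, or None."""
    strand = 1
    match = _COMPLEMENT.match(text)
    if match:
        strand = -1
        text = match.group(1)
    match = _SIMPLE.match(text)
    if match is None:
        # Unrecognised: this is where a fall back parser would take over.
        return None
    # GenBank counts from one and includes the end base, so only the start
    # moves. The fuzzy "<" and ">" markers are ignored in this sketch.
    return (int(match.group(2)) - 1, int(match.group(4)), strand)


print(parse_simple_location("123..456"))              # (122, 456, 1)
print(parse_simple_location("complement(<1..>200)"))  # (0, 200, -1)
print(parse_simple_location("join(1..5,8..10)"))      # None
```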

The intention is this parser change will result in no functional
changes at all.

As part of this work I have been extending the feature unit tests,
and have also run some more extensive additional tests locally
(GenBank files for plants, viruses, environmental samples etc).
I'm reasonably sure this covers all the location variants... but
with GenBank and EMBL files you can never be sure ;)

Would anyone like to volunteer to test the new branch before
I merge it to the trunk? I'm also interested in comments on the
code itself. Note I have tried to avoid any refactoring until the
old code is actually deprecated.

Thanks,

Peter


From bugzilla-daemon at portal.open-bio.org  Fri Jun 25 17:46:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Jun 2010 13:46:14 -0400
Subject: [Biopython-dev] [Bug 3103] New: Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
Message-ID: <bug-3103-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103

           Summary: Possibly corrupt - ncbi_taxonomy_mollusca.xml.zip in
                    Tests/PhyloXML
           Product: Biopython
           Version: 1.54
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: minor
          Priority: P5
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: vimalkumarvelayudhan at gmail.com


I created an RPM recently for Biopython version 1.54 and got this error from
rpmlint

python-biopython.i586: W: unable-to-read-zip /usr/share/python-biopython/Tests/PhyloXML/ncbi_taxonomy_mollusca.xml.zip: Bad magic number for central directory

This appears for both the .tar.gz and the .zip version. I could do a manual
unzip of the file though.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Sun Jun 27 15:31:11 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 27 Jun 2010 11:31:11 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To: <bug-3103-42@http.bugzilla.open-bio.org/>
Message-ID: <201006271531.o5RFVBTP001043@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103


------- Comment #1 from eric.talevich at gmail.com  2010-06-27 11:31 EST -------
Interesting. Where did you get this release of Biopython 1.54? From PyPI, or
GitHub?

I downloaded this file from phyloxml.org originally, and haven't changed it.
This file is used in the unit tests, and Python's zipfile library doesn't seem
to have any trouble opening it. The 'file' command on Ubuntu 10.04 identifies
it as:
"Zip archive data, at least v2.0 to extract"

It's actually not a very important part of the unit tests anyway, so if it's
causing you trouble, I could give you a patch to remove this file from the unit
tests.

(If you're taking patches, there's a bug in Bio.Phylo's Nexus parsing that I'd
like to include a fix for, too. It's fixed in Biopython's trunk already, but
slipped past our release process for v.1.54.)


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Sun Jun 27 16:45:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 27 Jun 2010 12:45:28 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To: <bug-3103-42@http.bugzilla.open-bio.org/>
Message-ID: <201006271645.o5RGjSBd019564@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103


------- Comment #2 from vimalkumarvelayudhan at gmail.com  2010-06-27 12:45 EST -------
The archives were downloaded from 
http://biopython.org/DIST/biopython-1.54.tar.gz
http://biopython.org/DIST/biopython-1.54.zip

I could remove the zip file during the build process and can also patch the
Phylo.Nexus for the next release if you could forward it to me.


(In reply to comment #1)
> Interesting. Where did you get this release of Biopython 1.54? From PyPI, or
> GitHub?
> 
> I downloaded this file from phyloxml.org originally, and haven't changed it.
> This file is used in the unit tests, and Python's zipfile library doesn't seem
> to have any trouble opening it. The 'file' command on Ubuntu 10.04 identifies
> it as:
> "Zip archive data, at least v2.0 to extract"
> 
> It's actually not a very important part of the unit tests anyway, so if it's
> causing you trouble, I could give you a patch to remove this file from the unit
> tests.
> 
> (If you're taking patches, there's a bug in Bio.Phylo's Nexus parsing that I'd
> like to include a fix for, too. It's fixed in Biopython's trunk already, but
> slipped past our release process for v.1.54.)
> 


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython at maubp.freeserve.co.uk  Sun Jun 27 22:21:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 27 Jun 2010 23:21:43 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
Message-ID: <AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>

On Wed, Jun 23, 2010 at 11:28 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> From some unit test output posted by Manabu Ishii via Twitter I
> think the test suite is having problems checking for external tools
> on non-English operating systems (e.g. Debian in Japanese):
> http://d.hatena.ne.jp/manabou/20100619
> http://twitter.com/manabou
>
> I've tried to update a few to do a better job (test_Muscle_tool.py,
> test_Clustalw_tool.py and test_Emboss.py), but what I really need
> is someone to run the test suite on a non English system - ideally
> without all these command line tools installed. The tests should
> notice when the tool is missing, and be skipped without errors.
>
> Could anyone with a non-English OS try running the latest code
> from git (or even the latest release) to see if you get similar
> problems?

I've also included an idea from Manabu Ishii to set environment
variable LANG=C to get the default of USA English. This should
work on Linux etc, and is probably harmless on Windows.
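
The idea, sketched outside the test suite (the helper name tool_output is
made up here, and this is not the exact code in the tests):

```python
import os
import subprocess
import sys


def tool_output(args):
    """Run a command with LANG=C; return its output, or None if it is missing."""
    env = dict(os.environ, LANG="C")
    try:
        child = subprocess.Popen(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            env=env,
        )
    except OSError:
        # The executable could not be started, e.g. it is not installed.
        return None
    output, _ = child.communicate()
    return output


print(tool_output(["no-such-tool-for-this-sketch"]))               # None
print(tool_output([sys.executable, "-c", "print('ok')"]).strip())  # ok
```

Catching OSError avoids having to match a translated "command not found"
message at all. One caveat: if LC_ALL is set in the environment it takes
precedence over LANG, so a stricter version might need to override that
too.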

Again, testing would be most welcome (any non-English OS),

Thanks

Peter


From bugzilla-daemon at portal.open-bio.org  Mon Jun 28 12:23:25 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Jun 2010 08:23:25 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To: <bug-3103-42@http.bugzilla.open-bio.org/>
Message-ID: <201006281223.o5SCNPog015539@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103


------- Comment #3 from eric.talevich at gmail.com  2010-06-28 08:23 EST -------
Created an attachment (id=1517)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1517&action=view)
Patch to remove ncbi_xml_mollusca.xml.zip from the Phylo unit test

This patch should fix the problem reported in Bug 3103. Created with git
format-patch.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Mon Jun 28 12:25:20 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Jun 2010 08:25:20 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To: <bug-3103-42@http.bugzilla.open-bio.org/>
Message-ID: <201006281225.o5SCPKo9015639@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103


------- Comment #4 from eric.talevich at gmail.com  2010-06-28 08:25 EST -------
Created an attachment (id=1518)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1518&action=view)
Patch to fix a bug in NexusIO

This patch fixes another bug in NexusIO, parsing the support values on
branches.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From k.okonechnikov at gmail.com  Mon Jun 28 17:55:30 2010
From: k.okonechnikov at gmail.com (Konstantin Okonechnikov)
Date: Tue, 29 Jun 2010 00:55:30 +0700
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
Message-ID: <AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>

Peter,
I have built and run the latest code from git on Russian Ubuntu 10.04.
The Entrez tests failed. The Muscle, clustal and emboss tests were skipped
successfully.
The tests were executed from the build.py script and I am not sure how to
generate a test report. Redirecting the script output to a file didn't help.


On Mon, Jun 28, 2010 at 5:21 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Wed, Jun 23, 2010 at 11:28 AM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > Hi all,
> >
> > From some unit test output posted by Manabu Ishii via Twitter I
> > think the test suite is having problems checking for external tools
> > on non-English operating systems (e.g. Debian in Japanese):
> > http://d.hatena.ne.jp/manabou/20100619
> > http://twitter.com/manabou
> >
> > I've tried to update a few to do a better job (test_Muscle_tool.py,
> > test_Clustalw_tool.py and test_Emboss.py), but what I really need
> > is someone to run the test suite on a non English system - ideally
> > without all these command line tools installed. The tests should
> > notice when the tool is missing, and be skipped without errors.
> >
> > Could anyone with a non-English OS try running the latest code
> > from git (or even the latest release) to see if you get similar
> > problems?
>
> I've also included an idea from Manabu Ishii to set environment
> variable LANG=C to get the default of USA English. This should
> work on Linux etc, and is probably harmless on Windows.
>
> Again, testing would be most welcome (any non-English OS),
>
> Thanks
>
> Peter


-- 
Best regards,
        Konstantin


From biopython at maubp.freeserve.co.uk  Tue Jun 29 09:57:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 29 Jun 2010 10:57:27 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
Message-ID: <AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>

On Mon, Jun 28, 2010 at 6:55 PM, Konstantin Okonechnikov
<k.okonechnikov at gmail.com> wrote:
> Peter,
> I have built and run the latest code from git on Russian Ubuntu 10.4.

Thank you,

> Entrez tests have failed.

That can happen due to network problems. I'd like to see the error though.

> Muscle, clustal and emboss tests have been skipped successfully.

Good :)

> The tests were run from the build.py script, and I am not sure how to
> generate a test report. Redirecting the script output to a file didn't help.

I normally just run "python setup.py test" from the source directory or
"python run_tests.py" from the Tests subdirectory at the terminal, and
copy and paste the interesting bits of the output.

If you want to capture the test output to a file, you should probably redirect
both stdout and stderr:

python run_tests.py &> output.txt
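
The same capture can be scripted instead of left to the shell (`&>` is a
bash shortcut; `> output.txt 2>&1` is the portable spelling). Below is a
minimal sketch in present-day Python 3, not the Python 2.6 used in this
thread. The helper name run_and_capture is hypothetical and not part of
Biopython; it also sets LANG=C, as suggested earlier in the thread.

```python
import os
import subprocess
import sys


def run_and_capture(cmd, log_path, cwd=None):
    """Run cmd with LANG=C, sending stdout and stderr to one log file.

    Hypothetical helper for illustration; returns the exit code.
    """
    # Copy the current environment and override only LANG.
    env = dict(os.environ, LANG="C")
    with open(log_path, "wb") as log:
        # stderr=STDOUT merges both streams into the same file,
        # which is what "&> output.txt" does in bash.
        return subprocess.call(
            cmd, stdout=log, stderr=subprocess.STDOUT, env=env, cwd=cwd
        )


# Example call (assumes it is started from the Biopython source directory):
# run_and_capture([sys.executable, "run_tests.py"], "output.txt", cwd="Tests")
```

Merging stderr into stdout matters here because the tracebacks and the
"ok"/"FAIL" lines may be written to different streams.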

Regards,

Peter


From bugzilla-daemon at portal.open-bio.org  Tue Jun 29 19:08:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 29 Jun 2010 15:08:45 -0400
Subject: [Biopython-dev] [Bug 3103] Possibly corrupt -
	ncbi_taxonomy_mollusca.xml.zip in Tests/PhyloXML
In-Reply-To: <bug-3103-42@http.bugzilla.open-bio.org/>
Message-ID: <201006291908.o5TJ8j66032031@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3103


vimalkumarvelayudhan at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #5 from vimalkumarvelayudhan at gmail.com  2010-06-29 15:08 EST -------
Thank you. RPMs have been packaged with the patches applied and can be found at
http://download.opensuse.org/repositories/science:/vlinux/




From k.okonechnikov at gmail.com  Wed Jun 30 03:27:20 2010
From: k.okonechnikov at gmail.com (Konstantin Okonechnikov)
Date: Wed, 30 Jun 2010 10:27:20 +0700
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
Message-ID: <AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>

Peter,
Actually, the problems with the Entrez tests are Unicode related.
I suppose the test failures are related to the current working directory
path: it contains a non-English word, so it cannot be represented as an
ASCII string.
There are similar problems with the GenBank to BioSQL tests.

Please see the attached error log.
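
To illustrate the kind of failure in the log, here is a small sketch in
present-day Python 3. The tests themselves ran under Python 2.6, where
joining a byte-string path with a unicode string presumably triggers an
implicit ASCII decode. The directory name below is a made-up stand-in,
not the real path (which the log shows only as question marks), and
first_non_ascii is a hypothetical helper.

```python
# Hypothetical stand-in for a working directory with a Cyrillic component;
# the escapes spell a six-letter Cyrillic word.
path_text = "/home/user/\u0440\u0430\u0431\u043e\u0442\u0430/biopython"
path_bytes = path_text.encode("utf-8")


def first_non_ascii(data):
    """Return (offset, byte) where an ASCII decode fails, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first undecodable byte.
        return err.start, data[err.start]
    return None


offset, byte = first_non_ascii(path_bytes)
print("ascii decode fails at offset %d on byte 0x%02x" % (offset, byte))
```

The stand-in path was chosen so that the first non-ASCII byte falls at the
same offset as in the traceback; that match is by construction and says
nothing about the actual directory name.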


-- 
Best regards,
        Konstantin
-------------- next part --------------
running test
test_Ace ... ok
test_AlignIO ... ok
test_AlignIO_convert ... ok
test_BioSQL ... FAIL
test_BioSQL_SeqIO ... /home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/Loader.py:797: UserWarning: order location operators are not fully supported
  % feature.location_operator)
/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/Loader.py:797: UserWarning: bond location operators are not fully supported
  % feature.location_operator)
ok
test_CAPS ... ok
test_Clustalw ... ok
test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you want to use Bio.Clustalw.
test_Cluster ... ok
test_CodonTable ... ok
test_CodonUsage ... ok
test_Compass ... ok
test_Crystal ... ok
test_Dialign_tool ... skipping. Install DIALIGN2-2 if you want to use the Bio.Align.Applications wrapper.
test_DocSQL ... skipping. Install MySQLdb if you want to use Bio.DocSQL.
test_Emboss ... skipping. Install EMBOSS if you want to use Bio.Emboss.
test_EmbossPhylipNew ... skipping. Install the Emboss package 'PhylipNew' if you want to use the Bio.Emboss.Applications wrappers for phylogenetic tools.
test_EmbossPrimer ... ok
test_Entrez ... FAIL
test_Enzyme ... ok
test_FSSP ... ok
test_Fasta ... ok
test_File ... ok
test_GACrossover ... ok
test_GAMutation ... ok
test_GAOrganism ... ok
test_GAQueens ... ok
test_GARepair ... ok
test_GASelection ... ok
test_GFF ... skipping. Environment is not configured for this test (not important if you do not plan to use Bio.GFF).
test_GFF2 ... skipping. Install MySQLdb if you want to use Bio.GFF.
test_GenBank ... ok
test_GenomeDiagram ... skipping. Install reportlab if you want to use Bio.Graphics.
test_GraphicsBitmaps ... skipping. Install ReportLab if you want to use Bio.Graphics.
test_GraphicsChromosome ... skipping. Install reportlab if you want to use Bio.Graphics.
test_GraphicsDistribution ... skipping. Install reportlab if you want to use Bio.Graphics.
test_GraphicsGeneral ... skipping. Install reportlab if you want to use Bio.Graphics.
test_HMMCasino ... ok
test_HMMGeneral ... ok
test_HotRand ... ok
test_IsoelectricPoint ... ok
test_KDTree ... ok
test_KEGG ... ok
test_KeyWList ... ok
test_Location ... ok
test_LocationParser ... ok
test_LogisticRegression ... ok
test_MEME ... ok
test_Mafft_tool ... skipping. Install MAFFT if you want to use the Bio.Align.Applications wrapper.
test_MarkovModel ... ok
test_Medline ... ok
test_Motif ... ok
test_Muscle_tool ... skipping. Install MUSCLE if you want to use the Bio.Align.Applications wrapper.
test_NCBIStandalone ... ok
test_NCBITextParser ... ok
test_NCBIXML ... ok
test_NCBI_BLAST_tools ... skipping. Install the NCBI BLAST+ command line tools if you want to use the Bio.Blast.Applications wrapper.
test_NCBI_qblast ... ok
test_NNExclusiveOr ... ok
test_NNGene ... ok
test_NNGeneral ... ok
test_Nexus ... ok
test_PDB ... ok
test_ParserSupport ... ok
test_Pathway ... ok
test_Phd ... ok
test_Phylo ... ok
test_PhyloXML ... ok
test_Phylo_depend ... skipping. Install NetworkX if you want to use Bio.Phylo._utils.
test_PopGen_FDist ... skipping. Install FDist if you want to use Bio.PopGen.FDist.
test_PopGen_FDist_nodepend ... ok
test_PopGen_GenePop ... skipping. Install GenePop if you want to use Bio.PopGen.GenePop.
test_PopGen_GenePop_EasyController ... skipping. Install GenePop if you want to use Bio.PopGen.GenePop.
test_PopGen_GenePop_nodepend ... ok
test_PopGen_SimCoal ... skipping. Install SIMCOAL2 if you want to use Bio.PopGen.SimCoal.
test_PopGen_SimCoal_nodepend ... ok
test_Prank_tool ... skipping. Install PRANK if you want to use the Bio.Align.Applications wrapper.
test_Probcons_tool ... skipping. Install PROBCONS if you want to use the Bio.Align.Applications wrapper.
test_ProtParam ... ok
test_Restriction ... ok
test_SCOP_Astral ... ok
test_SCOP_Cla ... ok
test_SCOP_Des ... ok
test_SCOP_Dom ... ok
test_SCOP_Hie ... ok
test_SCOP_Raf ... ok
test_SCOP_Residues ... ok
test_SCOP_Scop ... ok
test_SVDSuperimposer ... ok
test_SeqIO ... ok
test_SeqIO_FastaIO ... ok
test_SeqIO_QualityIO ... ok
test_SeqIO_convert ... ok
test_SeqIO_features ... ok
test_SeqIO_index ... ok
test_SeqIO_online ... ok
test_SeqRecord ... ok
test_SeqUtils ... ok
test_Seq_objs ... ok
test_SubsMat ... ok
test_SwissProt ... ok
test_TCoffee_tool ... skipping. Install TCOFFEE if you want to use the Bio.Align.Applications wrapper.
test_UniGene ... ok
test_UniGene_obsolete ... ok
test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise.
test_align ... ok
test_geo ... ok
test_interpro ... ok
test_kNN ... ok
test_lowess ... ok
test_pairwise2 ... ok
test_prodoc ... ok
test_property_manager ... ok
test_prosite1 ... ok
test_prosite2 ... ok
test_prosite_patterns ... ok
test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise.
test_seq ... ok
test_translate ... ok
test_trie ... ok
test_triefind ... ok
Bio.Application docstring test ... ok
Bio.Seq docstring test ... ok
Bio.SeqFeature docstring test ... ok
Bio.SeqRecord docstring test ... ok
Bio.SeqIO docstring test ... ok
Bio.SeqIO.AceIO docstring test ... ok
Bio.SeqIO.PhdIO docstring test ... ok
Bio.SeqIO.QualityIO docstring test ... ok
Bio.SeqIO.SffIO docstring test ... ok
Bio.SeqUtils docstring test ... ok
Bio.Align docstring test ... ok
Bio.Align.Generic docstring test ... ok
Bio.AlignIO docstring test ... ok
Bio.AlignIO.StockholmIO docstring test ... ok
Bio.Blast.Applications docstring test ... ok
Bio.Clustalw docstring test ... ok
Bio.Emboss.Applications docstring test ... ok
Bio.KEGG.Compound docstring test ... ok
Bio.KEGG.Enzyme docstring test ... ok
Bio.Wise docstring test ... FAIL
Bio.Wise.psw docstring test ... ok
Bio.Motif docstring test ... ok
Bio.Statistics.lowess docstring test ... ok
======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, NC_000932.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 423, in test_NC_000932
    self.loop(os.path.join(os.getcwd(), "GenBank", "NC_000932.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, NC_005816.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 419, in test_NC_005816
    self.loop(os.path.join(os.getcwd(), "GenBank", "NC_005816.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, NT_019265.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 427, in test_NT_019265
    self.loop(os.path.join(os.getcwd(), "GenBank", "NT_019265.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, arab1.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 447, in test_arab1
    self.loop(os.path.join(os.getcwd(), "GenBank", "arab1.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, cor6_6.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 443, in test_cor6_6
    self.loop(os.path.join(os.getcwd(), "GenBank", "cor6_6.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, noref.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 435, in test_no_ref
    self.loop(os.path.join(os.getcwd(), "GenBank", "noref.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, one_of.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 439, in test_one_of
    self.loop(os.path.join(os.getcwd(), "GenBank", "one_of.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL and back to a GenBank file, protein_refseq2.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 431, in test_protein_refseq2
    self.loop(os.path.join(os.getcwd(), "GenBank", "protein_refseq2.gb"), "gb")
  File "test_BioSQL.py", line 456, in loop
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, NC_000932.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 496, in test_NC_000932
    self.trans(os.path.join(os.getcwd(), "GenBank", "NC_000932.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, NC_005816.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 492, in test_NC_005816
    self.trans(os.path.join(os.getcwd(), "GenBank", "NC_005816.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, NT_019265.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 500, in test_NT_019265
    self.trans(os.path.join(os.getcwd(), "GenBank", "NT_019265.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, arab1.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 520, in test_arab1
    self.trans(os.path.join(os.getcwd(), "GenBank", "arab1.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, cor6_6.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 516, in test_cor6_6
    self.trans(os.path.join(os.getcwd(), "GenBank", "cor6_6.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, noref.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 508, in test_no_ref
    self.trans(os.path.join(os.getcwd(), "GenBank", "noref.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, one_of.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 512, in test_one_of
    self.trans(os.path.join(os.getcwd(), "GenBank", "one_of.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: GenBank file to BioSQL, then again to a new namespace, protein_refseq2.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 504, in test_protein_refseq2
    self.trans(os.path.join(os.getcwd(), "GenBank", "protein_refseq2.gb"), "gb")
  File "test_BioSQL.py", line 529, in trans
    db = server.new_database(db_name)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 142, in new_database
    self.adaptor.execute(sql, (db_name,authority, description))
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/BioSeqDatabase.py", line 336, in execute
    self.dbutils.execute(self.cursor, sql, args)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/BioSQL/DBUtils.py", line 45, in execute
    cursor.execute(sql.replace("%s", "?"), args or ())
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

======================================================================
ERROR: Test parsing XML returned by EFetch, Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3451, in test_journals
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, Nucleotide database (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3893, in test_nucleotide1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, Protein database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 4045, in test_nucleotide2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, OMIM database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3607, in test_omim
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, PubMed database (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3034, in test_pubmed1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, PubMed database (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3237, in test_pubmed2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EFetch, Taxonomy database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3784, in test_taxonomy
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML output returned by EGQuery (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2706, in test_egquery1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML output returned by EGQuery (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2858, in test_egquery2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing database list returned by EInfo
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 26, in test_list
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing database info returned by EInfo
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 72, in test_pubmed
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing cancerchromosomes links returned by ELink
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2690, in test_cancerchromosomes
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing medline indexed articles returned by ELink
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 1965, in test_medline
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing Nucleotide to Protein links returned by ELink
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 1239, in test_nucleotide
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 934, in test_pubmed1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 1253, in test_pubmed2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed link returned by ELink (third test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2404, in test_pubmed3
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (fourth test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2431, in test_pubmed4
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (fifth test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2499, in test_pubmed5
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing pubmed links returned by ELink (sixth test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 2669, in test_pubmed6
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EPost
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 535, in test_epost
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EPost with an invalid id (overflow tag)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 553, in test_invalid
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by EPost with incorrect arguments
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 545, in test_wrong
    self.assertRaises(RuntimeError, Entrez.read, handle)
  File "/usr/lib/python2.6/unittest.py", line 336, in failUnlessRaises
    callableObj(*args, **kwargs)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from the Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 322, in test_journals
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch when no items were found
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 502, in test_notfound
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from the Nucleotide database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 444, in test_nucleotide
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from PubMed Central
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 366, in test_pmc
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from the Protein database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 479, in test_protein
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from PubMed (first test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 107, in test_pubmed1
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from PubMed (second test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 136, in test_pubmed2
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESearch from PubMed (third test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 289, in test_pubmed3
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML output returned by ESpell
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 3013, in test_espell
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 653, in test_journals
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the Nucleotide database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 766, in test_nucleotide
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the Protein database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 727, in test_protein
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from PubMed
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 576, in test_pubmed
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the Structure database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 805, in test_structure
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the Taxonomy database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 855, in test_taxonomy
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary from the UniSTS database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 895, in test_unists
    record = Entrez.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
ERROR: Test parsing XML returned by ESummary with incorrect arguments
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 921, in test_wrong
    self.assertRaises(RuntimeError, Entrez.read, handle)
  File "/usr/lib/python2.6/unittest.py", line 336, in failUnlessRaises
    callableObj(*args, **kwargs)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/__init__.py", line 262, in read
    record = handler.read(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Entrez/Parser.py", line 348, in externalEntityRefHandler
    path = os.path.join(self.dtd_dir, filename)
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 11: ordinal not in range(128)

======================================================================
FAIL: Doctest: Bio.Wise._build_align_cmdline
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.6/doctest.py", line 2152, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for Bio.Wise._build_align_cmdline
  File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 23, in _build_align_cmdline

----------------------------------------------------------------------
File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 26, in Bio.Wise._build_align_cmdline
Failed example:
    _build_align_cmdline(["dnal"], ("seq1.fna", "seq2.fna"), "/tmp/output", kbyte=100000)
Expected:
    'dnal -kbyte 100000 seq1.fna seq2.fna > /tmp/output'
Got:
    'dnal -kbyte 100000 -quiet seq1.fna seq2.fna > /tmp/output'
----------------------------------------------------------------------
File "/home/okko/??????/biopython/build/lib.linux-i686-2.6/Bio/Wise/__init__.py", line 28, in Bio.Wise._build_align_cmdline
Failed example:
    _build_align_cmdline(["psw"], ("seq1.faa", "seq2.faa"), "/tmp/output_aa")
Expected:
    'psw -kbyte 300000 seq1.faa seq2.faa > /tmp/output_aa'
Got:
    'psw -kbyte 300000 -quiet seq1.faa seq2.faa > /tmp/output_aa'


----------------------------------------------------------------------
Ran 144 tests in 192.676 seconds

FAILED (failures = 3)

From biopython at maubp.freeserve.co.uk  Wed Jun 30 10:19:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 11:19:19 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
Message-ID: <AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>

On Wed, Jun 30, 2010 at 4:27 AM, Konstantin Okonechnikov
<k.okonechnikov at gmail.com> wrote:
> Peter,
> actually the problems with the Entrez tools are Unicode related.
> I suppose that the test failures are related to the current working dir
> path: it contains a non-English word, so it cannot be represented
> as an ASCII string.
> There are similar problems with the GenBank to SQL tests.
>
> Please see the attached error log.

Thank you for the error log. Yes, there do seem to be problems
with having the source code under a non-ASCII path. Could you
try moving the folder from /home/okko/??????/biopython to
/home/okko/biopython and repeat the test? That would help
confirm this hypothesis.
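
As a rough illustration of the hypothesis (this is not the Bio.Entrez code
itself): the traceback reports byte 0xd1 at position 11, which is exactly
where the directory name starts after the 11-character prefix /home/okko/.
The sketch below uses a made-up Cyrillic directory name, since the real one
only appears as question marks in the log, and it is written for Python 3,
where the ASCII decode has to be requested explicitly. Under Python 2.6 the
same decode is attempted implicitly when a byte string and a unicode string
are combined, which is presumably what happens inside os.path.join here.

```python
# Hypothetical directory name: the real one is garbled in the log.
# It starts with a Cyrillic letter whose UTF-8 encoding begins with byte 0xd1.
HYPOTHETICAL_DIR = "\u0440\u0430\u0431\u043e\u0442\u0430"
path_bytes = ("/home/okko/" + HYPOTHETICAL_DIR + "/biopython").encode("utf-8")


def decode_as_ascii(raw):
    """Return (True, text) on success or (False, error) on failure."""
    try:
        return True, raw.decode("ascii")
    except UnicodeDecodeError as err:
        return False, err


ok, detail = decode_as_ascii(path_bytes)
if not ok:
    # detail.start is the offset of the first byte that is not ASCII
    print("decode failed at position %d, byte 0x%02x"
          % (detail.start, path_bytes[detail.start]))

# A path made only of ASCII characters decodes without trouble.
ok_plain, _ = decode_as_ascii(b"/home/okko/biopython")
print("plain ASCII path decodes:", ok_plain)
```

If the hypothesis holds, moving the checkout to an ASCII-only path should
make these particular errors disappear.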

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 30 12:47:14 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 13:47:14 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
	<AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
Message-ID: <AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>

On Wed, Jun 30, 2010 at 11:19 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 30, 2010 at 4:27 AM, Konstantin Okonechnikov
> <k.okonechnikov at gmail.com> wrote:
>> Peter,
>> actually the problems with the Entrez tools are Unicode related.
>> I suppose that the test failures are related to the current working dir
>> path: it contains a non-English word, so it cannot be represented
>> as an ASCII string.
>> There are similar problems with the GenBank to SQL tests.
>>
>> Please see the attached error log.
>
> Thank you for the error log. Yes, there do seem to be problems
> with having the source code under a non-ASCII path. Could you
> try moving the folder from /home/okko/??????/biopython to
> /home/okko/biopython and repeat the test? That would help
> confirm this hypothesis.

I created a similar directory name on my (English) version of
Mac OS X, and got the same Entrez failure.

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 30 13:05:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 14:05:53 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
	<AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
	<AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>
Message-ID: <AANLkTinmc1pgGd3vdHh-zoeXJ4jCIQaP2RdlxhQyk7a-@mail.gmail.com>

On Wed, Jun 30, 2010 at 1:47 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> I created a similar directory name on my (English) version of
> Mac OS X, and got the same Entrez failure.
>

Hi Konstantin,

Could you retest using the latest code from github? I hope that now
test_Entrez.py will work for you.

Thanks,

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 30 13:31:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 14:31:58 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTinmc1pgGd3vdHh-zoeXJ4jCIQaP2RdlxhQyk7a-@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
	<AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
	<AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>
	<AANLkTinmc1pgGd3vdHh-zoeXJ4jCIQaP2RdlxhQyk7a-@mail.gmail.com>
Message-ID: <AANLkTimQz379gZrGxwwvpKOi_lspOn5dzKQFYbdUQfAF@mail.gmail.com>

On Wed, Jun 30, 2010 at 2:05 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 30, 2010 at 1:47 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>
>> I created a similar directory name on my (English) version of
>> Mac OS X, and got the same Entrez failure.
>>
>
> Hi Konstantin,
>
> Could you retest using the latest code from github? I hope that now
> test_Entrez.py will work for you.

The second update should fix test_BioSQL.py as well.

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 30 14:24:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 15:24:57 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTimXNStsgSo2zbBz3TGfWnxB_Dn-XZCraPMD4H6M@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
	<AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
	<AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>
	<AANLkTinmc1pgGd3vdHh-zoeXJ4jCIQaP2RdlxhQyk7a-@mail.gmail.com>
	<AANLkTimQz379gZrGxwwvpKOi_lspOn5dzKQFYbdUQfAF@mail.gmail.com>
	<AANLkTimXNStsgSo2zbBz3TGfWnxB_Dn-XZCraPMD4H6M@mail.gmail.com>
Message-ID: <AANLkTikVjsHaO3EZNJgB7wWZ6G3Yde-_AkQuxUNDNZQt@mail.gmail.com>

On Wed, Jun 30, 2010 at 2:59 PM, Konstantin Okonechnikov
<k.okonechnikov at gmail.com> wrote:
> The fixes work!
> Only one test fails, but it doesn't look related to non-English OS
> problems. I've attached the new test log.

Great :)

I hadn't done anything about the Bio.Wise docstring test failure yet,
but it isn't linked to the non-English OS at all. I'll start a new thread...

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Jun 30 15:22:16 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 30 Jun 2010 11:22:16 -0400
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing,
	in particular location parsing
In-Reply-To: <bug-2738-42@http.bugzilla.open-bio.org/>
Message-ID: <201006301522.o5UFMGvo028548@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk  2010-06-30 11:22 EST -------
I've merged my github branch into the master.

Marking as fixed.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython at maubp.freeserve.co.uk  Wed Jun 30 15:23:12 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 16:23:12 +0100
Subject: [Biopython-dev] Re-written GenBank/EMBL feature location parsing
In-Reply-To: <AANLkTikgejWghQbe4LJnx82u7sCEi2A911O3BIg6JijW@mail.gmail.com>
References: <AANLkTikgejWghQbe4LJnx82u7sCEi2A911O3BIg6JijW@mail.gmail.com>
Message-ID: <AANLkTimEfOqNcUA91D7hVZEdX0AaR5JjhIo0eWMvh1tV@mail.gmail.com>

On Fri, Jun 25, 2010 at 4:21 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I've been working on and off recently on rewriting the location
> parsing for GenBank/EMBL features:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2738
>
> I have a branch ready for public testing, ... Would anyone like
> to volunteer to test the new branch before I merge it to the trunk?

I've just merged it - testing and feedback still welcome of course.

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 30 14:38:59 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 15:38:59 +0100
Subject: [Biopython-dev] Running unit tests on non-English OS
In-Reply-To: <AANLkTikVjsHaO3EZNJgB7wWZ6G3Yde-_AkQuxUNDNZQt@mail.gmail.com>
References: <AANLkTinCo369OdHQRNULYgMUFBWqPkv6nENIfoLoX_lg@mail.gmail.com>
	<AANLkTinuZHLN-WjyNgj1HkzyKn-MLttASgxF5dqmQKWw@mail.gmail.com>
	<AANLkTinK3XuRt01-GfjNw45zNqAjXzVhFHT2Pyj-HVmH@mail.gmail.com>
	<AANLkTin9qFGyTce87cQ5imM0SFxwNnOJjEvEPX2d16tB@mail.gmail.com>
	<AANLkTimvkHrAXaSBhgZoKF710jEAGPLt2IZOT1M5nogE@mail.gmail.com>
	<AANLkTin57wgJmKehUbeB21KBu1GMcc-N4PlzfycQJ54t@mail.gmail.com>
	<AANLkTinOiSlYeKJM_HLlQq_bYtjTOQU5fMSVe7JIamqC@mail.gmail.com>
	<AANLkTinmc1pgGd3vdHh-zoeXJ4jCIQaP2RdlxhQyk7a-@mail.gmail.com>
	<AANLkTimQz379gZrGxwwvpKOi_lspOn5dzKQFYbdUQfAF@mail.gmail.com>
	<AANLkTimXNStsgSo2zbBz3TGfWnxB_Dn-XZCraPMD4H6M@mail.gmail.com>
	<AANLkTikVjsHaO3EZNJgB7wWZ6G3Yde-_AkQuxUNDNZQt@mail.gmail.com>
Message-ID: <AANLkTinxk2s14jnD1C9jAnqFvrZQZjdvicd7T2p027yf@mail.gmail.com>

On Wed, Jun 30, 2010 at 3:24 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Jun 30, 2010 at 2:59 PM, Konstantin Okonechnikov
> <k.okonechnikov at gmail.com> wrote:
>> The fixes work!
>> Only one test fails, but it doesn't look related to non-English OS
>> problems. I've attached the new test log.
>
> Great :)
>
> I hadn't done anything about the Bio.Wise docstring test failure yet,
> but it isn't linked to the non-English OS at all. I'll start a new thread...
>

Solved. The doctest was working UNLESS the test output was
being sent to a file.
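
To illustrate how that can happen: the two failed examples in the log differ
from the expected output only by an extra -quiet flag. A minimal sketch,
assuming the command builder appends -quiet whenever stderr is not attached
to a terminal (build_cmdline and stderr_is_terminal below are hypothetical
stand-ins, not the Bio.Wise code), shows why redirecting the test output
would change the result:

```python
import os
import sys


def stderr_is_terminal():
    """True if stderr is attached to a terminal, False if redirected."""
    try:
        return os.isatty(sys.stderr.fileno())
    except (AttributeError, ValueError, OSError):
        # stderr was replaced by an object without a usable file descriptor
        return False


def build_cmdline(cmdline, pair, output_filename, kbyte=None, quiet=None):
    """Hypothetical stand-in for a command-line builder (not Bio.Wise)."""
    args = list(cmdline)
    if kbyte is not None:
        args.extend(("-kbyte", str(kbyte)))
    if quiet is None:
        # Assumed behaviour: suppress progress output when not on a terminal
        quiet = not stderr_is_terminal()
    if quiet:
        args.append("-quiet")
    args.extend(pair)
    args.extend((">", output_filename))
    return " ".join(args)


# Passing quiet explicitly makes the result independent of the environment.
on_terminal = build_cmdline(["dnal"], ("seq1.fna", "seq2.fna"),
                            "/tmp/output", kbyte=100000, quiet=False)
redirected = build_cmdline(["dnal"], ("seq1.fna", "seq2.fna"),
                           "/tmp/output", kbyte=100000, quiet=True)
print(on_terminal)
print(redirected)
```

With an explicit argument like this, a doctest can pin the expected string
regardless of where the test output is sent.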

Peter